Posted to user@kudu.apache.org by Sand Stone <sa...@gmail.com> on 2016/05/06 21:00:46 UTC

Partition and Split rows

Hi, I am new to Kudu. I wonder how split rows work. I know from some docs
that they are currently specified when the table is pre-created. I am
researching how to partition (hash+range) some time series test data.

Is there an example, or notes somewhere I could read up on?

Thanks much.

Re: Partition and Split rows

Posted by Sand Stone <sa...@gmail.com>.
Cool, I will study it.

On Thu, May 12, 2016 at 2:17 PM, Dan Burkert <da...@cloudera.com> wrote:

>
>
> On Thu, May 12, 2016 at 2:04 PM, Sand Stone <sa...@gmail.com>
> wrote:
>
> >Instead, take advantage of the index capability of Primary Keys.
>> Currently I did make the "5-min" field a part of the primary key as well.
>> I am most likely overdoing it. I will play around with the schema and use
>> cases around it.
>>
>
> Definitely take a look at the data model in Kudu TS; it has extremely
> efficient scan semantics (all scans retrieve only the necessary data by
> using the primary key, no client *or* server side filtering), and it works
> with arbitrarily large time range partitions.
>
>
>> >since each tablet server should only have on the order of 10-20 tablets.
>> Where does this 10-20 heuristic come from? Is it based on a certain
>> machine profile, or on default parameters in the code/config?
>>
>
> 10-20 isn't a hard and fast rule, and it is very dependent on the dataset
> and the machine size.  We routinely run tablet servers with 300+ tablets,
> so it's definitely not a hard limitation.  My intention was to stress that
> instead of pursuing fine grained partitions as a method of limiting the
> size of scans, you should take advantage of the primary key indexing that Kudu
> provides, just as you might in a traditional relational database.
> Partitions are better suited for ensuring that inserts and large scans can
> be parallelized across multiple machines.
>
> - Dan
>
>
>>
>> On Thu, May 12, 2016 at 1:45 PM, Dan Burkert <da...@cloudera.com> wrote:
>>
>>>
>>> On Thu, May 12, 2016 at 11:39 AM, Sand Stone <sa...@gmail.com>
>>> wrote:
>>>
>>> I don't know how Kudu load balances the data across the tablet servers.
>>>>
>>>
>>> Individual tablets are replicated and balanced across all available
>>> tablet servers; for more on that, see
>>> http://getkudu.io/docs/schema_design.html#data-distribution.
>>>
>>>
>>>
>>>> For example, do I need to pre-calculate, at table creation, a list of
>>>> timestamps 5 minutes apart for every day? [assume I have to create a new
>>>> table every day].
>>>>
>>>
>>> If you wish to range partition on the time column, then yes, currently
>>> you must specify the splits upfront during table creation (but this will
>>> change with the non-covering range partitions work).
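To make "specify the splits upfront" concrete, the pre-computation discussed above can be sketched in plain Python. This is just the date arithmetic, not the Kudu client API; the function name is made up for illustration:

```python
from datetime import datetime, timedelta

def five_minute_splits(day: datetime) -> list:
    """Split points covering one day at 5-minute intervals (24h / 5min = 288)."""
    start = day.replace(hour=0, minute=0, second=0, microsecond=0)
    return [start + timedelta(minutes=5 * i) for i in range(288)]

splits = five_minute_splits(datetime(2016, 5, 6))
print(len(splits))   # 288
print(splits[1])     # 2016-05-06 00:05:00
```

Each generated timestamp would then become one split row supplied at table creation time.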
>>>
>>>
>>>>
>>>> My hope, by adding the 5-min column and using it as the range
>>>> partition column, is that I could spread the data evenly across the
>>>> tablet servers.
>>>>
>>>
>>> I don't think this is meaningfully different than range partitioning on
>>> the full time column with splits every 5 minutes.
>>>
>>>
>>>> Also, since 5-min interval data are always colocated, the read
>>>> query could be efficient too.
>>>>
>>>
>>> Data colocation is a function of the partitioning and indexing.  As I
>>> mentioned before, if you have timestamp as part of your primary key then
>>> you can guarantee that scans specifying a time range are efficient. Overall
>>> it sounds like you are attempting to get fast scans by creating many fine
>>> grained partitions, as you might with Parquet.  This won't be an efficient
>>> strategy in Kudu, since each tablet server should only have on the order of
>>> 10-20 tablets.  Instead, take advantage of the index capability of Primary
>>> Keys.
>>>
>>> - Dan
>>>
>>>
>>>> On Thu, May 12, 2016 at 11:13 AM, Dan Burkert <da...@cloudera.com> wrote:
>>>>
>>>>> Forgot to add the PK specification to the CREATE TABLE, it should have
>>>>> read as follows:
>>>>>
>>>>> CREATE TABLE metrics (metric STRING, time TIMESTAMP, value DOUBLE)
>>>>> PRIMARY KEY (metric, time);
>>>>>
>>>>> - Dan
>>>>>
>>>>>
>>>>> On Thu, May 12, 2016 at 11:12 AM, Dan Burkert <da...@cloudera.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> On Thu, May 12, 2016 at 11:05 AM, Sand Stone <sa...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> > Is the requirement to pre-aggregate by time window?
>>>>>>> No, I am thinking of creating a column, say "minute". It's basically
>>>>>>> the minute field of the timestamp column (possibly rounded to a 5-min
>>>>>>> bucket depending on the needs). So it's a computed column filled in on
>>>>>>> data ingestion. My goal is that this field would help with data filtering
>>>>>>> at read/query time, say selecting a certain projection at minutes 10-15,
>>>>>>> to speed up the read queries.
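The computed column described here is essentially a truncation of the timestamp's minute field. A minimal sketch in plain Python (independent of Kudu; `five_min_bar` is a hypothetical name), producing a value that fits the INT8 mentioned elsewhere in the thread:

```python
from datetime import datetime

def five_min_bar(ts: datetime) -> int:
    """Minute-of-hour rounded down to its 5-minute bar: 0, 5, ..., 55."""
    return ts.minute - ts.minute % 5

print(five_min_bar(datetime(2016, 5, 6, 14, 12, 31)))  # 10
print(five_min_bar(datetime(2016, 5, 6, 14, 59, 59)))  # 55
```

A predicate like "minutes 10-15" then becomes a simple equality or range filter on this small integer column.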
>>>>>>>
>>>>>>
>>>>>> In many cases, Kudu can do this for you without having to add special
>>>>>> columns.  The requirements are that the timestamp is part of the primary
>>>>>> key, and any columns that come before the timestamp in the primary key (if
>>>>>> it's a compound PK), have equality predicates.  So for instance, if you
>>>>>> create a table such as:
>>>>>>
>>>>>> CREATE TABLE metrics (metric STRING, time TIMESTAMP, value DOUBLE);
>>>>>>
>>>>>> then queries such as
>>>>>>
>>>>>> SELECT time, value FROM metrics WHERE metric = "my-metric" AND time >
>>>>>> 2016-05-01T00:00 AND time < 2016-05-01T00:05
>>>>>>
>>>>>> will read only the data for that 5 minute time window from
>>>>>> disk.  If the query didn't have the equality predicate on the 'metric'
>>>>>> column, then it would do a much bigger scan + filter operation.  If you
>>>>>> want more background on how this is achieved, check out the partition
>>>>>> pruning design doc:
>>>>>> https://github.com/apache/incubator-kudu/blob/master/docs/design-docs/scan-optimization-partition-pruning.md
>>>>>> .
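The pruning described above can be illustrated with a toy stand-in for a tablet sorted by the compound primary key (metric, time): an equality predicate on the leading column plus bounds on time selects a single contiguous key range, so nothing else needs to be read. This is plain Python, not Kudu code:

```python
import bisect

# Toy stand-in for rows sorted by the compound primary key (metric, time).
rows = sorted([("cpu", t) for t in range(0, 100, 10)] +
              [("mem", t) for t in range(0, 100, 10)])

def scan(metric, lo, hi):
    """Equality on `metric` plus exclusive time bounds => one contiguous slice."""
    start = bisect.bisect_right(rows, (metric, lo))
    end = bisect.bisect_left(rows, (metric, hi))
    return rows[start:end]

print(scan("cpu", 0, 50))  # [('cpu', 10), ('cpu', 20), ('cpu', 30), ('cpu', 40)]
```

Without the equality on `metric`, the matching rows are no longer contiguous in key order, which is why the query degrades to a much bigger scan + filter.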
>>>>>>
>>>>>> - Dan
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Thanks for the info; I will follow up on it.
>>>>>>>
>>>>>>> On Thu, May 12, 2016 at 10:50 AM, Dan Burkert <da...@cloudera.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey Sand,
>>>>>>>>
>>>>>>>> Sorry for the delayed response.  I'm not quite following your use
>>>>>>>> case.  Is the requirement to pre-aggregate by time window? I don't think
>>>>>>>> Kudu can help you directly with that (nothing built in), but you could
>>>>>>>> always create a separate table to store the pre-aggregated values.  As far
>>>>>>>> as applying functions to do row splits, that is an interesting idea, but I
>>>>>>>> think once Kudu has support for range bounds (the non-covering range
>>>>>>>> partition design doc linked above), you can simply create the bounds where
>>>>>>>> the function would have put them.  For example, if you want a partition for
>>>>>>>> every five minutes, you can create the bounds accordingly.
>>>>>>>>
>>>>>>>> Earlier this week I gave a talk on timeseries in Kudu, I've
>>>>>>>> included some slides that may be interesting to you.  Additionally, you may
>>>>>>>> want to check out https://github.com/danburkert/kudu-ts, it's a
>>>>>>>> very young  (not feature complete) metrics layer on top of Kudu, it may
>>>>>>>> give you some ideas.
>>>>>>>>
>>>>>>>> - Dan
>>>>>>>>
>>>>>>>> On Sat, May 7, 2016 at 1:28 PM, Sand Stone <sa...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks for sharing, Dan. The diagrams explained clearly how the
>>>>>>>>> current system works.
>>>>>>>>>
>>>>>>>>> As for what I have in mind: take the schema of
>>>>>>>>> <host,metric,time,...>, say, I am interested in data for the past 5 mins,
>>>>>>>>> 10 mins, etc. Or, aggregate at 5 mins interval for the past 3 days, 7 days,
>>>>>>>>> ... Looks like I need to introduce a special 5-min bar column, use that
>>>>>>>>> column to do range partition to spread data across the tablet servers so
>>>>>>>>> that I could leverage parallel filtering.
>>>>>>>>>
>>>>>>>>> The cost of this extra column (INT8) is not ideal but not too bad
>>>>>>>>> either (storage cost wise, compression should do wonders). So I am thinking
>>>>>>>>> whether it would be better to take "functions" as row split instead of only
>>>>>>>>> constants. Of course, if the business requires dropping down to a 1-min bar, the
>>>>>>>>> data has to be re-sharded again. So a more cost effective way of doing this
>>>>>>>>> on a production cluster would be good.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 7, 2016 at 8:50 AM, Dan Burkert <da...@cloudera.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Sand,
>>>>>>>>>>
>>>>>>>>>> I've been working on some diagrams to help explain some of the
>>>>>>>>>> more advanced partitioning types, it's attached.   Still pretty rough at
>>>>>>>>>> this point, but the goal is to clean it up and move it into the Kudu
>>>>>>>>>> documentation proper.  I'm interested to hear what kind of time series you
>>>>>>>>>> are interested in Kudu for.  I'm tasked with improving Kudu for time
>>>>>>>>>> series, you can follow progress here
>>>>>>>>>> <https://issues.apache.org/jira/browse/KUDU-1306>. If you have
>>>>>>>>>> any additional ideas I'd love to hear them.  You may also be interested in
>>>>>>>>>> a small project that J-D and I have been working on in the past week to
>>>>>>>>>> build an OpenTSDB style store on top of Kudu, you can find it
>>>>>>>>>> here <https://github.com/danburkert/kudu-ts>.  Still quite
>>>>>>>>>> feature limited at this point.
>>>>>>>>>>
>>>>>>>>>> - Dan
>>>>>>>>>>
>>>>>>>>>> On Fri, May 6, 2016 at 4:51 PM, Sand Stone <
>>>>>>>>>> sand.m.stone@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks. Will read.
>>>>>>>>>>>
>>>>>>>>>>> Given that I am researching time series data, row locality is
>>>>>>>>>>> crucial :-)
>>>>>>>>>>>
>>>>>>>>>>> On Fri, May 6, 2016 at 3:57 PM, Jean-Daniel Cryans <
>>>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> We do have non-covering range partitions coming in the next few
>>>>>>>>>>>> months, here's the design (in review):
>>>>>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2772/9/docs/design-docs/non-covering-range-partitions.md
>>>>>>>>>>>>
>>>>>>>>>>>> The "Background & Motivation" section should give you a good
>>>>>>>>>>>> idea of why I'm mentioning this.
>>>>>>>>>>>>
>>>>>>>>>>>> Meanwhile, if you don't need row locality, using hash
>>>>>>>>>>>> partitioning could be good enough.
>>>>>>>>>>>>
>>>>>>>>>>>> J-D
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, May 6, 2016 at 3:53 PM, Sand Stone <
>>>>>>>>>>>> sand.m.stone@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Makes sense.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yeah it would be cool if users could specify/control the split
>>>>>>>>>>>>> rows after the table is created. Now, I have to "think ahead" to pre-create
>>>>>>>>>>>>> the range buckets.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, May 6, 2016 at 3:49 PM, Jean-Daniel Cryans <
>>>>>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> You will only get 1 tablet and no data distribution, which is
>>>>>>>>>>>>>> bad.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That's also how HBase works, but it will split regions as you
>>>>>>>>>>>>>> insert data and eventually you'll get some data distribution even if it
>>>>>>>>>>>>>> doesn't start in an ideal situation. Tablet splitting will come later for
>>>>>>>>>>>>>> Kudu.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, May 6, 2016 at 3:42 PM, Sand Stone <
>>>>>>>>>>>>>> sand.m.stone@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> One more question: how does range partitioning work if I
>>>>>>>>>>>>>>> don't specify the split rows?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, May 6, 2016 at 3:37 PM, Sand Stone <
>>>>>>>>>>>>>>> sand.m.stone@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks, Misty. The "advanced" impala example helped.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I was just reading the Java API, CreateTableOptions.java;
>>>>>>>>>>>>>>>> it's unclear how the range partition column names are associated with the
>>>>>>>>>>>>>>>> partial row params in the addSplitRow API.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, May 6, 2016 at 3:08 PM, Misty Stanley-Jones <
>>>>>>>>>>>>>>>> mstanleyjones@cloudera.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Sand,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Please have a look at
>>>>>>>>>>>>>>>>> http://getkudu.io/docs/kudu_impala_integration.html#partitioning_tables
>>>>>>>>>>>>>>>>> and see if it is helpful to you.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Misty
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, May 6, 2016 at 2:00 PM, Sand Stone <
>>>>>>>>>>>>>>>>> sand.m.stone@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi, I am new to Kudu. I wonder how the split rows work. I
>>>>>>>>>>>>>>>>>> know from some docs, this is currently for pre-creation the table. I am
>>>>>>>>>>>>>>>>>> researching how to partition (hash+range) some time series test data.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Is there an example? or notes somewhere I could read
>>>>>>>>>>>>>>>>>> upon.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks much.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Partition and Split rows

Posted by Dan Burkert <da...@cloudera.com>.
On Thu, May 12, 2016 at 2:04 PM, Sand Stone <sa...@gmail.com> wrote:

>Instead, take advantage of the index capability of Primary Keys.
> Currently I did make the "5-min" field a part of the primary key as well.
> I am most likely overdoing it. I will play around with the schema and use
> cases around it.
>

Definitely take a look at the data model in Kudu TS; it has extremely
efficient scan semantics (all scans retrieve only the necessary data by
using the primary key, no client *or* server side filtering), and it works
with arbitrarily large time range partitions.


> >since each tablet server should only have on the order of 10-20 tablets.
> Where does this 10-20 heuristic come from? Is it based on a certain
> machine profile, or on default parameters in the code/config?
>

10-20 isn't a hard and fast rule, and it is very dependent on the dataset
and the machine size.  We routinely run tablet servers with 300+ tablets,
so it's definitely not a hard limitation.  My intention was to stress that
instead of pursuing fine grained partitions as a method of limiting the
size of scans, you should take advantage of the primary key indexing that Kudu
provides, just as you might in a traditional relational database.
Partitions are better suited for ensuring that inserts and large scans can
be parallelized across multiple machines.
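As a rough mental model of why hash+range suits this workload (a hypothetical helper, not Kudu's actual placement logic): the hash bucket spreads concurrent inserts across tablets, while the range index keeps time-adjacent rows together for large sequential scans:

```python
import bisect
import zlib

HASH_BUCKETS = 4                 # illustrative: 4 hash buckets on `metric`
RANGE_SPLITS = [100, 200, 300]   # illustrative split values for the time column

def tablet_for(metric: str, time: int) -> tuple:
    """Map a row to a (hash bucket, range partition index) pair."""
    bucket = zlib.crc32(metric.encode()) % HASH_BUCKETS
    range_idx = bisect.bisect_right(RANGE_SPLITS, time)
    return bucket, range_idx

print(tablet_for("cpu.load", 150)[1])  # 1, since 100 <= 150 < 200
```

All rows for one metric share a hash bucket, and within it a bounded time range lands on one range partition, so a time-bounded scan touches few tablets while inserts for many metrics fan out across machines.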

- Dan


>
> On Thu, May 12, 2016 at 1:45 PM, Dan Burkert <da...@cloudera.com> wrote:
>
>>
>> On Thu, May 12, 2016 at 11:39 AM, Sand Stone <sa...@gmail.com>
>> wrote:
>>
>> I don't know how Kudu load balances the data across the tablet servers.
>>>
>>
>> Individual tablets are replicated and balanced across all available
>> tablet servers; for more on that, see
>> http://getkudu.io/docs/schema_design.html#data-distribution.
>>
>>
>>
>>> For example, do I need to pre-calculate, at table creation, a list of
>>> timestamps 5 minutes apart for every day? [assume I have to create a new
>>> table every day].
>>>
>>
>> If you wish to range partition on the time column, then yes, currently
>> you must specify the splits upfront during table creation (but this will
>> change with the non-covering range partitions work).
>>
>>
>>>
>>> My hope, by adding the 5-min column and using it as the range
>>> partition column, is that I could spread the data evenly across the
>>> tablet servers.
>>>
>>
>> I don't think this is meaningfully different than range partitioning on
>> the full time column with splits every 5 minutes.
>>
>>
>>> Also, since 5-min interval data are always colocated, the read
>>> query could be efficient too.
>>>
>>
>> Data colocation is a function of the partitioning and indexing.  As I
>> mentioned before, if you have timestamp as part of your primary key then
>> you can guarantee that scans specifying a time range are efficient. Overall
>> it sounds like you are attempting to get fast scans by creating many fine
>> grained partitions, as you might with Parquet.  This won't be an efficient
>> strategy in Kudu, since each tablet server should only have on the order of
>> 10-20 tablets.  Instead, take advantage of the index capability of Primary
>> Keys.
>>
>> - Dan
>>
>>
>>> On Thu, May 12, 2016 at 11:13 AM, Dan Burkert <da...@cloudera.com> wrote:
>>>
>>>> Forgot to add the PK specification to the CREATE TABLE, it should have
>>>> read as follows:
>>>>
>>>> CREATE TABLE metrics (metric STRING, time TIMESTAMP, value DOUBLE)
>>>> PRIMARY KEY (metric, time);
>>>>
>>>> - Dan
>>>>
>>>>
>>>> On Thu, May 12, 2016 at 11:12 AM, Dan Burkert <da...@cloudera.com> wrote:
>>>>
>>>>>
>>>>> On Thu, May 12, 2016 at 11:05 AM, Sand Stone <sa...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> > Is the requirement to pre-aggregate by time window?
>>>>>> No, I am thinking of creating a column, say "minute". It's basically
>>>>>> the minute field of the timestamp column (possibly rounded to a 5-min
>>>>>> bucket depending on the needs). So it's a computed column filled in on
>>>>>> data ingestion. My goal is that this field would help with data filtering
>>>>>> at read/query time, say selecting a certain projection at minutes 10-15,
>>>>>> to speed up the read queries.
>>>>>>
>>>>>
>>>>> In many cases, Kudu can do this for you without having to add special
>>>>> columns.  The requirements are that the timestamp is part of the primary
>>>>> key, and any columns that come before the timestamp in the primary key (if
>>>>> it's a compound PK), have equality predicates.  So for instance, if you
>>>>> create a table such as:
>>>>>
>>>>> CREATE TABLE metrics (metric STRING, time TIMESTAMP, value DOUBLE);
>>>>>
>>>>> then queries such as
>>>>>
>>>>> SELECT time, value FROM metrics WHERE metric = "my-metric" AND time >
>>>>> 2016-05-01T00:00 AND time < 2016-05-01T00:05
>>>>>
>>>>> will read only the data for that 5 minute time window from
>>>>> disk.  If the query didn't have the equality predicate on the 'metric'
>>>>> column, then it would do a much bigger scan + filter operation.  If you
>>>>> want more background on how this is achieved, check out the partition
>>>>> pruning design doc:
>>>>> https://github.com/apache/incubator-kudu/blob/master/docs/design-docs/scan-optimization-partition-pruning.md
>>>>> .
>>>>>
>>>>> - Dan
>>>>>
>>>>>
>>>>>
>>>>>> Thanks for the info; I will follow up on it.
>>>>>>
>>>>>> On Thu, May 12, 2016 at 10:50 AM, Dan Burkert <da...@cloudera.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey Sand,
>>>>>>>
>>>>>>> Sorry for the delayed response.  I'm not quite following your use
>>>>>>> case.  Is the requirement to pre-aggregate by time window? I don't think
>>>>>>> Kudu can help you directly with that (nothing built in), but you could
>>>>>>> always create a separate table to store the pre-aggregated values.  As far
>>>>>>> as applying functions to do row splits, that is an interesting idea, but I
>>>>>>> think once Kudu has support for range bounds (the non-covering range
>>>>>>> partition design doc linked above), you can simply create the bounds where
>>>>>>> the function would have put them.  For example, if you want a partition for
>>>>>>> every five minutes, you can create the bounds accordingly.
>>>>>>>
>>>>>>> Earlier this week I gave a talk on timeseries in Kudu, I've included
>>>>>>> some slides that may be interesting to you.  Additionally, you may want to
>>>>>>> check out https://github.com/danburkert/kudu-ts, it's a very young
>>>>>>>  (not feature complete) metrics layer on top of Kudu, it may give you some
>>>>>>> ideas.
>>>>>>>
>>>>>>> - Dan
>>>>>>>
>>>>>>> On Sat, May 7, 2016 at 1:28 PM, Sand Stone <sa...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks for sharing, Dan. The diagrams explained clearly how the
>>>>>>>> current system works.
>>>>>>>>
>>>>>>>> As for what I have in mind: take the schema of <host,metric,time,...>,
>>>>>>>> <host,metric,time,...>, say, I am interested in data for the past 5 mins,
>>>>>>>> 10 mins, etc. Or, aggregate at 5 mins interval for the past 3 days, 7 days,
>>>>>>>> ... Looks like I need to introduce a special 5-min bar column, use that
>>>>>>>> column to do range partition to spread data across the tablet servers so
>>>>>>>> that I could leverage parallel filtering.
>>>>>>>>
>>>>>>>> The cost of this extra column (INT8) is not ideal but not too bad
>>>>>>>> either (storage cost wise, compression should do wonders). So I am thinking
>>>>>>>> whether it would be better to take "functions" as row split instead of only
>>>>>>>> constants. Of course, if the business requires dropping down to a 1-min bar, the
>>>>>>>> data has to be re-sharded again. So a more cost effective way of doing this
>>>>>>>> on a production cluster would be good.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 7, 2016 at 8:50 AM, Dan Burkert <da...@cloudera.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Sand,
>>>>>>>>>
>>>>>>>>> I've been working on some diagrams to help explain some of the
>>>>>>>>> more advanced partitioning types, it's attached.   Still pretty rough at
>>>>>>>>> this point, but the goal is to clean it up and move it into the Kudu
>>>>>>>>> documentation proper.  I'm interested to hear what kind of time series you
>>>>>>>>> are interested in Kudu for.  I'm tasked with improving Kudu for time
>>>>>>>>> series, you can follow progress here
>>>>>>>>> <https://issues.apache.org/jira/browse/KUDU-1306>. If you have
>>>>>>>>> any additional ideas I'd love to hear them.  You may also be interested in
>>>>>>>>> a small project that J-D and I have been working on in the past week to
>>>>>>>>> build an OpenTSDB style store on top of Kudu, you can find it here
>>>>>>>>> <https://github.com/danburkert/kudu-ts>.  Still quite feature
>>>>>>>>> limited at this point.
>>>>>>>>>
>>>>>>>>> - Dan
>>>>>>>>>
>>>>>>>>> On Fri, May 6, 2016 at 4:51 PM, Sand Stone <sand.m.stone@gmail.com
>>>>>>>>> > wrote:
>>>>>>>>>
>>>>>>>>>> Thanks. Will read.
>>>>>>>>>>
>>>>>>>>>> Given that I am researching time series data, row locality is
>>>>>>>>>> crucial :-)
>>>>>>>>>>
>>>>>>>>>> On Fri, May 6, 2016 at 3:57 PM, Jean-Daniel Cryans <
>>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> We do have non-covering range partitions coming in the next few
>>>>>>>>>>> months, here's the design (in review):
>>>>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2772/9/docs/design-docs/non-covering-range-partitions.md
>>>>>>>>>>>
>>>>>>>>>>> The "Background & Motivation" section should give you a good
>>>>>>>>>>> idea of why I'm mentioning this.
>>>>>>>>>>>
>>>>>>>>>>> Meanwhile, if you don't need row locality, using hash
>>>>>>>>>>> partitioning could be good enough.
>>>>>>>>>>>
>>>>>>>>>>> J-D
>>>>>>>>>>>
>>>>>>>>>>> On Fri, May 6, 2016 at 3:53 PM, Sand Stone <
>>>>>>>>>>> sand.m.stone@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Makes sense.
>>>>>>>>>>>>
>>>>>>>>>>>> Yeah it would be cool if users could specify/control the split
>>>>>>>>>>>> rows after the table is created. Now, I have to "think ahead" to pre-create
>>>>>>>>>>>> the range buckets.
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, May 6, 2016 at 3:49 PM, Jean-Daniel Cryans <
>>>>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> You will only get 1 tablet and no data distribution, which is
>>>>>>>>>>>>> bad.
>>>>>>>>>>>>>
>>>>>>>>>>>>> That's also how HBase works, but it will split regions as you
>>>>>>>>>>>>> insert data and eventually you'll get some data distribution even if it
>>>>>>>>>>>>> doesn't start in an ideal situation. Tablet splitting will come later for
>>>>>>>>>>>>> Kudu.
>>>>>>>>>>>>>
>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, May 6, 2016 at 3:42 PM, Sand Stone <
>>>>>>>>>>>>> sand.m.stone@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> One more question: how does range partitioning work if I
>>>>>>>>>>>>>> don't specify the split rows?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, May 6, 2016 at 3:37 PM, Sand Stone <
>>>>>>>>>>>>>> sand.m.stone@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks, Misty. The "advanced" impala example helped.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I was just reading the Java API,CreateTableOptions.java,
>>>>>>>>>>>>>>> it's unclear how the range partition column names are associated with the
>>>>>>>>>>>>>>> partial rows params in the addSplitRow API.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, May 6, 2016 at 3:08 PM, Misty Stanley-Jones <
>>>>>>>>>>>>>>> mstanleyjones@cloudera.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Sand,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please have a look at
>>>>>>>>>>>>>>>> http://getkudu.io/docs/kudu_impala_integration.html#partitioning_tables
>>>>>>>>>>>>>>>> and see if it is helpful to you.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Misty
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, May 6, 2016 at 2:00 PM, Sand Stone <
>>>>>>>>>>>>>>>> sand.m.stone@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi, I am new to Kudu. I wonder how the split rows work. I
>>>>>>>>>>>>>>>>> know from some docs, this is currently for pre-creation the table. I am
>>>>>>>>>>>>>>>>> researching how to partition (hash+range) some time series test data.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Is there an example? or notes somewhere I could read upon.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks much.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Partition and Split rows

Posted by Sand Stone <sa...@gmail.com>.
Thanks for the advice, Dan.

>Instead, take advantage of the index capability of Primary Keys.
Currently I did make the "5-min" field a part of the primary key as well. I
am most likely overdoing it. I will play around with the schema and use
cases around it.

>since each tablet server should only have on the order of 10-20 tablets.
Where does this 10-20 heuristic come from? Is it based on a certain machine
profile, or on default parameters in the code/config?





On Thu, May 12, 2016 at 1:45 PM, Dan Burkert <da...@cloudera.com> wrote:

>
> On Thu, May 12, 2016 at 11:39 AM, Sand Stone <sa...@gmail.com>
> wrote:
>
> I don't know how Kudu load balances the data across the tablet servers.
>>
>
> Individual tablets are replicated and balanced across all available tablet
> servers; for more on that, see
> http://getkudu.io/docs/schema_design.html#data-distribution.
>
>
>
>> For example, do I need to pre-calculate, at table creation, a list of
>> timestamps 5 minutes apart for every day? [assume I have to create a new
>> table every day].
>>
>
> If you wish to range partition on the time column, then yes, currently you
> must specify the splits upfront during table creation (but this will change
> with the non-covering range partitions work).
>
>
>>
>> My hope, by adding the 5-min column and using it as the range
>> partition column, is that I could spread the data evenly across the
>> tablet servers.
>>
>
> I don't think this is meaningfully different than range partitioning on
> the full time column with splits every 5 minutes.
>
>
>> Also, since 5-min interval data are always colocated, the read
>> query could be efficient too.
>>
>
> Data colocation is a function of the partitioning and indexing.  As I
> mentioned before, if you have timestamp as part of your primary key then
> you can guarantee that scans specifying a time range are efficient. Overall
> it sounds like you are attempting to get fast scans by creating many fine
> grained partitions, as you might with Parquet.  This won't be an efficient
> strategy in Kudu, since each tablet server should only have on the order of
> 10-20 tablets.  Instead, take advantage of the index capability of Primary
> Keys.
>
> - Dan
>
>
>> On Thu, May 12, 2016 at 11:13 AM, Dan Burkert <da...@cloudera.com> wrote:
>>
>>> Forgot to add the PK specification to the CREATE TABLE, it should have
>>> read as follows:
>>>
>>> CREATE TABLE metrics (metric STRING, time TIMESTAMP, value DOUBLE)
>>> PRIMARY KEY (metric, time);
>>>
>>> - Dan
>>>
>>>
>>> On Thu, May 12, 2016 at 11:12 AM, Dan Burkert <da...@cloudera.com> wrote:
>>>
>>>>
>>>> On Thu, May 12, 2016 at 11:05 AM, Sand Stone <sa...@gmail.com>
>>>> wrote:
>>>>
>>>>> > Is the requirement to pre-aggregate by time window?
>>>>> No, I am thinking of creating a column, say "minute". It's basically
>>>>> the minute field of the timestamp column (possibly rounded to a 5-min
>>>>> bucket depending on the needs). So it's a computed column filled in on
>>>>> data ingestion. My goal is that this field would help with data filtering
>>>>> at read/query time, say selecting a certain projection at minutes 10-15,
>>>>> to speed up the read queries.
>>>>>
>>>>
>>>> In many cases, Kudu can do this for you without having to add special
>>>> columns.  The requirements are that the timestamp is part of the primary
>>>> key, and any columns that come before the timestamp in the primary key (if
>>>> it's a compound PK), have equality predicates.  So for instance, if you
>>>> create a table such as:
>>>>
>>>> CREATE TABLE metrics (metric STRING, time TIMESTAMP, value DOUBLE);
>>>>
>>>> then queries such as
>>>>
>>>> SELECT time, value FROM metrics WHERE metric = "my-metric" AND time >
>>>> 2016-05-01T00:00 AND time < 2016-05-01T00:05
>>>>
>>>> will read only the data for that 5 minute time window from
>>>> disk.  If the query didn't have the equality predicate on the 'metric'
>>>> column, then it would do a much bigger scan + filter operation.  If you
>>>> want more background on how this is achieved, check out the partition
>>>> pruning design doc:
>>>> https://github.com/apache/incubator-kudu/blob/master/docs/design-docs/scan-optimization-partition-pruning.md
>>>> .
>>>>
>>>> - Dan
>>>>
>>>>
>>>>
>>>>> Thanks for the info; I will follow up on it.
>>>>>
>>>>> On Thu, May 12, 2016 at 10:50 AM, Dan Burkert <da...@cloudera.com>
>>>>> wrote:
>>>>>
>>>>>> Hey Sand,
>>>>>>
>>>>>> Sorry for the delayed response.  I'm not quite following your use
>>>>>> case.  Is the requirement to pre-aggregate by time window? I don't think
>>>>>> Kudu can help you directly with that (nothing built in), but you could
>>>>>> always create a separate table to store the pre-aggregated values.  As far
>>>>>> as applying functions to do row splits, that is an interesting idea, but I
>>>>>> think once Kudu has support for range bounds (the non-covering range
>>>>>> partition design doc linked above), you can simply create the bounds where
>>>>>> the function would have put them.  For example, if you want a partition for
>>>>>> every five minutes, you can create the bounds accordingly.
>>>>>>
>>>>>> Earlier this week I gave a talk on timeseries in Kudu, I've included
>>>>>> some slides that may be interesting to you.  Additionally, you may want to
>>>>>> check out https://github.com/danburkert/kudu-ts, it's a very young
>>>>>>  (not feature complete) metrics layer on top of Kudu, it may give you some
>>>>>> ideas.
>>>>>>
>>>>>> - Dan
>>>>>>
>>>>>> On Sat, May 7, 2016 at 1:28 PM, Sand Stone <sa...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks for sharing, Dan. The diagrams explained clearly how the
>>>>>>> current system works.
>>>>>>>
>>>>>>> As for what I have in mind: take the schema of <host,metric,time,...>,
>>>>>>> say, I am interested in data for the past 5 mins, 10 mins, etc. Or,
>>>>>>> aggregate at 5 mins interval for the past 3 days, 7 days, ... Looks like I
>>>>>>> need to introduce a special 5-min bar column, use that column to do range
>>>>>>> partition to spread data across the tablet servers so that I could leverage
>>>>>>> parallel filtering.
>>>>>>>
>>>>>>> The cost of this extra column (INT8) is not ideal but not too bad
>>>>>>> either (storage cost wise, compression should do wonders). So I am thinking
>>>>>>> whether it would be better to take "functions" as row split instead of only
>>>>>>> constants. Of course if business requires to drop down to 1-min bar, the
>>>>>>> data has to be re-sharded again. So a more cost effective way of doing this
>>>>>>> on a production cluster would be good.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 7, 2016 at 8:50 AM, Dan Burkert <da...@cloudera.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Sand,
>>>>>>>>
>>>>>>>> I've been working on some diagrams to help explain some of the more
>>>>>>>> advanced partitioning types, it's attached.   Still pretty rough at this
>>>>>>>> point, but the goal is to clean it up and move it into the Kudu
>>>>>>>> documentation proper.  I'm interested to hear what kind of time series you
>>>>>>>> are interested in Kudu for.  I'm tasked with improving Kudu for time
>>>>>>>> series, you can follow progress here
>>>>>>>> <https://issues.apache.org/jira/browse/KUDU-1306>. If you have any
>>>>>>>> additional ideas I'd love to hear them.  You may also be interested in a
>>>>>>>> small project that JD and I have been working on in the past week to
>>>>>>>> build an OpenTSDB style store on top of Kudu, you can find it here
>>>>>>>> <https://github.com/danburkert/kudu-ts>.  Still quite feature
>>>>>>>> limited at this point.
>>>>>>>>
>>>>>>>> - Dan
>>>>>>>>
>>>>>>>> On Fri, May 6, 2016 at 4:51 PM, Sand Stone <sa...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks. Will read.
>>>>>>>>>
>>>>>>>>> Given that I am researching time series data, row locality is
>>>>>>>>> crucial :-)
>>>>>>>>>
>>>>>>>>> On Fri, May 6, 2016 at 3:57 PM, Jean-Daniel Cryans <
>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> We do have non-covering range partitions coming in the next few
>>>>>>>>>> months, here's the design (in review):
>>>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2772/9/docs/design-docs/non-covering-range-partitions.md
>>>>>>>>>>
>>>>>>>>>> The "Background & Motivation" section should give you a good idea
>>>>>>>>>> of why I'm mentioning this.
>>>>>>>>>>
>>>>>>>>>> Meanwhile, if you don't need row locality, using hash
>>>>>>>>>> partitioning could be good enough.
>>>>>>>>>>
>>>>>>>>>> J-D
>>>>>>>>>>
>>>>>>>>>> On Fri, May 6, 2016 at 3:53 PM, Sand Stone <
>>>>>>>>>> sand.m.stone@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Makes sense.
>>>>>>>>>>>
>>>>>>>>>>> Yeah it would be cool if users could specify/control the split
>>>>>>>>>>> rows after the table is created. Now, I have to "think ahead" to pre-create
>>>>>>>>>>> the range buckets.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, May 6, 2016 at 3:49 PM, Jean-Daniel Cryans <
>>>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> You will only get 1 tablet and no data distribution, which is
>>>>>>>>>>>> bad.
>>>>>>>>>>>>
>>>>>>>>>>>> That's also how HBase works, but it will split regions as you
>>>>>>>>>>>> insert data and eventually you'll get some data distribution even if it
>>>>>>>>>>>> doesn't start in an ideal situation. Tablet splitting will come later for
>>>>>>>>>>>> Kudu.
>>>>>>>>>>>>
>>>>>>>>>>>> J-D
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, May 6, 2016 at 3:42 PM, Sand Stone <
>>>>>>>>>>>> sand.m.stone@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> One more question, how does the range partition work if I
>>>>>>>>>>>>> don't specify the split rows?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, May 6, 2016 at 3:37 PM, Sand Stone <
>>>>>>>>>>>>> sand.m.stone@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks, Misty. The "advanced" impala example helped.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I was just reading the Java API, CreateTableOptions.java; it's
>>>>>>>>>>>>>> unclear how the range partition column names are associated with the
>>>>>>>>>>>>>> partial row params in the addSplitRow API.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, May 6, 2016 at 3:08 PM, Misty Stanley-Jones <
>>>>>>>>>>>>>> mstanleyjones@cloudera.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Sand,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Please have a look at
>>>>>>>>>>>>>>> http://getkudu.io/docs/kudu_impala_integration.html#partitioning_tables
>>>>>>>>>>>>>>> and see if it is helpful to you.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Misty
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, May 6, 2016 at 2:00 PM, Sand Stone <
>>>>>>>>>>>>>>> sand.m.stone@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi, I am new to Kudu. I wonder how the split rows work. I
>>>>>>>>>>>>>>>> know from some docs, this is currently for pre-creating the table. I am
>>>>>>>>>>>>>>>> researching how to partition (hash+range) some time series test data.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is there an example, or notes somewhere I could read up on?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks much.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Partition and Split rows

Posted by Dan Burkert <da...@cloudera.com>.
On Thu, May 12, 2016 at 11:39 AM, Sand Stone <sa...@gmail.com> wrote:

I don't know how Kudu load balances the data across the tablet servers.
>

Individual tablets are replicated and balanced across all available tablet
servers, for more on that see
http://getkudu.io/docs/schema_design.html#data-distribution.



> For example, do I need to pre-calculate, every day, a list of timestamps 5
> minutes apart at table creation? [assume I have to create a new table every
> day].
>

If you wish to range partition on the time column, then yes, currently you
must specify the splits upfront during table creation (but this will change
with the non-covering range partitions work).
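For a fixed interval such as 5 minutes, the upfront split list is mechanical to pre-compute. Below is a minimal sketch in plain JDK Java (no Kudu client); the microseconds-since-epoch encoding, and the idea that each value would back one split row (e.g. one `addSplitRow()` call in the Java client), are assumptions about the wiring, not verified API usage:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public class SplitGen {
    // Pre-compute one day's interior range-split timestamps at a fixed
    // interval. Each value would back one split row at table-creation time;
    // the microseconds-since-epoch encoding here is an assumption.
    static List<Long> daySplitsMicros(Instant dayStart, Duration interval) {
        List<Long> splits = new ArrayList<>();
        Instant end = dayStart.plus(Duration.ofDays(1));
        // dayStart itself is the day's lower bound, not an interior split.
        for (Instant t = dayStart.plus(interval); t.isBefore(end); t = t.plus(interval)) {
            splits.add(t.getEpochSecond() * 1_000_000L);
        }
        return splits;
    }
}
```

Note that a 5-minute interval yields 287 interior splits (288 ranges) per day, which is a very fine-grained layout; the same helper with a coarser `Duration` produces proportionally fewer splits.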


>
> My hope, with the additional 5-min column used as the range partition
> column, is that I could spread the data evenly across the tablet servers.
>

I don't think this is meaningfully different than range partitioning on the
full time column with splits every 5 minutes.


> Also, since 5-min interval data are always colocated, the read query could
> be efficient too.
>

Data colocation is a function of the partitioning and indexing.  As I
mentioned before, if you have timestamp as part of your primary key then
you can guarantee that scans specifying a time range are efficient. Overall
it sounds like you are attempting to get fast scans by creating many fine
grained partitions, as you might with Parquet.  This won't be an efficient
strategy in Kudu, since each tablet server should only have on the order of
10-20 tablets.  Instead, take advantage of the index capability of Primary
Keys.

- Dan
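The point above about leaning on the primary-key index rather than many tiny partitions can be illustrated with a toy model; this is only an analogy for sorted-key seek behavior, not Kudu's actual storage code:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class PkScanSketch {
    // Toy model of a primary-key-indexed scan: rows are kept sorted by
    // (metric, time), like a compound PK ordering. With an equality
    // predicate on the key prefix and a [lo, hi) range on time, the matching
    // rows form one contiguous run, so the scan seeks once and stops early
    // instead of filtering the whole table. Illustration only.
    static final class Row {
        final String metric;
        final long time;
        Row(String metric, long time) { this.metric = metric; this.time = time; }
    }

    static final Comparator<Row> PK_ORDER =
            Comparator.comparing((Row r) -> r.metric).thenComparingLong(r -> r.time);

    static List<Row> scan(List<Row> sortedByPk, String metric, long lo, long hi) {
        int from = Collections.binarySearch(sortedByPk, new Row(metric, lo), PK_ORDER);
        if (from < 0) from = -from - 1;              // first row >= (metric, lo)
        List<Row> out = new ArrayList<>();
        for (int i = from; i < sortedByPk.size(); i++) {
            Row r = sortedByPk.get(i);
            if (!r.metric.equals(metric) || r.time >= hi) break;  // stop early
            out.add(r);
        }
        return out;
    }
}
```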


> On Thu, May 12, 2016 at 11:13 AM, Dan Burkert <da...@cloudera.com> wrote:
>
>> Forgot to add the PK specification to the CREATE TABLE, it should have
>> read as follows:
>>
>> CREATE TABLE metrics (metric STRING, time TIMESTAMP, value DOUBLE)
>> PRIMARY KEY (metric, time);
>>
>> - Dan
>>
>>
>> On Thu, May 12, 2016 at 11:12 AM, Dan Burkert <da...@cloudera.com> wrote:
>>
>>>
>>> On Thu, May 12, 2016 at 11:05 AM, Sand Stone <sa...@gmail.com>
>>> wrote:
>>>
>>>> > Is the requirement to pre-aggregate by time window?
>>>> No, I am thinking to create a column say, "minute". It's basically the
>>>> minute field of the timestamp column (even rounded to a 5-min bucket depending
>>>> on the needs). So it's a computed column being filled in on data ingestion.
>>>> My goal is that this field would help with data filtering at read/query
>>>> time, say select certain projection at minute 10-15, to speed up the read
>>>> queries.
>>>>
>>>
>>> In many cases, Kudu can do this for you without having to add special
>>> columns.  The requirements are that the timestamp is part of the primary
>>> key, and any columns that come before the timestamp in the primary key (if
>>> it's a compound PK), have equality predicates.  So for instance, if you
>>> create a table such as:
>>>
>>> CREATE TABLE metrics (metric STRING, time TIMESTAMP, value DOUBLE);
>>>
>>> then queries such as
>>>
>>> SELECT time, value FROM metrics WHERE metric = "my-metric" AND time >
>>> 2016-05-01T00:00 AND time < 2016-05-01T00:05
>>>
>>> Then only the data for that 5 minute time window will be read from
>>> disk.  If the query didn't have the equality predicate on the 'metric'
>>> column, then it would do a much bigger scan + filter operation.  If you
>>> want more background on how this is achieved, check out the partition
>>> pruning design doc:
>>> https://github.com/apache/incubator-kudu/blob/master/docs/design-docs/scan-optimization-partition-pruning.md
>>> .
>>>
>>> - Dan
>>>
>>>
>>>
>>>> Thanks for the info., I will follow them.
>>>>
>>>> On Thu, May 12, 2016 at 10:50 AM, Dan Burkert <da...@cloudera.com> wrote:
>>>>
>>>>> Hey Sand,
>>>>>
>>>>> Sorry for the delayed response.  I'm not quite following your use
>>>>> case.  Is the requirement to pre-aggregate by time window? I don't think
>>>>> Kudu can help you directly with that (nothing built in), but you could
>>>>> always create a separate table to store the pre-aggregated values.  As far
>>>>> as applying functions to do row splits, that is an interesting idea, but I
>>>>> think once Kudu has support for range bounds (the non-covering range
>>>>> partition design doc linked above), you can simply create the bounds where
>>>>> the function would have put them.  For example, if you want a partition for
>>>>> every five minutes, you can create the bounds accordingly.
>>>>>
>>>>> Earlier this week I gave a talk on timeseries in Kudu, I've included
>>>>> some slides that may be interesting to you.  Additionally, you may want to
>>>>> check out https://github.com/danburkert/kudu-ts, it's a very young
>>>>>  (not feature complete) metrics layer on top of Kudu, it may give you some
>>>>> ideas.
>>>>>
>>>>> - Dan
>>>>>
>>>>> On Sat, May 7, 2016 at 1:28 PM, Sand Stone <sa...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks for sharing, Dan. The diagrams explained clearly how the
>>>>>> current system works.
>>>>>>
>>>>>> As for what I have in mind: take the schema of <host,metric,time,...>;
>>>>>> say, I am interested in data for the past 5 mins, 10 mins, etc. Or,
>>>>>> aggregate at 5 mins interval for the past 3 days, 7 days, ... Looks like I
>>>>>> need to introduce a special 5-min bar column, use that column to do range
>>>>>> partition to spread data across the tablet servers so that I could leverage
>>>>>> parallel filtering.
>>>>>>
>>>>>> The cost of this extra column (INT8) is not ideal but not too bad
>>>>>> either (storage cost wise, compression should do wonders). So I am thinking
>>>>>> whether it would be better to take "functions" as row split instead of only
>>>>>> constants. Of course if business requires to drop down to 1-min bar, the
>>>>>> data has to be re-sharded again. So a more cost effective way of doing this
>>>>>> on a production cluster would be good.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, May 7, 2016 at 8:50 AM, Dan Burkert <da...@cloudera.com> wrote:
>>>>>>
>>>>>>> Hi Sand,
>>>>>>>
>>>>>>> I've been working on some diagrams to help explain some of the more
>>>>>>> advanced partitioning types, it's attached.   Still pretty rough at this
>>>>>>> point, but the goal is to clean it up and move it into the Kudu
>>>>>>> documentation proper.  I'm interested to hear what kind of time series you
>>>>>>> are interested in Kudu for.  I'm tasked with improving Kudu for time
>>>>>>> series, you can follow progress here
>>>>>>> <https://issues.apache.org/jira/browse/KUDU-1306>. If you have any
>>>>>>> additional ideas I'd love to hear them.  You may also be interested in a
>>>>>>> small project that JD and I have been working on in the past week to
>>>>>>> build an OpenTSDB style store on top of Kudu, you can find it here
>>>>>>> <https://github.com/danburkert/kudu-ts>.  Still quite feature
>>>>>>> limited at this point.
>>>>>>>
>>>>>>> - Dan
>>>>>>>
>>>>>>> On Fri, May 6, 2016 at 4:51 PM, Sand Stone <sa...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks. Will read.
>>>>>>>>
>>>>>>>> Given that I am researching time series data, row locality is
>>>>>>>> crucial :-)
>>>>>>>>
>>>>>>>> On Fri, May 6, 2016 at 3:57 PM, Jean-Daniel Cryans <
>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>
>>>>>>>>> We do have non-covering range partitions coming in the next few
>>>>>>>>> months, here's the design (in review):
>>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2772/9/docs/design-docs/non-covering-range-partitions.md
>>>>>>>>>
>>>>>>>>> The "Background & Motivation" section should give you a good idea
>>>>>>>>> of why I'm mentioning this.
>>>>>>>>>
>>>>>>>>> Meanwhile, if you don't need row locality, using hash partitioning
>>>>>>>>> could be good enough.
>>>>>>>>>
>>>>>>>>> J-D
>>>>>>>>>
>>>>>>>>> On Fri, May 6, 2016 at 3:53 PM, Sand Stone <sand.m.stone@gmail.com
>>>>>>>>> > wrote:
>>>>>>>>>
>>>>>>>>>> Makes sense.
>>>>>>>>>>
>>>>>>>>>> Yeah it would be cool if users could specify/control the split
>>>>>>>>>> rows after the table is created. Now, I have to "think ahead" to pre-create
>>>>>>>>>> the range buckets.
>>>>>>>>>>
>>>>>>>>>> On Fri, May 6, 2016 at 3:49 PM, Jean-Daniel Cryans <
>>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> You will only get 1 tablet and no data distribution, which is
>>>>>>>>>>> bad.
>>>>>>>>>>>
>>>>>>>>>>> That's also how HBase works, but it will split regions as you
>>>>>>>>>>> insert data and eventually you'll get some data distribution even if it
>>>>>>>>>>> doesn't start in an ideal situation. Tablet splitting will come later for
>>>>>>>>>>> Kudu.
>>>>>>>>>>>
>>>>>>>>>>> J-D
>>>>>>>>>>>
>>>>>>>>>>> On Fri, May 6, 2016 at 3:42 PM, Sand Stone <
>>>>>>>>>>> sand.m.stone@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> One more question, how does the range partition work if I
>>>>>>>>>>>> don't specify the split rows?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, May 6, 2016 at 3:37 PM, Sand Stone <
>>>>>>>>>>>> sand.m.stone@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks, Misty. The "advanced" impala example helped.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I was just reading the Java API, CreateTableOptions.java; it's
>>>>>>>>>>>>> unclear how the range partition column names are associated with the
>>>>>>>>>>>>> partial row params in the addSplitRow API.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, May 6, 2016 at 3:08 PM, Misty Stanley-Jones <
>>>>>>>>>>>>> mstanleyjones@cloudera.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Sand,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please have a look at
>>>>>>>>>>>>>> http://getkudu.io/docs/kudu_impala_integration.html#partitioning_tables
>>>>>>>>>>>>>> and see if it is helpful to you.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Misty
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, May 6, 2016 at 2:00 PM, Sand Stone <
>>>>>>>>>>>>>> sand.m.stone@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi, I am new to Kudu. I wonder how the split rows work. I
>>>>>>>>>>>>>>> know from some docs, this is currently for pre-creating the table. I am
>>>>>>>>>>>>>>> researching how to partition (hash+range) some time series test data.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is there an example, or notes somewhere I could read up on?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks much.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Partition and Split rows

Posted by Sand Stone <sa...@gmail.com>.
Thanks, Dan.

In your scheme, I assume you suggest the range partition on the timestamp.
I don't know how Kudu load balances the data across the tablet servers. For
example, do I need to pre-calculate, every day, a list of timestamps 5
minutes apart at table creation? [assume I have to create a new table every
day].

My hope, with the additional 5-min column used as the range partition
column, is that I could spread the data evenly across the tablet servers.
Once partition-level deletion works, I don't need to re-create a table.
Also, since 5-min interval data are always colocated, the read query could
be efficient too.

P.S.: there are some cases where I would like to compute aggregations across
all metrics at 5-min intervals.
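The computed 5-min column described in this thread would be filled in by the ingest path; the derivation itself is just a floor. A minimal sketch (the microsecond unit and the 5-minute width are assumptions chosen to match the discussion):

```java
public class MinuteBucket {
    // Floor an event timestamp to the start of its bucket. Writing this
    // value into a column at ingest time gives the coarse "5-min bar"
    // discussed in the thread: all events in the same window share one
    // bucket value, so a range partition or predicate on the column groups
    // them together.
    static long bucketStartMicros(long eventMicros, long bucketMinutes) {
        long bucketMicros = bucketMinutes * 60L * 1_000_000L;
        return (eventMicros / bucketMicros) * bucketMicros;  // floor for t >= 0
    }
}
```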


On Thu, May 12, 2016 at 11:13 AM, Dan Burkert <da...@cloudera.com> wrote:

> Forgot to add the PK specification to the CREATE TABLE, it should have
> read as follows:
>
> CREATE TABLE metrics (metric STRING, time TIMESTAMP, value DOUBLE)
> PRIMARY KEY (metric, time);
>
> - Dan
>
>
> On Thu, May 12, 2016 at 11:12 AM, Dan Burkert <da...@cloudera.com> wrote:
>
>>
>> On Thu, May 12, 2016 at 11:05 AM, Sand Stone <sa...@gmail.com>
>> wrote:
>>
>>> > Is the requirement to pre-aggregate by time window?
>>> No, I am thinking to create a column say, "minute". It's basically the
>>> minute field of the timestamp column (even rounded to a 5-min bucket depending
>>> on the needs). So it's a computed column being filled in on data ingestion.
>>> My goal is that this field would help with data filtering at read/query
>>> time, say select certain projection at minute 10-15, to speed up the read
>>> queries.
>>>
>>
>> In many cases, Kudu can do this for you without having to add special
>> columns.  The requirements are that the timestamp is part of the primary
>> key, and any columns that come before the timestamp in the primary key (if
>> it's a compound PK), have equality predicates.  So for instance, if you
>> create a table such as:
>>
>> CREATE TABLE metrics (metric STRING, time TIMESTAMP, value DOUBLE);
>>
>> then queries such as
>>
>> SELECT time, value FROM metrics WHERE metric = "my-metric" AND time >
>> 2016-05-01T00:00 AND time < 2016-05-01T00:05
>>
>> Then only the data for that 5 minute time window will be read from disk.
>> If the query didn't have the equality predicate on the 'metric' column,
>> then it would do a much bigger scan + filter operation.  If you want more
>> background on how this is achieved, check out the partition pruning design
>> doc:
>> https://github.com/apache/incubator-kudu/blob/master/docs/design-docs/scan-optimization-partition-pruning.md
>> .
>>
>> - Dan
>>
>>
>>
>>> Thanks for the info., I will follow them.
>>>
>>> On Thu, May 12, 2016 at 10:50 AM, Dan Burkert <da...@cloudera.com> wrote:
>>>
>>>> Hey Sand,
>>>>
>>>> Sorry for the delayed response.  I'm not quite following your use
>>>> case.  Is the requirement to pre-aggregate by time window? I don't think
>>>> Kudu can help you directly with that (nothing built in), but you could
>>>> always create a separate table to store the pre-aggregated values.  As far
>>>> as applying functions to do row splits, that is an interesting idea, but I
>>>> think once Kudu has support for range bounds (the non-covering range
>>>> partition design doc linked above), you can simply create the bounds where
>>>> the function would have put them.  For example, if you want a partition for
>>>> every five minutes, you can create the bounds accordingly.
>>>>
>>>> Earlier this week I gave a talk on timeseries in Kudu, I've included
>>>> some slides that may be interesting to you.  Additionally, you may want to
>>>> check out https://github.com/danburkert/kudu-ts, it's a very young
>>>>  (not feature complete) metrics layer on top of Kudu, it may give you some
>>>> ideas.
>>>>
>>>> - Dan
>>>>
>>>> On Sat, May 7, 2016 at 1:28 PM, Sand Stone <sa...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks for sharing, Dan. The diagrams explained clearly how the
>>>>> current system works.
>>>>>
>>>>> As for what I have in mind: take the schema of <host,metric,time,...>;
>>>>> say, I am interested in data for the past 5 mins, 10 mins, etc. Or,
>>>>> aggregate at 5 mins interval for the past 3 days, 7 days, ... Looks like I
>>>>> need to introduce a special 5-min bar column, use that column to do range
>>>>> partition to spread data across the tablet servers so that I could leverage
>>>>> parallel filtering.
>>>>>
>>>>> The cost of this extra column (INT8) is not ideal but not too bad
>>>>> either (storage cost wise, compression should do wonders). So I am thinking
>>>>> whether it would be better to take "functions" as row split instead of only
>>>>> constants. Of course if business requires to drop down to 1-min bar, the
>>>>> data has to be re-sharded again. So a more cost effective way of doing this
>>>>> on a production cluster would be good.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Sat, May 7, 2016 at 8:50 AM, Dan Burkert <da...@cloudera.com> wrote:
>>>>>
>>>>>> Hi Sand,
>>>>>>
>>>>>> I've been working on some diagrams to help explain some of the more
>>>>>> advanced partitioning types, it's attached.   Still pretty rough at this
>>>>>> point, but the goal is to clean it up and move it into the Kudu
>>>>>> documentation proper.  I'm interested to hear what kind of time series you
>>>>>> are interested in Kudu for.  I'm tasked with improving Kudu for time
>>>>>> series, you can follow progress here
>>>>>> <https://issues.apache.org/jira/browse/KUDU-1306>. If you have any
>>>>>> additional ideas I'd love to hear them.  You may also be interested in a
>>>>>> small project that JD and I have been working on in the past week to
>>>>>> build an OpenTSDB style store on top of Kudu, you can find it here
>>>>>> <https://github.com/danburkert/kudu-ts>.  Still quite feature
>>>>>> limited at this point.
>>>>>>
>>>>>> - Dan
>>>>>>
>>>>>> On Fri, May 6, 2016 at 4:51 PM, Sand Stone <sa...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks. Will read.
>>>>>>>
>>>>>>> Given that I am researching time series data, row locality is
>>>>>>> crucial :-)
>>>>>>>
>>>>>>> On Fri, May 6, 2016 at 3:57 PM, Jean-Daniel Cryans <
>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>
>>>>>>>> We do have non-covering range partitions coming in the next few
>>>>>>>> months, here's the design (in review):
>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2772/9/docs/design-docs/non-covering-range-partitions.md
>>>>>>>>
>>>>>>>> The "Background & Motivation" section should give you a good idea
>>>>>>>> of why I'm mentioning this.
>>>>>>>>
>>>>>>>> Meanwhile, if you don't need row locality, using hash partitioning
>>>>>>>> could be good enough.
>>>>>>>>
>>>>>>>> J-D
>>>>>>>>
>>>>>>>> On Fri, May 6, 2016 at 3:53 PM, Sand Stone <sa...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Makes sense.
>>>>>>>>>
>>>>>>>>> Yeah it would be cool if users could specify/control the split
>>>>>>>>> rows after the table is created. Now, I have to "think ahead" to pre-create
>>>>>>>>> the range buckets.
>>>>>>>>>
>>>>>>>>> On Fri, May 6, 2016 at 3:49 PM, Jean-Daniel Cryans <
>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> You will only get 1 tablet and no data distribution, which is bad.
>>>>>>>>>>
>>>>>>>>>> That's also how HBase works, but it will split regions as you
>>>>>>>>>> insert data and eventually you'll get some data distribution even if it
>>>>>>>>>> doesn't start in an ideal situation. Tablet splitting will come later for
>>>>>>>>>> Kudu.
>>>>>>>>>>
>>>>>>>>>> J-D
>>>>>>>>>>
>>>>>>>>>> On Fri, May 6, 2016 at 3:42 PM, Sand Stone <
>>>>>>>>>> sand.m.stone@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> One more question, how does the range partition work if I don't
>>>>>>>>>>> specify the split rows?
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>>
>>>>>>>>>>> On Fri, May 6, 2016 at 3:37 PM, Sand Stone <
>>>>>>>>>>> sand.m.stone@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks, Misty. The "advanced" impala example helped.
>>>>>>>>>>>>
>>>>>>>>>>>> I was just reading the Java API, CreateTableOptions.java; it's
>>>>>>>>>>>> unclear how the range partition column names are associated with the
>>>>>>>>>>>> partial row params in the addSplitRow API.
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, May 6, 2016 at 3:08 PM, Misty Stanley-Jones <
>>>>>>>>>>>> mstanleyjones@cloudera.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Sand,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please have a look at
>>>>>>>>>>>>> http://getkudu.io/docs/kudu_impala_integration.html#partitioning_tables
>>>>>>>>>>>>> and see if it is helpful to you.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Misty
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, May 6, 2016 at 2:00 PM, Sand Stone <
>>>>>>>>>>>>> sand.m.stone@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi, I am new to Kudu. I wonder how the split rows work. I
>>>>>>>>>>>>>> know from some docs, this is currently for pre-creating the table. I am
>>>>>>>>>>>>>> researching how to partition (hash+range) some time series test data.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is there an example, or notes somewhere I could read up on?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks much.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Partition and Split rows

Posted by Dan Burkert <da...@cloudera.com>.
Forgot to add the PK specification to the CREATE TABLE, it should have read
as follows:

CREATE TABLE metrics (metric STRING, time TIMESTAMP, value DOUBLE)
PRIMARY KEY (metric, time);

- Dan


On Thu, May 12, 2016 at 11:12 AM, Dan Burkert <da...@cloudera.com> wrote:

>
> On Thu, May 12, 2016 at 11:05 AM, Sand Stone <sa...@gmail.com>
> wrote:
>
>> > Is the requirement to pre-aggregate by time window?
>> No, I am thinking to create a column say, "minute". It's basically the
>> minute field of the timestamp column (even rounded to a 5-min bucket depending
>> on the needs). So it's a computed column being filled in on data ingestion.
>> My goal is that this field would help with data filtering at read/query
>> time, say select certain projection at minute 10-15, to speed up the read
>> queries.
>>
>
> In many cases, Kudu can do this for you without having to add special
> columns.  The requirements are that the timestamp is part of the primary
> key, and any columns that come before the timestamp in the primary key (if
> it's a compound PK), have equality predicates.  So for instance, if you
> create a table such as:
>
> CREATE TABLE metrics (metric STRING, time TIMESTAMP, value DOUBLE);
>
> then queries such as
>
> SELECT time, value FROM metrics WHERE metric = "my-metric" AND time >
> 2016-05-01T00:00 AND time < 2016-05-01T00:05
>
> Then only the data for that 5 minute time window will be read from disk.
> If the query didn't have the equality predicate on the 'metric' column,
> then it would do a much bigger scan + filter operation.  If you want more
> background on how this is achieved, check out the partition pruning design
> doc:
> https://github.com/apache/incubator-kudu/blob/master/docs/design-docs/scan-optimization-partition-pruning.md
> .
>
> - Dan
>
>
>
>> Thanks for the info., I will follow them.
>>
>> On Thu, May 12, 2016 at 10:50 AM, Dan Burkert <da...@cloudera.com> wrote:
>>
>>> Hey Sand,
>>>
>>> Sorry for the delayed response.  I'm not quite following your use case.
>>> Is the requirement to pre-aggregate by time window? I don't think Kudu can
>>> help you directly with that (nothing built in), but you could always create
>>> a separate table to store the pre-aggregated values.  As far as applying
>>> functions to do row splits, that is an interesting idea, but I think once
>>> Kudu has support for range bounds (the non-covering range partition design
>>> doc linked above), you can simply create the bounds where the function
>>> would have put them.  For example, if you want a partition for every five
>>> minutes, you can create the bounds accordingly.
>>>
>>> Earlier this week I gave a talk on timeseries in Kudu, I've included
>>> some slides that may be interesting to you.  Additionally, you may want to
>>> check out https://github.com/danburkert/kudu-ts, it's a very young
>>>  (not feature complete) metrics layer on top of Kudu, it may give you some
>>> ideas.
>>>
>>> - Dan
>>>
>>> On Sat, May 7, 2016 at 1:28 PM, Sand Stone <sa...@gmail.com>
>>> wrote:
>>>
>>>> Thanks for sharing, Dan. The diagrams explained clearly how the current
>>>> system works.
>>>>
>>>> As for what I have in mind: take the schema of <host,metric,time,...>;
>>>> say, I am interested in data for the past 5 mins, 10 mins, etc. Or,
>>>> aggregate at 5 mins interval for the past 3 days, 7 days, ... Looks like I
>>>> need to introduce a special 5-min bar column, use that column to do range
>>>> partition to spread data across the tablet servers so that I could leverage
>>>> parallel filtering.
>>>>
>>>> The cost of this extra column (INT8) is not ideal but not too bad
>>>> either (storage cost wise, compression should do wonders). So I am thinking
>>>> whether it would be better to take "functions" as row split instead of only
>>>> constants. Of course if business requires to drop down to 1-min bar, the
>>>> data has to be re-sharded again. So a more cost effective way of doing this
>>>> on a production cluster would be good.
>>>>
>>>>
>>>>
>>>>
>>>> On Sat, May 7, 2016 at 8:50 AM, Dan Burkert <da...@cloudera.com> wrote:
>>>>
>>>>> Hi Sand,
>>>>>
>>>>> I've been working on some diagrams to help explain some of the more
>>>>> advanced partitioning types, it's attached.   Still pretty rough at this
>>>>> point, but the goal is to clean it up and move it into the Kudu
>>>>> documentation proper.  I'm interested to hear what kind of time series you
>>>>> are interested in Kudu for.  I'm tasked with improving Kudu for time
>>>>> series, you can follow progress here
>>>>> <https://issues.apache.org/jira/browse/KUDU-1306>. If you have any
>>>>> additional ideas I'd love to hear them.  You may also be interested in a
>>>>> small project that JD and I have been working on in the past week to
>>>>> build an OpenTSDB style store on top of Kudu, you can find it here
>>>>> <https://github.com/danburkert/kudu-ts>.  Still quite feature limited
>>>>> at this point.
>>>>>
>>>>> - Dan
>>>>>
>>>>> On Fri, May 6, 2016 at 4:51 PM, Sand Stone <sa...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks. Will read.
>>>>>>
>>>>>> Given that I am researching time series data, row locality is crucial
>>>>>> :-)
>>>>>>
>>>>>> On Fri, May 6, 2016 at 3:57 PM, Jean-Daniel Cryans <
>>>>>> jdcryans@apache.org> wrote:
>>>>>>
>>>>>>> We do have non-covering range partitions coming in the next few
>>>>>>> months, here's the design (in review):
>>>>>>> http://gerrit.cloudera.org:8080/#/c/2772/9/docs/design-docs/non-covering-range-partitions.md
>>>>>>>
>>>>>>> The "Background & Motivation" section should give you a good idea of
>>>>>>> why I'm mentioning this.
>>>>>>>
>>>>>>> Meanwhile, if you don't need row locality, using hash partitioning
>>>>>>> could be good enough.
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Fri, May 6, 2016 at 3:53 PM, Sand Stone <sa...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Makes sense.
>>>>>>>>
>>>>>>>> Yeah it would be cool if users could specify/control the split rows
>>>>>>>> after the table is created. Now, I have to "think ahead" to pre-create the
>>>>>>>> range buckets.
>>>>>>>>
>>>>>>>> On Fri, May 6, 2016 at 3:49 PM, Jean-Daniel Cryans <
>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>
>>>>>>>>> You will only get 1 tablet and no data distribution, which is bad.
>>>>>>>>>
>>>>>>>>> That's also how HBase works, but it will split regions as you
>>>>>>>>> insert data and eventually you'll get some data distribution even if it
>>>>>>>>> doesn't start in an ideal situation. Tablet splitting will come later for
>>>>>>>>> Kudu.
>>>>>>>>>
>>>>>>>>> J-D
>>>>>>>>>
>>>>>>>>> On Fri, May 6, 2016 at 3:42 PM, Sand Stone <sand.m.stone@gmail.com
>>>>>>>>> > wrote:
>>>>>>>>>
>>>>>>>>>> One more question, how does the range partition work if I don't
>>>>>>>>>> specify the split rows?
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>> On Fri, May 6, 2016 at 3:37 PM, Sand Stone <
>>>>>>>>>> sand.m.stone@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks, Misty. The "advanced" impala example helped.
>>>>>>>>>>>
>>>>>>>>>>> I was just reading the Java API, CreateTableOptions.java; it's
>>>>>>>>>>> unclear how the range partition column names are associated with the
>>>>>>>>>>> partial row params in the addSplitRow API.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, May 6, 2016 at 3:08 PM, Misty Stanley-Jones <
>>>>>>>>>>> mstanleyjones@cloudera.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Sand,
>>>>>>>>>>>>
>>>>>>>>>>>> Please have a look at
>>>>>>>>>>>> http://getkudu.io/docs/kudu_impala_integration.html#partitioning_tables
>>>>>>>>>>>> and see if it is helpful to you.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Misty
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, May 6, 2016 at 2:00 PM, Sand Stone <
>>>>>>>>>>>> sand.m.stone@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi, I am new to Kudu. I wonder how the split rows work. I know
>>>>>>>>>>>>> from some docs, this is currently for pre-creation the table. I am
>>>>>>>>>>>>> researching how to partition (hash+range) some time series test data.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is there an example? or notes somewhere I could read upon.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks much.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Partition and Split rows

Posted by Dan Burkert <da...@cloudera.com>.
On Thu, May 12, 2016 at 11:05 AM, Sand Stone <sa...@gmail.com> wrote:

> > Is the requirement to pre-aggregate by time window?
> No, I am thinking to create a column say, "minute". It's basically the
> minute field of the timestamp column(even round to 5-min bucket depending
> on the needs). So it's a computed column being filled in on data ingestion.
> My goal is that this field would help with data filtering at read/query
> time, say select certain projection at minute 10-15, to speed up the read
> queries.
>

In many cases, Kudu can do this for you without having to add special
columns.  The requirements are that the timestamp is part of the primary
key, and any columns that come before the timestamp in the primary key (if
it's a compound PK), have equality predicates.  So for instance, if you
create a table such as:

CREATE TABLE metrics (metric STRING, time TIMESTAMP, value DOUBLE);

then queries such as

SELECT time, value FROM metrics WHERE metric = "my-metric" AND time >
2016-05-01T00:00 AND time < 2016-05-01T00:05

Then only the data for that 5 minute time window will be read from disk.
If the query didn't have the equality predicate on the 'metric' column,
then it would do a much bigger scan + filter operation.  If you want more
background on how this is achieved, check out the partition pruning design
doc:
https://github.com/apache/incubator-kudu/blob/master/docs/design-docs/scan-optimization-partition-pruning.md
.
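
The pruning described above can be sketched in plain Java (a hypothetical
helper, not the actual Kudu client API): given ascending range split points on
the time column, a bounded scan only needs the tablets whose ranges overlap the
predicate interval.

```java
import java.util.ArrayList;
import java.util.List;

public class RangePruneSketch {
    // Given ascending split points on the time column, tablet i covers
    // [splits[i-1], splits[i]), with open-ended first and last ranges.
    // Returns the indices of the tablets a scan over [lo, hi) must touch.
    static List<Integer> tabletsToScan(long[] splits, long lo, long hi) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i <= splits.length; i++) {
            long rangeLo = (i == 0) ? Long.MIN_VALUE : splits[i - 1];
            long rangeHi = (i == splits.length) ? Long.MAX_VALUE : splits[i];
            if (lo < rangeHi && hi > rangeLo) {  // the intervals overlap
                out.add(i);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        long[] splits = {100, 200, 300};  // four tablets
        // A scan over [150, 250) only needs tablets 1 and 2.
        System.out.println(tabletsToScan(splits, 150, 250));  // [1, 2]
    }
}
```

Without the equality predicate on the columns that precede the timestamp in the
key, every tablet would match and the same loop degenerates to a full scan.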

- Dan



> Thanks for the info., I will follow them.
>

Re: Partition and Split rows

Posted by Sand Stone <sa...@gmail.com>.
> Is the requirement to pre-aggregate by time window?
No, I am thinking of creating a column, say "minute". It's basically the
minute field of the timestamp column (possibly rounded to a 5-min bucket,
depending on the needs), so it's a computed column filled in at data ingestion.
My goal is for this field to help with data filtering at read/query
time, say selecting a certain projection at minutes 10-15, to speed up read
queries.
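
A bucket column like this is just integer arithmetic at ingest time; for
example (a hypothetical helper, nothing Kudu-specific):

```java
public class MinuteBucket {
    // Five-minute bucket within the hour (0..11), derived from a unix
    // timestamp in seconds; small enough to store in an INT8 column.
    static byte fiveMinuteBucket(long epochSeconds) {
        long minuteOfHour = (epochSeconds / 60) % 60;
        return (byte) (minuteOfHour / 5);
    }

    public static void main(String[] args) {
        long t = 3600 + 12 * 60;  // 01:12:00 -> minute 12 -> bucket 2
        System.out.println(fiveMinuteBucket(t));  // 2
    }
}
```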

Thanks for the info., I will follow them.


Re: Partition and Split rows

Posted by Dan Burkert <da...@cloudera.com>.
Hey Sand,

Sorry for the delayed response.  I'm not quite following your use case.  Is
the requirement to pre-aggregate by time window? I don't think Kudu can
help you directly with that (nothing built in), but you could always create
a separate table to store the pre-aggregated values.  As far as applying
functions to do row splits, that is an interesting idea, but I think once
Kudu has support for range bounds (the non-covering range partition design
doc linked above), you can simply create the bounds where the function
would have put them.  For example, if you want a partition for every five
minutes, you can create the bounds accordingly.
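
Generating the bounds "where the function would have put them" is just a loop;
a minimal sketch (timestamps as unix seconds, plain arithmetic rather than
actual Kudu client calls):

```java
import java.util.ArrayList;
import java.util.List;

public class FiveMinuteBounds {
    static final long FIVE_MIN = 5 * 60;  // seconds

    // Lower bounds of the consecutive five-minute partitions covering
    // [start, end), with start rounded down to a bucket edge.
    static List<Long> partitionLowerBounds(long start, long end) {
        List<Long> bounds = new ArrayList<>();
        long aligned = start - (start % FIVE_MIN);
        for (long t = aligned; t < end; t += FIVE_MIN) {
            bounds.add(t);
        }
        return bounds;
    }

    public static void main(String[] args) {
        // One hour of data starting at unix time 3600 -> 12 partitions.
        System.out.println(partitionLowerBounds(3600, 7200).size());  // 12
    }
}
```

Each lower bound would become one range partition bound at table-creation (or,
once non-covering ranges land, partition-creation) time.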

Earlier this week I gave a talk on timeseries in Kudu, I've included some
slides that may be interesting to you.  Additionally, you may want to check
out https://github.com/danburkert/kudu-ts, it's a very young (not feature
complete) metrics layer on top of Kudu; it may give you some ideas.

- Dan


Re: Partition and Split rows

Posted by Sand Stone <sa...@gmail.com>.
Thanks for sharing, Dan. The diagrams explained clearly how the current
system works.

As for what I have in mind: take a schema of <host,metric,time,...>. Say I
am interested in data for the past 5 mins, 10 mins, etc., or in aggregating at
5-min intervals over the past 3 days, 7 days, ... It looks like I need to
introduce a special 5-min bar column and use that column for range partitioning,
to spread data across the tablet servers so that I can leverage parallel
filtering.

The cost of this extra column (INT8) is not ideal but not too bad either
(storage-cost wise, compression should do wonders). So I am wondering whether
it would be better to accept "functions" as row splits instead of only
constants. Of course, if the business requires dropping down to a 1-min bar,
the data has to be re-sharded again, so a more cost-effective way of doing this
on a production cluster would be good.





Re: Fwd: Partition and Split rows

Posted by Casey Ching <ca...@cloudera.com>.
I’ll talk with Dan, I’m guessing he’s referring to some Kudu timestamp features, not sure. I am sure that Impala doesn’t yet support Kudu’s timestamps. I just filed an issue for adding this.


On May 17, 2016 at 11:47:30 AM, Amit Adhau (amit.adhau@globant.com) wrote:

Hi Casey,

As per Dan's reply below, it is supported in Kudu 0.8. I'm a bit confused now; can you please confirm, so that we can either use int64 in Impala if it is not supported, or use timestamp (but that gives an error in 0.8)?

Q: Are you going to provide the Kimpala merge or Kudu timestamp support in Impala/Tableau in the near future?


TIMESTAMP typed columns should work in Impala/Kudu as of the 0.8 release (the latest); however, I know there were a few bugs in the previous releases.  If you're using the latest and are still seeing errors, please send the error and/or file an issue on JIRA.

Thanks,
Amit

On May 18, 2016 12:03 AM, "Casey Ching" <ca...@cloudera.com> wrote:
Hi Amit,

Impala doesn’t yet support Kudu’s timestamps. I think the best solution we have for now is to use Unix time style timestamp values. Impala has functions to convert to/from ints/timestamps (example CAST can be used).
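
The conversion at the application boundary is straightforward; a small sketch
in plain Java (java.time only, nothing Impala- or Kudu-specific):

```java
import java.time.Instant;

public class UnixTimeColumn {
    // Convert an application timestamp to the BIGINT unix-seconds value
    // stored in the Kudu column, and back again for display.
    static long toColumnValue(Instant ts) {
        return ts.getEpochSecond();
    }

    static Instant fromColumnValue(long unixSeconds) {
        return Instant.ofEpochSecond(unixSeconds);
    }

    public static void main(String[] args) {
        long stored = toColumnValue(Instant.parse("2016-05-01T00:00:00Z"));
        System.out.println(stored);  // 1462060800
    }
}
```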

Casey

On May 17, 2016 at 12:26:18 AM, Amit Adhau (amit.adhau@globant.com) wrote:

Thanks Dan,

In our CDH 5.7 cluster, we are using Kudu parcel "0.8.0-1.kudu0.8.0.p0.11" and Impala_Kudu parcel "2.6.0-1.cdh5.8.0.p0.17".
we have a table created in kudu using java code as per below;


CREATE EXTERNAL TABLE `tablename` (
`long_value` BIGINT,
`timestamp_value` TIMESTAMP,
`string_value` STRING,
`addrmetric` STRING
)
TBLPROPERTIES(
  'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
  'kudu.table_name' = 'tablename',
  'kudu.master_addresses' = 'cluster:7051',
  'kudu.key_columns' = 'long_value, timestamp_value'
);
But when we try to create the same table in Impala, we get the error below:

May 17, 4:00:51.477 AM	INFO	jni-util.cc:177		

com.cloudera.impala.common.ImpalaRuntimeException: Type TIMESTAMP is not supported in Kudu
        at com.cloudera.impala.util.KuduUtil.fromImpalaType(KuduUtil.java:251)
        at com.cloudera.impala.util.KuduUtil.compareSchema(KuduUtil.java:67)
        at com.cloudera.impala.catalog.delegates.KuduDdlDelegate.createTable(KuduDdlDelegate.java:87)
        at com.cloudera.impala.service.CatalogOpExecutor.createTable(CatalogOpExecutor.java:1516)
        at com.cloudera.impala.service.CatalogOpExecutor.createTable(CatalogOpExecutor.java:1365)
        at com.cloudera.impala.service.CatalogOpExecutor.execDdlRequest(CatalogOpExecutor.java:249)
        at com.cloudera.impala.service.JniCatalog.execDdl(JniCatalog.java:131)



May 17, 4:00:51.479 AM	INFO	status.cc:112		

ImpalaRuntimeException: Type TIMESTAMP is not supported in Kudu
May 17, 4:00:51.479 AM	ERROR	catalog-server.cc:64		

ImpalaRuntimeException: Type TIMESTAMP is not supported in Kudu

Can you please help us on the same.

Thanks,

Amit
---------- Forwarded message ----------
From: Dan Burkert <da...@cloudera.com>
Date: Tue, May 17, 2016 at 2:02 AM
Subject: Re: Partition and Split rows
To: user@kudu.incubator.apache.org


Hi Amit, responses inline
 
Q: Can I fetch the Kudu timestamp data from Tableau/Pentaho for reporting and analytics purposes, or do I need the Int64 datatype only?

Is Tableau/Pentaho using Impala to query?  If so, see the answers below.
 
Q: Are you going to provide the Kimpala merge or Kudu timestamp support in Impala/Tableau in the near future?
 
TIMESTAMP typed columns should work in Impala/Kudu as of the 0.8 release (the latest); however, I know there were a few bugs in the previous releases.  If you're using the latest and are still seeing errors, please send the error and/or file an issue on JIRA.
 
Q: At the moment, instead of Timestamp we are using the Int64 type in Kudu. Will it be equally helpful for performance if we apply the partition-and-split-by guidance given for the timestamp datatype? E.g., can we get better performance with a composite key on a Kudu table defined as 'metric(s), Int64 (holding timestamp data)', with a partition on the Int64 column and a split-by clause?

Internally, the TIMESTAMP type is just an alias to INT64, so it has the exact same performance characteristics.  The only difference is how the values are displayed in log messages and on the web UI.
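
Concretely (an assumption worth checking against your client version): the
int64 cell holds microseconds since the unix epoch, so converting from
milliseconds is a single multiply:

```java
public class KuduTimeSketch {
    // A Kudu timestamp cell is physically an int64; here we assume it holds
    // microseconds since the unix epoch (verify against your client version).
    static long millisToKuduMicros(long epochMillis) {
        return epochMillis * 1000L;
    }

    static long kuduMicrosToMillis(long micros) {
        return micros / 1000L;
    }

    public static void main(String[] args) {
        // 2016-05-06T21:00:46Z in millis -> micros.
        System.out.println(millisToKuduMicros(1462568446000L));  // 1462568446000000
    }
}
```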

- Dan
 
On Sat, May 7, 2016 at 9:20 PM, Dan Burkert <da...@cloudera.com> wrote:
Hi Sand,

I've been working on some diagrams to help explain some of the more advanced partitioning types, it's attached.   Still pretty rough at this point, but the goal is to clean it up and move it into the Kudu documentation proper.  I'm interested to hear what kind of time series you are interested in Kudu for.  I'm tasked with improving Kudu for time series, you can follow progress here. If you have any additional ideas I'd love to hear them.  You may also be interested in a small project that a JD and I have been working on in the past week to build an OpenTSDB style store on top of Kudu, you can find it here.  Still quite feature limited at this point.

- Dan

On Fri, May 6, 2016 at 4:51 PM, Sand Stone <sa...@gmail.com> wrote:
Thanks. Will read. 

Given that I am researching time series data, row locality is crucial :-)  

On Fri, May 6, 2016 at 3:57 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
We do have non-covering range partitions coming in the next few months, here's the design (in review): http://gerrit.cloudera.org:8080/#/c/2772/9/docs/design-docs/non-covering-range-partitions.md

The "Background & Motivation" section should give you a good idea of why I'm mentioning this.

Meanwhile, if you don't need row locality, using hash partitioning could be good enough.

J-D

On Fri, May 6, 2016 at 3:53 PM, Sand Stone <sa...@gmail.com> wrote:
Makes sense. 

Yeah it would be cool if users could specify/control the split rows after the table is created. Now, I have to "think ahead" to pre-create the range buckets. 

On Fri, May 6, 2016 at 3:49 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
You will only get 1 tablet and no data distribution, which is bad.

That's also how HBase works, but it will split regions as you insert data and eventually you'll get some data distribution even if it doesn't start in an ideal situation. Tablet splitting will come later for Kudu.

J-D

On Fri, May 6, 2016 at 3:42 PM, Sand Stone <sa...@gmail.com> wrote:
One more question: how does the range partition work if I don't specify the split rows? 

Thanks! 

On Fri, May 6, 2016 at 3:37 PM, Sand Stone <sa...@gmail.com> wrote:
Thanks, Misty. The "advanced" impala example helped. 

I was just reading the Java API, CreateTableOptions.java; it's unclear how the range partition column names are associated with the partial-row params in the addSplitRow API.

On Fri, May 6, 2016 at 3:08 PM, Misty Stanley-Jones <ms...@cloudera.com> wrote:
Hi Sand,

Please have a look at http://getkudu.io/docs/kudu_impala_integration.html#partitioning_tables and see if it is helpful to you.

Thanks,
Misty

On Fri, May 6, 2016 at 2:00 PM, Sand Stone <sa...@gmail.com> wrote:
Hi, I am new to Kudu. I wonder how the split rows work. I know from some docs that this is currently for pre-creating the table. I am researching how to partition (hash+range) some time series test data. 

Is there an example, or notes somewhere I could read up on this? 

Thanks much. 











--
Thanks & Regards,
Amit Adhau | Data Architect

GLOBANT | IND:+91 9821518132














The information contained in this e-mail may be confidential. It has been sent for the sole use of the intended recipient(s). If the reader of this message is not an intended recipient, you are hereby notified that any unauthorized review, use, disclosure, dissemination, distribution or copying of this communication, or any of its contents, is strictly prohibited. If you have received it by mistake please let us know by e-mail immediately and delete it from your system. Many thanks.
 
La información contenida en este mensaje puede ser confidencial. Ha sido enviada para el uso exclusivo del destinatario(s) previsto. Si el lector de este mensaje no fuera el destinatario previsto, por el presente queda Ud. notificado que cualquier lectura, uso, publicación, diseminación, distribución o copiado de esta comunicación o su contenido está estrictamente prohibido. En caso de que Ud. hubiera recibido este mensaje por error le agradeceremos notificarnos por e-mail inmediatamente y eliminarlo de su sistema. Muchas gracias.









Re: Fwd: Partition and Split rows

Posted by Amit Adhau <am...@globant.com>.
Hi Casey,

As per Dan's reply below, it is supported in Kudu 0.8, so I'm a bit confused now. Can you please confirm, so that we can either use INT64 in Impala (if TIMESTAMP is not supported) or use TIMESTAMP (which is currently giving an error in 0.8)?

Q: Are you going to provide the Kimpala merge or kudu timestamp support in
Impala/Tableu in near future.


TIMESTAMP typed columns should work in Impala/Kudu as of the 0.8 release
(the latest), however I know there were a few bugs in the previous
releases.  If you're using the latest and are still seeing errors, please
send the error and/or file an issue on JIRA.

Thanks,
Amit
On May 18, 2016 12:03 AM, "Casey Ching" <ca...@cloudera.com> wrote:

> Hi Amit,
>
> Impala doesn’t yet support Kudu’s timestamps. I think the best solution we
> have for now is to use Unix time style timestamp values. Impala has
> functions to convert to/from ints/timestamps (example CAST can be used).
>
> Casey
>
> On May 17, 2016 at 12:26:18 AM, Amit Adhau (amit.adhau@globant.com) wrote:
>
> Thanks Dan,
>
> In our CDH 5.7 cluster, we are using Kudu parcel "0.8.0-1.kudu0.8.0.p0.11" and
> Impala_Kudu parcel "2.6.0-1.cdh5.8.0.p0.17".
> we have a table created in kudu using java code as per below;
>
>
> CREATE EXTERNAL TABLE `tablename` (
> `long_value` BIGINT,
> `timestamp_value` TIMESTAMP,
> `string_value` STRING,
>
>
> `addrmetric` STRING
> )
> TBLPROPERTIES(
>   'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
>   'kudu.table_name' = 'tablename',
>   'kudu.master_addresses' = 'cluster:7051',
>   'kudu.key_columns' = 'long_value, timestamp_value'
> );
>
> *But, when we try to create the same table in Impala, we are getting below
> error;*
>
> May 17, 4:00:51.477 AM INFO jni-util.cc:177
>
>
> com.cloudera.impala.common.ImpalaRuntimeException: Type TIMESTAMP is not supported in Kudu
>         at com.cloudera.impala.util.KuduUtil.fromImpalaType(KuduUtil.java:251)
>         at com.cloudera.impala.util.KuduUtil.compareSchema(KuduUtil.java:67)
>         at com.cloudera.impala.catalog.delegates.KuduDdlDelegate.createTable(KuduDdlDelegate.java:87)
>         at com.cloudera.impala.service.CatalogOpExecutor.createTable(CatalogOpExecutor.java:1516)
>         at com.cloudera.impala.service.CatalogOpExecutor.createTable(CatalogOpExecutor.java:1365)
>         at com.cloudera.impala.service.CatalogOpExecutor.execDdlRequest(CatalogOpExecutor.java:249)
>         at com.cloudera.impala.service.JniCatalog.execDdl(JniCatalog.java:131)
>
>
>  May 17, 4:00:51.479 AM INFO status.cc:112
>
>
> ImpalaRuntimeException: Type TIMESTAMP is not supported in Kudu
>
> May 17, 4:00:51.479 AM ERROR catalog-server.cc:64
>
>
> ImpalaRuntimeException: Type TIMESTAMP is not supported in Kudu
>
>
> Can you please help us on the same.
>
> Thanks,
>
> Amit
> ---------- Forwarded message ----------
> From: Dan Burkert <da...@cloudera.com>
> Date: Tue, May 17, 2016 at 2:02 AM
> Subject: Re: Partition and Split rows
> To: user@kudu.incubator.apache.org
>
>
> Hi Amit, responses inline
>
>
>> Q: Can I fetch the kudu Timestamp data from Tableu/Pentaho for reporting
>> and analytics purpose or I need Int64 datatype only.
>>
>
> Is Tableu/Pentaho using Impala to query?  If so, see the answers below.
>
>
>> Q: Are you going to provide the Kimpala merge or kudu timestamp support
>> in Impala/Tableu in near future.
>>
>
> TIMESTAMP typed columns should work in Impala/Kudu as of the 0.8 release
> (the latest), however I know there were a few bugs in the previous
> releases.  If your using the latest and are still seeing errors, please
> send the error and/or file an issue on JIRA.
>
>
>> Q: At this moment, instead of Timestamp we are using Int64 type in kudu,
>> will it be equally helpful for performance, if we use the partition and
>> split by explanation given for timestamp datatype? e.g. can we get a better
>> performance for a composite key on a given kudu table defined as
>> 'metric(s),Int64(has timestamp data) and with a partition on Int64 column
>> with Split by clause?
>>
>
> Internally, the TIMESTAMP type is just an alias to INT64, so it has the
> exact same performance characteristics.  The only difference is how the
> values are displayed in log messages and on the web UI.
>
> - Dan
>
>
>> On Sat, May 7, 2016 at 9:20 PM, Dan Burkert <da...@cloudera.com> wrote:
>>
>>> Hi Sand,
>>>
>>> I've been working on some diagrams to help explain some of the more
>>> advanced partitioning types, it's attached.   Still pretty rough at this
>>> point, but the goal is to clean it up and move it into the Kudu
>>> documentation proper.  I'm interested to hear what kind of time series you
>>> are interested in Kudu for.  I'm tasked with improving Kudu for time
>>> series, you can follow progress here
>>> <https://issues.apache.org/jira/browse/KUDU-1306>. If you have any
>>> additional ideas I'd love to hear them.  You may also be interested in a
>>> small project that a JD and I have been working on in the past week to
>>> build an OpenTSDB style store on top of Kudu, you can find it here
>>> <https://github.com/danburkert/kudu-ts>.  Still quite feature limited
>>> at this point.
>>>
>>> - Dan
>>>
>>> On Fri, May 6, 2016 at 4:51 PM, Sand Stone <sa...@gmail.com>
>>> wrote:
>>>
>>>> Thanks. Will read.
>>>>
>>>> Given that I am researching time series data, row locality is crucial
>>>> :-)
>>>>
>>>> On Fri, May 6, 2016 at 3:57 PM, Jean-Daniel Cryans <jdcryans@apache.org
>>>> > wrote:
>>>>
>>>>> We do have non-covering range partitions coming in the next few
>>>>> months, here's the design (in review):
>>>>> http://gerrit.cloudera.org:8080/#/c/2772/9/docs/design-docs/non-covering-range-partitions.md
>>>>>
>>>>> The "Background & Motivation" section should give you a good idea of
>>>>> why I'm mentioning this.
>>>>>
>>>>> Meanwhile, if you don't need row locality, using hash partitioning
>>>>> could be good enough.
>>>>>
>>>>> J-D
>>>>>
>>>>> On Fri, May 6, 2016 at 3:53 PM, Sand Stone <sa...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Makes sense.
>>>>>>
>>>>>> Yeah it would be cool if users could specify/control the split rows
>>>>>> after the table is created. Now, I have to "think ahead" to pre-create the
>>>>>> range buckets.
>>>>>>
>>>>>> On Fri, May 6, 2016 at 3:49 PM, Jean-Daniel Cryans <
>>>>>> jdcryans@apache.org> wrote:
>>>>>>
>>>>>>> You will only get 1 tablet and no data distribution, which is bad.
>>>>>>>
>>>>>>> That's also how HBase works, but it will split regions as you insert
>>>>>>> data and eventually you'll get some data distribution even if it doesn't
>>>>>>> start in an ideal situation. Tablet splitting will come later for Kudu.
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Fri, May 6, 2016 at 3:42 PM, Sand Stone <sa...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> One more questions, how does the range partition work if I don't
>>>>>>>> specify the split rows?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> On Fri, May 6, 2016 at 3:37 PM, Sand Stone <sa...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks, Misty. The "advanced" impala example helped.
>>>>>>>>>
>>>>>>>>> I was just reading the Java API,CreateTableOptions.java, it's
>>>>>>>>> unclear how the range partition column names associated with the partial
>>>>>>>>> rows params in the addSplitRow API.
>>>>>>>>>
>>>>>>>>> On Fri, May 6, 2016 at 3:08 PM, Misty Stanley-Jones <
>>>>>>>>> mstanleyjones@cloudera.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Sand,
>>>>>>>>>>
>>>>>>>>>> Please have a look at
>>>>>>>>>> http://getkudu.io/docs/kudu_impala_integration.html#partitioning_tables
>>>>>>>>>> and see if it is helpful to you.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Misty
>>>>>>>>>>
>>>>>>>>>> On Fri, May 6, 2016 at 2:00 PM, Sand Stone <
>>>>>>>>>> sand.m.stone@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi, I am new to Kudu. I wonder how the split rows work. I know
>>>>>>>>>>> from some docs, this is currently for pre-creation the table. I am
>>>>>>>>>>> researching how to partition (hash+range) some time series test data.
>>>>>>>>>>>
>>>>>>>>>>> Is there an example? or notes somewhere I could read upon.
>>>>>>>>>>>
>>>>>>>>>>> Thanks much.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>> --
>> Thanks & Regards,
>>
>> *Amit Adhau* | Data Architect
>>
>> * GLOBANT* | IND:+91 9821518132
>>
>
>
>
> --
> Thanks & Regards,
>
> *Amit Adhau* | Data Architect
>
> * GLOBANT* | IND:+91 9821518132
>
>
>



Re: Fwd: Partition and Split rows

Posted by Casey Ching <ca...@cloudera.com>.
Hi Amit,

Impala doesn't yet support Kudu's timestamps. I think the best solution we have for now is to use Unix-time-style timestamp values. Impala has functions to convert to/from ints/timestamps (for example, CAST can be used).
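The Unix-time workaround above stores whole seconds since the epoch in an integer column and converts at the edges. A sketch of that convention on the application side (assuming second precision, which matches what Impala's integer/timestamp conversion functions use):

```python
import calendar
from datetime import datetime, timezone

def to_unix_seconds(dt: datetime) -> int:
    # Whole seconds since the Unix epoch, stored in a BIGINT/INT64 column.
    return calendar.timegm(dt.utctimetuple())

def from_unix_seconds(s: int) -> datetime:
    # Back to a UTC datetime for display or application logic.
    return datetime.fromtimestamp(s, tz=timezone.utc)

assert to_unix_seconds(datetime(1970, 1, 1, 0, 1, tzinfo=timezone.utc)) == 60
```

Integer seconds sort the same way the timestamps do, so range partitioning and primary-key ordering on the column behave as expected.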

Casey

On May 17, 2016 at 12:26:18 AM, Amit Adhau (amit.adhau@globant.com) wrote:

Thanks Dan,

In our CDH 5.7 cluster, we are using Kudu parcel "0.8.0-1.kudu0.8.0.p0.11" and Impala_Kudu parcel "2.6.0-1.cdh5.8.0.p0.17".
We have a table created in Kudu using Java code, as shown below:


CREATE EXTERNAL TABLE `tablename` (
`long_value` BIGINT,
`timestamp_value` TIMESTAMP,
`string_value` STRING,

`addrmetric` STRING
)
TBLPROPERTIES(
  'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
  'kudu.table_name' = 'tablename',
  'kudu.master_addresses' = 'cluster:7051',
  'kudu.key_columns' = 'long_value, timestamp_value'
);
But when we try to create the same table in Impala, we get the error below:

May 17, 4:00:51.477 AM	INFO	jni-util.cc:177		

com.cloudera.impala.common.ImpalaRuntimeException: Type TIMESTAMP is not supported in Kudu
        at com.cloudera.impala.util.KuduUtil.fromImpalaType(KuduUtil.java:251)
        at com.cloudera.impala.util.KuduUtil.compareSchema(KuduUtil.java:67)
        at com.cloudera.impala.catalog.delegates.KuduDdlDelegate.createTable(KuduDdlDelegate.java:87)
        at com.cloudera.impala.service.CatalogOpExecutor.createTable(CatalogOpExecutor.java:1516)
        at com.cloudera.impala.service.CatalogOpExecutor.createTable(CatalogOpExecutor.java:1365)
        at com.cloudera.impala.service.CatalogOpExecutor.execDdlRequest(CatalogOpExecutor.java:249)
        at com.cloudera.impala.service.JniCatalog.execDdl(JniCatalog.java:131)



May 17, 4:00:51.479 AM	INFO	status.cc:112		

ImpalaRuntimeException: Type TIMESTAMP is not supported in Kudu
May 17, 4:00:51.479 AM	ERROR	catalog-server.cc:64		

ImpalaRuntimeException: Type TIMESTAMP is not supported in Kudu

Can you please help us on the same.

Thanks,

Amit
---------- Forwarded message ----------
From: Dan Burkert <da...@cloudera.com>
Date: Tue, May 17, 2016 at 2:02 AM
Subject: Re: Partition and Split rows
To: user@kudu.incubator.apache.org


Hi Amit, responses inline
 
Q: Can I fetch the kudu Timestamp data from Tableu/Pentaho for reporting and analytics purpose or I need Int64 datatype only.

Is Tableu/Pentaho using Impala to query?  If so, see the answers below.
 
Q: Are you going to provide the Kimpala merge or kudu timestamp support in Impala/Tableu in near future.
 
TIMESTAMP typed columns should work in Impala/Kudu as of the 0.8 release (the latest), however I know there were a few bugs in the previous releases.  If you're using the latest and are still seeing errors, please send the error and/or file an issue on JIRA.
 
Q: At this moment, instead of Timestamp we are using Int64 type in kudu, will it be equally helpful for performance, if we use the partition and split by explanation given for timestamp datatype? e.g. can we get a better performance for a composite key on a given kudu table defined as 'metric(s),Int64(has timestamp data) and with a partition on Int64 column with Split by clause?

Internally, the TIMESTAMP type is just an alias to INT64, so it has the exact same performance characteristics.  The only difference is how the values are displayed in log messages and on the web UI.

- Dan
 
On Sat, May 7, 2016 at 9:20 PM, Dan Burkert <da...@cloudera.com> wrote:
Hi Sand,

I've been working on some diagrams to help explain some of the more advanced partitioning types; they're attached. Still pretty rough at this point, but the goal is to clean them up and move them into the Kudu documentation proper. I'm interested to hear what kind of time series you are interested in Kudu for. I'm tasked with improving Kudu for time series; you can follow progress here. If you have any additional ideas I'd love to hear them. You may also be interested in a small project that JD and I have been working on in the past week to build an OpenTSDB-style store on top of Kudu; you can find it here. Still quite feature-limited at this point.

- Dan

On Fri, May 6, 2016 at 4:51 PM, Sand Stone <sa...@gmail.com> wrote:
Thanks. Will read. 

Given that I am researching time series data, row locality is crucial :-)  

On Fri, May 6, 2016 at 3:57 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
We do have non-covering range partitions coming in the next few months, here's the design (in review): http://gerrit.cloudera.org:8080/#/c/2772/9/docs/design-docs/non-covering-range-partitions.md

The "Background & Motivation" section should give you a good idea of why I'm mentioning this.

Meanwhile, if you don't need row locality, using hash partitioning could be good enough.

J-D

On Fri, May 6, 2016 at 3:53 PM, Sand Stone <sa...@gmail.com> wrote:
Makes sense. 

Yeah it would be cool if users could specify/control the split rows after the table is created. Now, I have to "think ahead" to pre-create the range buckets. 

On Fri, May 6, 2016 at 3:49 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
You will only get 1 tablet and no data distribution, which is bad.

That's also how HBase works, but it will split regions as you insert data and eventually you'll get some data distribution even if it doesn't start in an ideal situation. Tablet splitting will come later for Kudu.

J-D

On Fri, May 6, 2016 at 3:42 PM, Sand Stone <sa...@gmail.com> wrote:
One more question: how does the range partition work if I don't specify the split rows? 

Thanks! 

On Fri, May 6, 2016 at 3:37 PM, Sand Stone <sa...@gmail.com> wrote:
Thanks, Misty. The "advanced" impala example helped. 

I was just reading the Java API, CreateTableOptions.java; it's unclear how the range partition column names are associated with the partial-row params in the addSplitRow API.

On Fri, May 6, 2016 at 3:08 PM, Misty Stanley-Jones <ms...@cloudera.com> wrote:
Hi Sand,

Please have a look at http://getkudu.io/docs/kudu_impala_integration.html#partitioning_tables and see if it is helpful to you.

Thanks,
Misty

On Fri, May 6, 2016 at 2:00 PM, Sand Stone <sa...@gmail.com> wrote:
Hi, I am new to Kudu. I wonder how the split rows work. I know from some docs that this is currently for pre-creating the table. I am researching how to partition (hash+range) some time series test data. 

Is there an example, or notes somewhere I could read up on this? 

Thanks much. 


















Fwd: Partition and Split rows

Posted by Amit Adhau <am...@globant.com>.
Thanks Dan,

In our CDH 5.7 cluster, we are using Kudu parcel "0.8.0-1.kudu0.8.0.p0.11" and
Impala_Kudu parcel "2.6.0-1.cdh5.8.0.p0.17".
We have a table created in Kudu using Java code, as shown below:

CREATE EXTERNAL TABLE `tablename` (
`long_value` BIGINT,
`timestamp_value` TIMESTAMP,
`string_value` STRING,

`addrmetric` STRING
)
TBLPROPERTIES(
  'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
  'kudu.table_name' = 'tablename',
  'kudu.master_addresses' = 'cluster:7051',
  'kudu.key_columns' = 'long_value, timestamp_value'
);

*But, when we try to create the same table in Impala, we are getting below
error;*

May 17, 4:00:51.477 AM INFO jni-util.cc:177

com.cloudera.impala.common.ImpalaRuntimeException: Type TIMESTAMP is
not supported in Kudu
	at com.cloudera.impala.util.KuduUtil.fromImpalaType(KuduUtil.java:251)
	at com.cloudera.impala.util.KuduUtil.compareSchema(KuduUtil.java:67)
	at com.cloudera.impala.catalog.delegates.KuduDdlDelegate.createTable(KuduDdlDelegate.java:87)
	at com.cloudera.impala.service.CatalogOpExecutor.createTable(CatalogOpExecutor.java:1516)
	at com.cloudera.impala.service.CatalogOpExecutor.createTable(CatalogOpExecutor.java:1365)
	at com.cloudera.impala.service.CatalogOpExecutor.execDdlRequest(CatalogOpExecutor.java:249)
	at com.cloudera.impala.service.JniCatalog.execDdl(JniCatalog.java:131)


May 17, 4:00:51.479 AM INFO status.cc:112

ImpalaRuntimeException: Type TIMESTAMP is not supported in Kudu

May 17, 4:00:51.479 AM ERROR catalog-server.cc:64

ImpalaRuntimeException: Type TIMESTAMP is not supported in Kudu


Can you please help us on the same.

Thanks,

Amit
---------- Forwarded message ----------
From: Dan Burkert <da...@cloudera.com>
Date: Tue, May 17, 2016 at 2:02 AM
Subject: Re: Partition and Split rows
To: user@kudu.incubator.apache.org


Hi Amit, responses inline


> Q: Can I fetch the kudu Timestamp data from Tableu/Pentaho for reporting
> and analytics purpose or I need Int64 datatype only.
>

Is Tableu/Pentaho using Impala to query?  If so, see the answers below.


> Q: Are you going to provide the Kimpala merge or kudu timestamp support in
> Impala/Tableu in near future.
>

TIMESTAMP typed columns should work in Impala/Kudu as of the 0.8 release
(the latest), however I know there were a few bugs in the previous
releases.  If you're using the latest and are still seeing errors, please
send the error and/or file an issue on JIRA.


> Q: At this moment, instead of Timestamp we are using Int64 type in kudu,
> will it be equally helpful for performance, if we use the partition and
> split by explanation given for timestamp datatype? e.g. can we get a better
> performance for a composite key on a given kudu table defined as
> 'metric(s),Int64(has timestamp data) and with a partition on Int64 column
> with Split by clause?
>

Internally, the TIMESTAMP type is just an alias to INT64, so it has the
exact same performance characteristics.  The only difference is how the
values are displayed in log messages and on the web UI.

- Dan


> On Sat, May 7, 2016 at 9:20 PM, Dan Burkert <da...@cloudera.com> wrote:
>
>> Hi Sand,
>>
>> I've been working on some diagrams to help explain some of the more
>> advanced partitioning types, it's attached.   Still pretty rough at this
>> point, but the goal is to clean it up and move it into the Kudu
>> documentation proper.  I'm interested to hear what kind of time series you
>> are interested in Kudu for.  I'm tasked with improving Kudu for time
>> series, you can follow progress here
>> <https://issues.apache.org/jira/browse/KUDU-1306>. If you have any
>> additional ideas I'd love to hear them.  You may also be interested in a
>> small project that a JD and I have been working on in the past week to
>> build an OpenTSDB style store on top of Kudu, you can find it here
>> <https://github.com/danburkert/kudu-ts>.  Still quite feature limited at
>> this point.
>>
>> - Dan
>>
>> On Fri, May 6, 2016 at 4:51 PM, Sand Stone <sa...@gmail.com>
>> wrote:
>>
>>> Thanks. Will read.
>>>
>>> Given that I am researching time series data, row locality is crucial
>>> :-)
>>>
>>> On Fri, May 6, 2016 at 3:57 PM, Jean-Daniel Cryans <jd...@apache.org>
>>> wrote:
>>>
>>>> We do have non-covering range partitions coming in the next few months,
>>>> here's the design (in review):
>>>> http://gerrit.cloudera.org:8080/#/c/2772/9/docs/design-docs/non-covering-range-partitions.md
>>>>
>>>> The "Background & Motivation" section should give you a good idea of
>>>> why I'm mentioning this.
>>>>
>>>> Meanwhile, if you don't need row locality, using hash partitioning
>>>> could be good enough.
>>>>
>>>> J-D
>>>>
>>>> On Fri, May 6, 2016 at 3:53 PM, Sand Stone <sa...@gmail.com>
>>>> wrote:
>>>>
>>>>> Makes sense.
>>>>>
>>>>> Yeah it would be cool if users could specify/control the split rows
>>>>> after the table is created. Now, I have to "think ahead" to pre-create the
>>>>> range buckets.
>>>>>
>>>>> On Fri, May 6, 2016 at 3:49 PM, Jean-Daniel Cryans <
>>>>> jdcryans@apache.org> wrote:
>>>>>
>>>>>> You will only get 1 tablet and no data distribution, which is bad.
>>>>>>
>>>>>> That's also how HBase works, but it will split regions as you insert
>>>>>> data and eventually you'll get some data distribution even if it doesn't
>>>>>> start in an ideal situation. Tablet splitting will come later for Kudu.
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Fri, May 6, 2016 at 3:42 PM, Sand Stone <sa...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> One more questions, how does the range partition work if I don't
>>>>>>> specify the split rows?
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> On Fri, May 6, 2016 at 3:37 PM, Sand Stone <sa...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks, Misty. The "advanced" impala example helped.
>>>>>>>>
>>>>>>>> I was just reading the Java API,CreateTableOptions.java, it's
>>>>>>>> unclear how the range partition column names associated with the partial
>>>>>>>> rows params in the addSplitRow API.
>>>>>>>>
>>>>>>>> On Fri, May 6, 2016 at 3:08 PM, Misty Stanley-Jones <
>>>>>>>> mstanleyjones@cloudera.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Sand,
>>>>>>>>>
>>>>>>>>> Please have a look at
>>>>>>>>> http://getkudu.io/docs/kudu_impala_integration.html#partitioning_tables
>>>>>>>>> and see if it is helpful to you.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Misty
>>>>>>>>>
>>>>>>>>> On Fri, May 6, 2016 at 2:00 PM, Sand Stone <sand.m.stone@gmail.com
>>>>>>>>> > wrote:
>>>>>>>>>
>>>>>>>>>> Hi, I am new to Kudu. I wonder how the split rows work. I know
>>>>>>>>>> from some docs, this is currently for pre-creation the table. I am
>>>>>>>>>> researching how to partition (hash+range) some time series test data.
>>>>>>>>>>
>>>>>>>>>> Is there an example? or notes somewhere I could read upon.
>>>>>>>>>>
>>>>>>>>>> Thanks much.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
> --
> Thanks & Regards,
>
> *Amit Adhau* | Data Architect
>
> *GLOBANT* | IND:+91 9821518132
>
> [image: Facebook] <https://www.facebook.com/Globant>
>
> [image: Twitter] <http://www.twitter.com/globant>
>
> [image: Youtube] <http://www.youtube.com/Globant>
>
> [image: Linkedin] <http://www.linkedin.com/company/globant>
>
> [image: Pinterest] <http://pinterest.com/globant/>
>
> [image: Globant] <http://www.globant.com/>
>
> The information contained in this e-mail may be confidential. It has been
> sent for the sole use of the intended recipient(s). If the reader of this
> message is not an intended recipient, you are hereby notified that any
> unauthorized review, use, disclosure, dissemination, distribution or
> copying of this communication, or any of its contents,
> is strictly prohibited. If you have received it by mistake please let us
> know by e-mail immediately and delete it from your system. Many thanks.
>
>
>
> La información contenida en este mensaje puede ser confidencial. Ha sido
> enviada para el uso exclusivo del destinatario(s) previsto. Si el lector de
> este mensaje no fuera el destinatario previsto, por el presente queda Ud.
> notificado que cualquier lectura, uso, publicación, diseminación,
> distribución o copiado de esta comunicación o su contenido está
> estrictamente prohibido. En caso de que Ud. hubiera recibido este mensaje
> por error le agradeceremos notificarnos por e-mail inmediatamente y
> eliminarlo de su sistema. Muchas gracias.
>
>




Re: Partition and Split rows

Posted by Dan Burkert <da...@cloudera.com>.
Hi Amit, responses inline


> Q: Can I fetch the kudu Timestamp data from Tableu/Pentaho for reporting
> and analytics purpose or I need Int64 datatype only.
>

Is Tableu/Pentaho using Impala to query?  If so, see the answers below.


> Q: Are you going to provide the Kimpala merge or kudu timestamp support in
> Impala/Tableu in near future.
>

TIMESTAMP-typed columns should work in Impala/Kudu as of the 0.8 release
(the latest); however, I know there were a few bugs in previous releases.
If you're using the latest release and are still seeing errors, please send
the error and/or file an issue on JIRA.


> Q: At this moment, instead of Timestamp we are using Int64 type in kudu,
> will it be equally helpful for performance, if we use the partition and
> split by explanation given for timestamp datatype? e.g. can we get a better
> performance for a composite key on a given kudu table defined as
> 'metric(s),Int64(has timestamp data) and with a partition on Int64 column
> with Split by clause?
>

Internally, the TIMESTAMP type is just an alias to INT64, so it has the
exact same performance characteristics.  The only difference is how the
values are displayed in log messages and on the web UI.
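To make the equivalence concrete, here is a pure-Python sketch (not Kudu
client code) of the round trip between a wall-clock time and the
microseconds-since-epoch INT64 value that Kudu's timestamp columns store:

```python
from datetime import datetime, timezone

def to_micros(dt: datetime) -> int:
    """Convert an aware datetime to microseconds since the Unix epoch,
    the INT64 representation Kudu uses for timestamp values."""
    return int(dt.timestamp() * 1_000_000)

def from_micros(micros: int) -> datetime:
    """Inverse conversion, for displaying stored INT64 values as times."""
    return datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc)

ts = datetime(2016, 5, 6, 21, 0, 46, tzinfo=timezone.utc)
micros = to_micros(ts)
print(micros)               # 1462568446000000
print(from_micros(micros))  # 2016-05-06 21:00:46+00:00
```

So any tool that can read INT64 can work with the stored values; only how
they are displayed differs.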

- Dan


> On Sat, May 7, 2016 at 9:20 PM, Dan Burkert <da...@cloudera.com> wrote:
>
>> Hi Sand,
>>
>> I've been working on some diagrams to help explain some of the more
>> advanced partitioning types, it's attached.   Still pretty rough at this
>> point, but the goal is to clean it up and move it into the Kudu
>> documentation proper.  I'm interested to hear what kind of time series you
>> are interested in Kudu for.  I'm tasked with improving Kudu for time
>> series, you can follow progress here
>> <https://issues.apache.org/jira/browse/KUDU-1306>. If you have any
>> additional ideas I'd love to hear them.  You may also be interested in a
>> small project that a JD and I have been working on in the past week to
>> build an OpenTSDB style store on top of Kudu, you can find it here
>> <https://github.com/danburkert/kudu-ts>.  Still quite feature limited at
>> this point.
>>
>> - Dan
>>
>> On Fri, May 6, 2016 at 4:51 PM, Sand Stone <sa...@gmail.com>
>> wrote:
>>
>>> Thanks. Will read.
>>>
>>> Given that I am researching time series data, row locality is crucial
>>> :-)
>>>
>>> On Fri, May 6, 2016 at 3:57 PM, Jean-Daniel Cryans <jd...@apache.org>
>>> wrote:
>>>
>>>> We do have non-covering range partitions coming in the next few months,
>>>> here's the design (in review):
>>>> http://gerrit.cloudera.org:8080/#/c/2772/9/docs/design-docs/non-covering-range-partitions.md
>>>>
>>>> The "Background & Motivation" section should give you a good idea of
>>>> why I'm mentioning this.
>>>>
>>>> Meanwhile, if you don't need row locality, using hash partitioning
>>>> could be good enough.
>>>>
>>>> J-D
>>>>
>>>> On Fri, May 6, 2016 at 3:53 PM, Sand Stone <sa...@gmail.com>
>>>> wrote:
>>>>
>>>>> Makes sense.
>>>>>
>>>>> Yeah it would be cool if users could specify/control the split rows
>>>>> after the table is created. Now, I have to "think ahead" to pre-create the
>>>>> range buckets.
>>>>>
>>>>> On Fri, May 6, 2016 at 3:49 PM, Jean-Daniel Cryans <
>>>>> jdcryans@apache.org> wrote:
>>>>>
>>>>>> You will only get 1 tablet and no data distribution, which is bad.
>>>>>>
>>>>>> That's also how HBase works, but it will split regions as you insert
>>>>>> data and eventually you'll get some data distribution even if it doesn't
>>>>>> start in an ideal situation. Tablet splitting will come later for Kudu.
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Fri, May 6, 2016 at 3:42 PM, Sand Stone <sa...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> One more questions, how does the range partition work if I don't
>>>>>>> specify the split rows?
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> On Fri, May 6, 2016 at 3:37 PM, Sand Stone <sa...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks, Misty. The "advanced" impala example helped.
>>>>>>>>
>>>>>>>> I was just reading the Java API,CreateTableOptions.java, it's
>>>>>>>> unclear how the range partition column names associated with the partial
>>>>>>>> rows params in the addSplitRow API.
>>>>>>>>
>>>>>>>> On Fri, May 6, 2016 at 3:08 PM, Misty Stanley-Jones <
>>>>>>>> mstanleyjones@cloudera.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Sand,
>>>>>>>>>
>>>>>>>>> Please have a look at
>>>>>>>>> http://getkudu.io/docs/kudu_impala_integration.html#partitioning_tables
>>>>>>>>> and see if it is helpful to you.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Misty
>>>>>>>>>
>>>>>>>>> On Fri, May 6, 2016 at 2:00 PM, Sand Stone <sand.m.stone@gmail.com
>>>>>>>>> > wrote:
>>>>>>>>>
>>>>>>>>>> Hi, I am new to Kudu. I wonder how the split rows work. I know
>>>>>>>>>> from some docs, this is currently for pre-creation the table. I am
>>>>>>>>>> researching how to partition (hash+range) some time series test data.
>>>>>>>>>>
>>>>>>>>>> Is there an example? or notes somewhere I could read upon.
>>>>>>>>>>
>>>>>>>>>> Thanks much.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
>

Re: Partition and Split rows

Posted by Amit Adhau <am...@globant.com>.
Hi,

As you have explained the importance of partitioning and data splits for
time series data retrieval, I have the below questions about a limitation
we are currently hitting while implementing one of our use cases.

For date-time data we use the Timestamp datatype in Kudu, and obviously we
can partition on it. But if we need to fetch timestamp-based data from
Impala for ad-hoc analysis, we get an error suggesting the integration in
Impala is missing; I guess you will be implementing it in the Kimpala
merge.

Q: Can I fetch Kudu Timestamp data from Tableau/Pentaho for reporting and
analytics purposes, or do I need the Int64 datatype only?

Q: Are you going to provide the Kimpala merge, or Kudu timestamp support
in Impala/Tableau, in the near future?

Q: At this moment we are using the Int64 type in Kudu instead of
Timestamp. Will the partitioning and split-rows guidance given for the
Timestamp datatype be equally helpful for performance? E.g., can we get
better performance for a composite key on a Kudu table defined as
'metric(s), Int64 (holding timestamp data)', with a partition on the Int64
column and a split-rows clause?

Thanks,
Amit


On Sat, May 7, 2016 at 9:20 PM, Dan Burkert <da...@cloudera.com> wrote:

> Hi Sand,
>
> I've been working on some diagrams to help explain some of the more
> advanced partitioning types, it's attached.   Still pretty rough at this
> point, but the goal is to clean it up and move it into the Kudu
> documentation proper.  I'm interested to hear what kind of time series you
> are interested in Kudu for.  I'm tasked with improving Kudu for time
> series, you can follow progress here
> <https://issues.apache.org/jira/browse/KUDU-1306>. If you have any
> additional ideas I'd love to hear them.  You may also be interested in a
> small project that a JD and I have been working on in the past week to
> build an OpenTSDB style store on top of Kudu, you can find it here
> <https://github.com/danburkert/kudu-ts>.  Still quite feature limited at
> this point.
>
> - Dan
>
> On Fri, May 6, 2016 at 4:51 PM, Sand Stone <sa...@gmail.com> wrote:
>
>> Thanks. Will read.
>>
>> Given that I am researching time series data, row locality is crucial :-)
>>
>>
>> On Fri, May 6, 2016 at 3:57 PM, Jean-Daniel Cryans <jd...@apache.org>
>> wrote:
>>
>>> We do have non-covering range partitions coming in the next few months,
>>> here's the design (in review):
>>> http://gerrit.cloudera.org:8080/#/c/2772/9/docs/design-docs/non-covering-range-partitions.md
>>>
>>> The "Background & Motivation" section should give you a good idea of why
>>> I'm mentioning this.
>>>
>>> Meanwhile, if you don't need row locality, using hash partitioning could
>>> be good enough.
>>>
>>> J-D
>>>
>>> On Fri, May 6, 2016 at 3:53 PM, Sand Stone <sa...@gmail.com>
>>> wrote:
>>>
>>>> Makes sense.
>>>>
>>>> Yeah it would be cool if users could specify/control the split rows
>>>> after the table is created. Now, I have to "think ahead" to pre-create the
>>>> range buckets.
>>>>
>>>> On Fri, May 6, 2016 at 3:49 PM, Jean-Daniel Cryans <jdcryans@apache.org
>>>> > wrote:
>>>>
>>>>> You will only get 1 tablet and no data distribution, which is bad.
>>>>>
>>>>> That's also how HBase works, but it will split regions as you insert
>>>>> data and eventually you'll get some data distribution even if it doesn't
>>>>> start in an ideal situation. Tablet splitting will come later for Kudu.
>>>>>
>>>>> J-D
>>>>>
>>>>> On Fri, May 6, 2016 at 3:42 PM, Sand Stone <sa...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> One more questions, how does the range partition work if I don't
>>>>>> specify the split rows?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> On Fri, May 6, 2016 at 3:37 PM, Sand Stone <sa...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks, Misty. The "advanced" impala example helped.
>>>>>>>
>>>>>>> I was just reading the Java API,CreateTableOptions.java, it's
>>>>>>> unclear how the range partition column names associated with the partial
>>>>>>> rows params in the addSplitRow API.
>>>>>>>
>>>>>>> On Fri, May 6, 2016 at 3:08 PM, Misty Stanley-Jones <
>>>>>>> mstanleyjones@cloudera.com> wrote:
>>>>>>>
>>>>>>>> Hi Sand,
>>>>>>>>
>>>>>>>> Please have a look at
>>>>>>>> http://getkudu.io/docs/kudu_impala_integration.html#partitioning_tables
>>>>>>>> and see if it is helpful to you.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Misty
>>>>>>>>
>>>>>>>> On Fri, May 6, 2016 at 2:00 PM, Sand Stone <sa...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi, I am new to Kudu. I wonder how the split rows work. I know
>>>>>>>>> from some docs, this is currently for pre-creation the table. I am
>>>>>>>>> researching how to partition (hash+range) some time series test data.
>>>>>>>>>
>>>>>>>>> Is there an example? or notes somewhere I could read upon.
>>>>>>>>>
>>>>>>>>> Thanks much.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>



Re: Partition and Split rows

Posted by Dan Burkert <da...@cloudera.com>.
Hi Sand,

I've been working on some diagrams to help explain some of the more
advanced partitioning types, it's attached.   Still pretty rough at this
point, but the goal is to clean it up and move it into the Kudu
documentation proper.  I'm interested to hear what kind of time series you
are interested in Kudu for.  I'm tasked with improving Kudu for time
series, you can follow progress here
<https://issues.apache.org/jira/browse/KUDU-1306>. If you have any
additional ideas I'd love to hear them. You may also be interested in a
small project that JD and I have been working on over the past week to
build an OpenTSDB-style store on top of Kudu; you can find it here
<https://github.com/danburkert/kudu-ts>.  Still quite feature limited at
this point.

- Dan

On Fri, May 6, 2016 at 4:51 PM, Sand Stone <sa...@gmail.com> wrote:

> Thanks. Will read.
>
> Given that I am researching time series data, row locality is crucial :-)
>
> On Fri, May 6, 2016 at 3:57 PM, Jean-Daniel Cryans <jd...@apache.org>
> wrote:
>
>> We do have non-covering range partitions coming in the next few months,
>> here's the design (in review):
>> http://gerrit.cloudera.org:8080/#/c/2772/9/docs/design-docs/non-covering-range-partitions.md
>>
>> The "Background & Motivation" section should give you a good idea of why
>> I'm mentioning this.
>>
>> Meanwhile, if you don't need row locality, using hash partitioning could
>> be good enough.
>>
>> J-D
>>
>> On Fri, May 6, 2016 at 3:53 PM, Sand Stone <sa...@gmail.com>
>> wrote:
>>
>>> Makes sense.
>>>
>>> Yeah it would be cool if users could specify/control the split rows
>>> after the table is created. Now, I have to "think ahead" to pre-create the
>>> range buckets.
>>>
>>> On Fri, May 6, 2016 at 3:49 PM, Jean-Daniel Cryans <jd...@apache.org>
>>> wrote:
>>>
>>>> You will only get 1 tablet and no data distribution, which is bad.
>>>>
>>>> That's also how HBase works, but it will split regions as you insert
>>>> data and eventually you'll get some data distribution even if it doesn't
>>>> start in an ideal situation. Tablet splitting will come later for Kudu.
>>>>
>>>> J-D
>>>>
>>>> On Fri, May 6, 2016 at 3:42 PM, Sand Stone <sa...@gmail.com>
>>>> wrote:
>>>>
>>>>> One more questions, how does the range partition work if I don't
>>>>> specify the split rows?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Fri, May 6, 2016 at 3:37 PM, Sand Stone <sa...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks, Misty. The "advanced" impala example helped.
>>>>>>
>>>>>> I was just reading the Java API,CreateTableOptions.java, it's unclear
>>>>>> how the range partition column names associated with the partial rows
>>>>>> params in the addSplitRow API.
>>>>>>
>>>>>> On Fri, May 6, 2016 at 3:08 PM, Misty Stanley-Jones <
>>>>>> mstanleyjones@cloudera.com> wrote:
>>>>>>
>>>>>>> Hi Sand,
>>>>>>>
>>>>>>> Please have a look at
>>>>>>> http://getkudu.io/docs/kudu_impala_integration.html#partitioning_tables
>>>>>>> and see if it is helpful to you.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Misty
>>>>>>>
>>>>>>> On Fri, May 6, 2016 at 2:00 PM, Sand Stone <sa...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi, I am new to Kudu. I wonder how the split rows work. I know from
>>>>>>>> some docs, this is currently for pre-creation the table. I am researching
>>>>>>>> how to partition (hash+range) some time series test data.
>>>>>>>>
>>>>>>>> Is there an example? or notes somewhere I could read upon.
>>>>>>>>
>>>>>>>> Thanks much.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Partition and Split rows

Posted by Sand Stone <sa...@gmail.com>.
Thanks. Will read.

Given that I am researching time series data, row locality is crucial :-)

On Fri, May 6, 2016 at 3:57 PM, Jean-Daniel Cryans <jd...@apache.org>
wrote:

> We do have non-covering range partitions coming in the next few months,
> here's the design (in review):
> http://gerrit.cloudera.org:8080/#/c/2772/9/docs/design-docs/non-covering-range-partitions.md
>
> The "Background & Motivation" section should give you a good idea of why
> I'm mentioning this.
>
> Meanwhile, if you don't need row locality, using hash partitioning could
> be good enough.
>
> J-D
>
> On Fri, May 6, 2016 at 3:53 PM, Sand Stone <sa...@gmail.com> wrote:
>
>> Makes sense.
>>
>> Yeah it would be cool if users could specify/control the split rows after
>> the table is created. Now, I have to "think ahead" to pre-create the range
>> buckets.
>>
>> On Fri, May 6, 2016 at 3:49 PM, Jean-Daniel Cryans <jd...@apache.org>
>> wrote:
>>
>>> You will only get 1 tablet and no data distribution, which is bad.
>>>
>>> That's also how HBase works, but it will split regions as you insert
>>> data and eventually you'll get some data distribution even if it doesn't
>>> start in an ideal situation. Tablet splitting will come later for Kudu.
>>>
>>> J-D
>>>
>>> On Fri, May 6, 2016 at 3:42 PM, Sand Stone <sa...@gmail.com>
>>> wrote:
>>>
>>>> One more questions, how does the range partition work if I don't
>>>> specify the split rows?
>>>>
>>>> Thanks!
>>>>
>>>> On Fri, May 6, 2016 at 3:37 PM, Sand Stone <sa...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks, Misty. The "advanced" impala example helped.
>>>>>
>>>>> I was just reading the Java API,CreateTableOptions.java, it's unclear
>>>>> how the range partition column names associated with the partial rows
>>>>> params in the addSplitRow API.
>>>>>
>>>>> On Fri, May 6, 2016 at 3:08 PM, Misty Stanley-Jones <
>>>>> mstanleyjones@cloudera.com> wrote:
>>>>>
>>>>>> Hi Sand,
>>>>>>
>>>>>> Please have a look at
>>>>>> http://getkudu.io/docs/kudu_impala_integration.html#partitioning_tables
>>>>>> and see if it is helpful to you.
>>>>>>
>>>>>> Thanks,
>>>>>> Misty
>>>>>>
>>>>>> On Fri, May 6, 2016 at 2:00 PM, Sand Stone <sa...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi, I am new to Kudu. I wonder how the split rows work. I know from
>>>>>>> some docs, this is currently for pre-creation the table. I am researching
>>>>>>> how to partition (hash+range) some time series test data.
>>>>>>>
>>>>>>> Is there an example? or notes somewhere I could read upon.
>>>>>>>
>>>>>>> Thanks much.
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Partition and Split rows

Posted by Jean-Daniel Cryans <jd...@apache.org>.
We do have non-covering range partitions coming in the next few months,
here's the design (in review):
http://gerrit.cloudera.org:8080/#/c/2772/9/docs/design-docs/non-covering-range-partitions.md

The "Background & Motivation" section should give you a good idea of why
I'm mentioning this.

Meanwhile, if you don't need row locality, using hash partitioning could be
good enough.
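To picture the trade-off, here is a pure-Python sketch of hash bucketing
(Kudu uses its own internal hash function; md5 here is only for
illustration, and the key format is made up):

```python
import hashlib

NUM_BUCKETS = 4  # illustrative; real tables choose bucket counts per workload

def bucket_for(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Assign a row to a hash bucket by hashing its partition-key columns."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

# Writes for consecutive timestamps scatter across buckets, which balances
# load but gives up row locality: a time-range scan must touch every bucket.
keys = [f"host-01:{ts}" for ts in range(1462568446, 1462568450)]
print([bucket_for(k) for k in keys])
```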

J-D

On Fri, May 6, 2016 at 3:53 PM, Sand Stone <sa...@gmail.com> wrote:

> Makes sense.
>
> Yeah it would be cool if users could specify/control the split rows after
> the table is created. Now, I have to "think ahead" to pre-create the range
> buckets.
>
> On Fri, May 6, 2016 at 3:49 PM, Jean-Daniel Cryans <jd...@apache.org>
> wrote:
>
>> You will only get 1 tablet and no data distribution, which is bad.
>>
>> That's also how HBase works, but it will split regions as you insert data
>> and eventually you'll get some data distribution even if it doesn't start
>> in an ideal situation. Tablet splitting will come later for Kudu.
>>
>> J-D
>>
>> On Fri, May 6, 2016 at 3:42 PM, Sand Stone <sa...@gmail.com>
>> wrote:
>>
>>> One more questions, how does the range partition work if I don't specify
>>> the split rows?
>>>
>>> Thanks!
>>>
>>> On Fri, May 6, 2016 at 3:37 PM, Sand Stone <sa...@gmail.com>
>>> wrote:
>>>
>>>> Thanks, Misty. The "advanced" impala example helped.
>>>>
>>>> I was just reading the Java API,CreateTableOptions.java, it's unclear
>>>> how the range partition column names associated with the partial rows
>>>> params in the addSplitRow API.
>>>>
>>>> On Fri, May 6, 2016 at 3:08 PM, Misty Stanley-Jones <
>>>> mstanleyjones@cloudera.com> wrote:
>>>>
>>>>> Hi Sand,
>>>>>
>>>>> Please have a look at
>>>>> http://getkudu.io/docs/kudu_impala_integration.html#partitioning_tables
>>>>> and see if it is helpful to you.
>>>>>
>>>>> Thanks,
>>>>> Misty
>>>>>
>>>>> On Fri, May 6, 2016 at 2:00 PM, Sand Stone <sa...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi, I am new to Kudu. I wonder how the split rows work. I know from
>>>>>> some docs, this is currently for pre-creation the table. I am researching
>>>>>> how to partition (hash+range) some time series test data.
>>>>>>
>>>>>> Is there an example? or notes somewhere I could read upon.
>>>>>>
>>>>>> Thanks much.
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Partition and Split rows

Posted by Sand Stone <sa...@gmail.com>.
Makes sense.

Yeah it would be cool if users could specify/control the split rows after
the table is created. Now, I have to "think ahead" to pre-create the range
buckets.

On Fri, May 6, 2016 at 3:49 PM, Jean-Daniel Cryans <jd...@apache.org>
wrote:

> You will only get 1 tablet and no data distribution, which is bad.
>
> That's also how HBase works, but it will split regions as you insert data
> and eventually you'll get some data distribution even if it doesn't start
> in an ideal situation. Tablet splitting will come later for Kudu.
>
> J-D
>
> On Fri, May 6, 2016 at 3:42 PM, Sand Stone <sa...@gmail.com> wrote:
>
>> One more questions, how does the range partition work if I don't specify
>> the split rows?
>>
>> Thanks!
>>
>> On Fri, May 6, 2016 at 3:37 PM, Sand Stone <sa...@gmail.com>
>> wrote:
>>
>>> Thanks, Misty. The "advanced" impala example helped.
>>>
>>> I was just reading the Java API,CreateTableOptions.java, it's unclear
>>> how the range partition column names associated with the partial rows
>>> params in the addSplitRow API.
>>>
>>> On Fri, May 6, 2016 at 3:08 PM, Misty Stanley-Jones <
>>> mstanleyjones@cloudera.com> wrote:
>>>
>>>> Hi Sand,
>>>>
>>>> Please have a look at
>>>> http://getkudu.io/docs/kudu_impala_integration.html#partitioning_tables
>>>> and see if it is helpful to you.
>>>>
>>>> Thanks,
>>>> Misty
>>>>
>>>> On Fri, May 6, 2016 at 2:00 PM, Sand Stone <sa...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi, I am new to Kudu. I wonder how the split rows work. I know from
>>>>> some docs, this is currently for pre-creation the table. I am researching
>>>>> how to partition (hash+range) some time series test data.
>>>>>
>>>>> Is there an example? or notes somewhere I could read upon.
>>>>>
>>>>> Thanks much.
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Partition and Split rows

Posted by Jean-Daniel Cryans <jd...@apache.org>.
You will only get 1 tablet and no data distribution, which is bad.

That's also how HBase works, but it will split regions as you insert data
and eventually you'll get some data distribution even if it doesn't start
in an ideal situation. Tablet splitting will come later for Kudu.
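A pure-Python sketch of the idea (not the Kudu API): pre-created split
rows act as sorted tablet boundaries, and an empty split list is the
single-tablet case described above.

```python
import bisect

def tablet_for(ts_micros: int, splits: list) -> int:
    """Return the index of the range tablet holding ts_micros, given
    sorted split rows. An empty split list means every row lands in the
    one tablet covering (-inf, +inf)."""
    return bisect.bisect_right(splits, ts_micros)

day = 86_400 * 1_000_000                 # one day in microseconds
splits = [d * day for d in range(1, 4)]  # pre-created daily boundaries

print(tablet_for(12345, []))     # 0 -> no splits, everything in one tablet
print(tablet_for(day + 1, splits))  # 1 -> second daily tablet
```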

J-D

On Fri, May 6, 2016 at 3:42 PM, Sand Stone <sa...@gmail.com> wrote:

> One more questions, how does the range partition work if I don't specify
> the split rows?
>
> Thanks!
>
> On Fri, May 6, 2016 at 3:37 PM, Sand Stone <sa...@gmail.com> wrote:
>
>> Thanks, Misty. The "advanced" impala example helped.
>>
>> I was just reading the Java API,CreateTableOptions.java, it's unclear how
>> the range partition column names associated with the partial rows params in
>> the addSplitRow API.
>>
>> On Fri, May 6, 2016 at 3:08 PM, Misty Stanley-Jones <
>> mstanleyjones@cloudera.com> wrote:
>>
>>> Hi Sand,
>>>
>>> Please have a look at
>>> http://getkudu.io/docs/kudu_impala_integration.html#partitioning_tables
>>> and see if it is helpful to you.
>>>
>>> Thanks,
>>> Misty
>>>
>>> On Fri, May 6, 2016 at 2:00 PM, Sand Stone <sa...@gmail.com>
>>> wrote:
>>>
>>>> Hi, I am new to Kudu. I wonder how the split rows work. I know from
>>>> some docs, this is currently for pre-creation the table. I am researching
>>>> how to partition (hash+range) some time series test data.
>>>>
>>>> Is there an example? or notes somewhere I could read upon.
>>>>
>>>> Thanks much.
>>>>
>>>
>>>
>>
>

Re: Partition and Split rows

Posted by Sand Stone <sa...@gmail.com>.
One more question: how does the range partition work if I don't specify
the split rows?

Thanks!

On Fri, May 6, 2016 at 3:37 PM, Sand Stone <sa...@gmail.com> wrote:

> Thanks, Misty. The "advanced" impala example helped.
>
> I was just reading the Java API,CreateTableOptions.java, it's unclear how
> the range partition column names associated with the partial rows params in
> the addSplitRow API.
>
> On Fri, May 6, 2016 at 3:08 PM, Misty Stanley-Jones <
> mstanleyjones@cloudera.com> wrote:
>
>> Hi Sand,
>>
>> Please have a look at
>> http://getkudu.io/docs/kudu_impala_integration.html#partitioning_tables
>> and see if it is helpful to you.
>>
>> Thanks,
>> Misty
>>
>> On Fri, May 6, 2016 at 2:00 PM, Sand Stone <sa...@gmail.com>
>> wrote:
>>
>>> Hi, I am new to Kudu. I wonder how the split rows work. I know from some
>>> docs, this is currently for pre-creation the table. I am researching how to
>>> partition (hash+range) some time series test data.
>>>
>>> Is there an example? or notes somewhere I could read upon.
>>>
>>> Thanks much.
>>>
>>
>>
>

Re: Partition and Split rows

Posted by Sand Stone <sa...@gmail.com>.
Thanks, Misty. The "advanced" impala example helped.

I was just reading the Java API (CreateTableOptions.java); it's unclear
how the range partition column names are associated with the partial-row
params in the addSplitRow API.
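My current mental model (sketched in pure Python with made-up column
names, not the actual Kudu API) is that a split row is a partial row that
only needs the range-partition columns set, and those values, taken in
range-column order, form the tablet boundary. Is that right?

```python
RANGE_COLUMNS = ["metric", "ts"]  # hypothetical range partition columns

def boundary(partial_row: dict) -> tuple:
    """Project a partial row onto the range columns to get its boundary."""
    return tuple(partial_row[c] for c in RANGE_COLUMNS)

# Each "split row" sets only the range columns; sorting the projected
# tuples yields the boundaries between consecutive range tablets.
split_rows = [{"metric": "cpu", "ts": 0}, {"metric": "mem", "ts": 0}]
boundaries = sorted(boundary(r) for r in split_rows)
print(boundaries)  # [('cpu', 0), ('mem', 0)]
```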

On Fri, May 6, 2016 at 3:08 PM, Misty Stanley-Jones <
mstanleyjones@cloudera.com> wrote:

> Hi Sand,
>
> Please have a look at
> http://getkudu.io/docs/kudu_impala_integration.html#partitioning_tables
> and see if it is helpful to you.
>
> Thanks,
> Misty
>
> On Fri, May 6, 2016 at 2:00 PM, Sand Stone <sa...@gmail.com> wrote:
>
>> Hi, I am new to Kudu. I wonder how the split rows work. I know from some
>> docs, this is currently for pre-creation the table. I am researching how to
>> partition (hash+range) some time series test data.
>>
>> Is there an example? or notes somewhere I could read upon.
>>
>> Thanks much.
>>
>
>

Re: Partition and Split rows

Posted by Misty Stanley-Jones <ms...@cloudera.com>.
Hi Sand,

Please have a look at
http://getkudu.io/docs/kudu_impala_integration.html#partitioning_tables and
see if it is helpful to you.

Thanks,
Misty

On Fri, May 6, 2016 at 2:00 PM, Sand Stone <sa...@gmail.com> wrote:

> Hi, I am new to Kudu. I wonder how the split rows work. I know from some
> docs, this is currently for pre-creation the table. I am researching how to
> partition (hash+range) some time series test data.
>
> Is there an example? or notes somewhere I could read upon.
>
> Thanks much.
>