Posted to dev@carbondata.apache.org by Akash Nilugal <ak...@gmail.com> on 2019/09/23 13:42:48 UTC

[DISCUSSION] Support Time Series for MV datamap and autodatamap loading of timeseries datamaps

Hi Community,

Timeseries data are simply measurements or events that are tracked,
monitored, downsampled, and aggregated over time. Timeseries data
analysis helps in analyzing or monitoring aggregated data over a period
of time to make better business decisions. Since carbondata supports
OLAP datamaps like preaggregate and MV, and since time series is of
utmost importance, we can support timeseries for carbondata over the MV
datamap model.

Currently carbondata supports timeseries on the preaggregate datamap, but
it is an alpha feature, and there are many limitations when we compare it
against existing timeseries databases or projects that support time
series, such as Apache Druid or InfluxDB. So with this feature we can
support timeseries while avoiding the limitations of the current system.
After analyzing the existing timeseries databases like InfluxDB and
Apache Druid, I have prepared a solution/design document. Any inputs,
improvements, or suggestions are most welcome.
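
To make the idea concrete, here is a minimal sketch of what the datamap
DDL could look like from a Spark shell. The property names
('timestamp_column', 'granularity'), the timeseries UDF, and the
table/column names are illustrative assumptions only; the actual syntax
is exactly what this design document is meant to settle.

    import org.apache.spark.sql.SparkSession

    // Assumes a CarbonData-enabled SparkSession; all names are hypothetical.
    val spark = SparkSession.builder().appName("timeseries-dm-sketch").getOrCreate()

    spark.sql(
      """CREATE DATAMAP sales_hourly ON TABLE sales
        |USING 'mv'
        |DMPROPERTIES ('timestamp_column'='order_time', 'granularity'='hour')
        |AS SELECT timeseries(order_time, 'hour') AS ts, sum(quantity) AS total_qty
        |FROM sales
        |GROUP BY timeseries(order_time, 'hour')""".stripMargin)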

I have created JIRA https://issues.apache.org/jira/browse/CARBONDATA-3525 for
this. Later I will create sub-JIRAs for tracking.


Regards,
Akash R Nilugal

Re: [DISCUSSION] Support Time Series for MV datamap and autodatamap loading of timeseries datamaps

Posted by Kunal Kapoor <ku...@gmail.com>.
+1


Regards
Kunal Kapoor

Re: [DISCUSSION] Support Time Series for MV datamap and autodatamap loading of timeseries datamaps

Posted by Kumar Vishal <ku...@gmail.com>.
+1
-Regards
Kumar Vishal

Re: [DISCUSSION] Support Time Series for MV datamap and autodatamap loading of timeseries datamaps

Posted by Akash Nilugal <ak...@gmail.com>.
Hi All,

Based on further analysis of Druid and InfluxDB, the current design fails to cover late-arriving data in the load. So I have updated the design document to support late data and attached it in the JIRA. Please help review it; suggestions are welcome.

Regards,
Akash

Re: [DISCUSSION] Support Time Series for MV datamap and autodatamap loading of timeseries datamaps

Posted by Akash Nilugal <ak...@gmail.com>.
Hi Manhua,

Thanks for the questions. Please find my comments:
1. The user can specify the granularity as, say, minute; we take that as 1 unit (1 minute in this case) and store the data. Then, during query, in the UDF the user can state, for a minute-level query, how many minutes of data he wants (a sketch follows below).

2. Since we will be loading data into the datamap from the fact table, there is no restriction on it.
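
As an illustration of point 1 (the UDF name, window syntax, and
table/column names here are assumptions, not finalized), a 5-minute
window query could look like this:

    import org.apache.spark.sql.SparkSession

    // Assumes a CarbonData-enabled SparkSession; names are hypothetical.
    val spark = SparkSession.builder().getOrCreate()

    spark.sql(
      """SELECT timeseries(event_time, '5_minute') AS ts, sum(metric_value) AS total
        |FROM fact_table
        |GROUP BY timeseries(event_time, '5_minute')
        |ORDER BY ts""".stripMargin).show()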

Regards,
Akash

Re: [DISCUSSION] Support Time Series for MV datamap and autodatamap loading of timeseries datamaps

Posted by Manhua <ke...@gmail.com>.
Hi Akash

Can the user specify the granularity, such as 5 minutes or 15 minutes?

Is there any constraint on timestamp_column's datatype, e.g. DATE, TIMESTAMP, or BIGINT (Unix timestamp)?

Re: [DISCUSSION] Support Time Series for MV datamap and autodatamap loading of timeseries datamaps

Posted by Akash Nilugal <ak...@gmail.com>.
Hi Vishal,

I got your point; I have changed the design accordingly and updated the document in the JIRA. Please check.

Regards,
Akash R Nilugal

On 2019/09/30 17:09:44, Kumar Vishal <ku...@gmail.com> wrote: 
> Hi Akash,
> 
> In this design document you haven't mentioned how to handle data loading
> for the timeseries datamap for older segments [existing table].
> If the customer's main table data is also stored based on time [increasing
> time] in different segments, he can use this feature as well.
> 
> We can discuss and finalize the solution.
> 
> -Regards
> Kumar Vishal
> 
> On Mon, Sep 30, 2019 at 2:42 PM Akash Nilugal <ak...@gmail.com>
> wrote:
> 
> > Hi Ajantha,
> >
> > Thanks for the queries and suggestions
> >
> > 1. Yes, this is a good suggestion; I'll include this change. Both date and
> > timestamp columns are supported; this will be updated in the document.
> > 2. Yes, you are right.
> > 3. You are right: if the day level is not available, then we will try to
> > get the whole day's data from the hour level; if not available, as explained
> > in the design document, we will get the data from the datamap UNION data from
> > the main table, based on the user query.
> >
> > Regards,
> > Akash R Nilugal
> >
> >
> > On 2019/09/30 06:56:45, Ajantha Bhat <aj...@gmail.com> wrote:
> > > + 1 ,
> > >
> > > I have some suggestions and questions.
> > >
> > > 1. In DMPROPERTIES, instead of 'timestamp_column' I suggest using
> > > 'timeseries_column',
> > > so that it won't give the impression that only the timestamp datatype is
> > > supported; also, update the document with all the datatypes supported.
> > >
> > > 2. Querying on this datamap table is also supported, right? Is supporting
> > > changing the plan for the main table to refer to the datamap table meant to
> > > save the user from changing his query, or is there any other reason?
> > >
> > > 3. If the user has not created a day-granularity datamap, but just created an
> > > hour-granularity datamap: when a query has day granularity, will data be
> > > fetched from the hour-granularity datamap and aggregated, or is data fetched
> > > from the main table?
> > >
> > > Thanks,
> > > Ajantha
> > >
> > > On Mon, Sep 30, 2019 at 11:46 AM Akash Nilugal <ak...@gmail.com>
> > > wrote:
> > >
> > > > Hi xuchuanyin,
> > > >
> > > > Thanks for the comments/Suggestions
> > > >
> > > > 1. Preaggregate is productized, but not timeseries with preaggregate;
> > > > I think you got confused with that, if I'm right.
> > > > 2. Limitations like auto sampling or rollup, which we will be supporting
> > > > now, retention policies, etc.
> > > > 3. segmentTimestampMin: this I will consider in the design.
> > > > 4. RP is added as a separate task; I thought that instead of maintaining two
> > > > variables it is better to maintain one and parse it. But I will consider your
> > > > point based on feasibility during implementation.
> > > > 5. We use an accumulator which takes a list; before writing the index files
> > > > we take the min/max of the timestamp column, fill it into the accumulator, and
> > > > then we can access accumulator.value in the driver after the load is finished.
> > > >
> > > > Regards,
> > > > Akash R Nilugal
> > > >
> > > > On 2019/09/28 10:46:31, xuchuanyin <xu...@apache.org> wrote:
> > > > > Hi Akash, glad to see the feature proposed, and I have some questions
> > > > > about this. Please note that some of the following descriptions are
> > > > > comments, marked by '===', on the design document attached in the
> > > > > corresponding jira.
> > > > >
> > > > > 1.
> > > > > "Currently carbondata supports timeseries on preaggregate datamap, but
> > > > > its an alpha feature"
> > > > > ===
> > > > > It has been some time since the preaggregate datamap was introduced and
> > > > > it is still **alpha**; why is it still not product-ready? Will the new
> > > > > feature also end up in a similar situation?
> > > > >
> > > > > 2.
> > > > > "there are so many limitations when we compare and analyze the existing
> > > > > timeseries database or projects which supports time series like apache
> > > > > druid or influxdb"
> > > > > ===
> > > > > What are the actual limitations? Besides, please give an example of this.
> > > > >
> > > > > 3.
> > > > > "Segment_Timestamp_Min"
> > > > > ===
> > > > > Suggest using a camel-case style like 'segmentTimestampMin'.
> > > > >
> > > > > 4.
> > > > > "RP is way of telling the system, for how long the data should be kept"
> > > > > ===
> > > > > Since the function is simple, I'd suggest using 'retentionTime'=15 and
> > > > > 'timeUnit'='day' instead of 'RP'='15_days' (see the sketch below).
> > > > >
> > > > > 5.
> > > > > "When the data load is called for main table, use an spark accumulator to
> > > > > get the maximum value of timestamp in that load and return to the load."
> > > > > ===
> > > > > How can you get the spark accumulator? The load is launched using
> > > > > loading-by-dataframe, not global-sort-by-spark.
> > > > >
> > > > > 6.
> > > > > For the rest of the content, still reading.
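
A sketch of how the two shapes from point 4 might look (the DDL and
property names are hypothetical and not finalized; only the DMPROPERTIES
line differs between the two alternatives):

    import org.apache.spark.sql.SparkSession

    // Assumes a CarbonData-enabled SparkSession; names are hypothetical.
    val spark = SparkSession.builder().getOrCreate()

    // Combined form discussed in this thread: 'RP'='15_days'.
    spark.sql(
      """CREATE DATAMAP sales_hourly_rp ON TABLE sales
        |USING 'mv'
        |DMPROPERTIES ('timeseries_column'='order_time', 'granularity'='hour',
        |              'RP'='15_days')
        |AS SELECT timeseries(order_time, 'hour') AS ts, sum(quantity) AS total_qty
        |FROM sales
        |GROUP BY timeseries(order_time, 'hour')""".stripMargin)

    // xuchuanyin's split form would replace 'RP'='15_days' with:
    //   'retentionTime'='15', 'timeUnit'='day'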

Re: [DISCUSSION] Support Time Series for MV datamap and autodatamap loading of timeseries datamaps

Posted by Manhua <ke...@gmail.com>.
The UDF might have a performance problem: Spark built-in functions vs. Spark UDFs vs. Hive UDFs differ in performance.
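
To illustrate the difference with a small self-contained sketch (standard
Spark behaviour, not something from the design document): a built-in
expression such as date_trunc stays visible to the Catalyst optimizer,
while a custom Spark UDF is a black box to it:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("udf-perf-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq("2019-10-07 10:26:07").toDF("s").select(to_timestamp($"s").as("ts"))

    // Built-in function: Catalyst sees the expression and can optimize/codegen it.
    df.select(date_trunc("hour", $"ts")).explain()

    // Custom Spark UDF: opaque to the optimizer, typically slower.
    val hourUdf = udf((t: java.sql.Timestamp) =>
      new java.sql.Timestamp(t.getTime - t.getTime % 3600000L))
    df.select(hourUdf($"ts")).explain()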


> > On 2019/10/07 03:03:35, Ravindra Pesala <ra...@gmail.com> wrote: 
> >> Hi Akash,
> >> 
> >> 1. I feel the user providing the granularity is redundant; the user providing the respective UDF in the select query should be enough.
> >> 
> >> 2. I think it is better to add the RP management now itself; otherwise, if you start adding it to the DM properties as a temporary measure, it will never be moved. Better to put in a little more effort and decouple it from datamaps.
> >> 
> >> 3. I feel the accumulator is an added cost; we already have a feature in development to load the datamap immediately after the load happens, so why not use that? If the datamap is already in memory, why do we need min/max at the segment level?
> >> 
> >> 4. I feel there must be some reason why other timeseries DBs do not support a union of data. Consider a scenario where we have data from 1pm to 4:30pm; it means the 4-to-5pm data is still loading. When the user asks for data at the hour level, I feel it is safe to give the data for hours 1, 2, and 3, because the 4pm data is actually not complete. So at least the user comes to know that the 4pm data is not available and can start querying the lower-level data if he needs it.
> >> I think it is better to get some real use cases for how users want this time series data.
> >> 
> >> Regards,
> >> Ravindra.
> >> 
> >>> On 4 Oct 2019, at 9:39 PM, Akash Nilugal <ak...@gmail.com> wrote:
> >>> 
> >>> Hi Ravi,
> >>> 
> >>> 1. I forgot to mention the CTAS query in the create datamap statement; I have updated the document. During create datamap the user can give the granularity, and during query just the UDF. That should be fine, right?
> >>> 2. I think maybe we can mention the RP policy in the DM properties also, and then maybe we provide add RP, drop RP, and alter RP for existing and older datamaps. RP will be taken up as a separate subtask and handled later. That should be fine, I think.
> >>> 3. Actually, consider a scenario where the datamap is already created and then a load happens to the main table; there I use the accumulator to get all the min/max values to the driver, so that I can avoid reading index files in the driver in order to load to the datamap.
> >>>        The other scenario is when the main table already has segments and then the datamap is created; there we will read the index files from each segment to decide the min/max of the timestamp column.
> >>> 4. We are not storing min/max in the main table's table status. We are storing it in the datamap table's table status file, so that it can be used to prepare the plan during the query phase.
> >>> 
> >>> 5. Other timeseries DBs support only getting the data present at the hour or day level, i.e. aggregated data. Since we cannot miss data, the plan is to get the data from higher to lower levels. Maybe it does not make much difference from minute to second, but it makes a difference from year to month, so we cannot avoid aggregations from the main table.
> >>> 
> >>> 
> >>> Regards,
> >>> Akash R Nilugal
> >>> 
> >>> On 2019/10/04 11:35:46, Ravindra Pesala <ra...@gmail.com> wrote: 
> >>>> Hi Akash,
> >>>> 
> >>>> I have the following suggestions.
> >>>> 
> >>>> 1. I think it is redundant to specify the granularity inside create datamap; the user can use the respective granularity UDF in his query, like time(1h) or time(1d), etc.
> >>>> 
> >>>> 2. Better to create separate RP commands and let the user add the RP on the datamap, or even on the main table. It would be more manageable if RP were an independent feature instead of being included in the datamap.
> >>>> 
> >>>> 3. I am not getting why exactly we need an accumulator instead of using the index min/max. Can you explain with a scenario?
> >>>> 
> >>>> 4. Why store min/max at the segment level? We can get it from the datamap also, right?
> >>>> 
> >>>> 5. Are unions of high-granularity tables with low-granularity tables really needed? Is any other time series DB doing it? Or do we have any known use case?
> >>>> 
> >>>> Regards,
> >>>> Ravindra.
> >>>> 
> >>>>> On 1 Oct 2019, at 5:49 PM, Akash Nilugal <ak...@gmail.com> wrote:
> >>>>> 
> >>>>> Hi Babu,
> >>>>> 
> >>>>> Thanks for the inputs. Please find my comments:
> >>>>> 1. I will change Union to Union All (a short illustration follows at the end of this message).
> >>>>> 2. For auto datamap loading, once the data is loaded into the lower-granularity datamap, we load the higher-level datamap from the lower-level datamap. But as per your point, I think you are suggesting loading from the main table itself.
> >>>>> 3. Similar to the 2nd point; whether a configuration is needed or not, we can decide, I think.
> >>>>> 4. a. I think the max of the datamap is required to decide the range for the load, because in failure cases we may need it.
> >>>>> b. This point will be taken care of.
> >>>>> 5. Yes, the dataload is synchronous in the current design; as it is non-lazy, it will happen with the main table load only.
> >>>>> 6. Yes, this will be handled.
> >>>>> 7. Already added a task in the jira.
> >>>>> On 2019/10/01 08:50:05, babu lal jangir <ba...@gmail.com> wrote: 
> >>>>>> Hi Akash, thanks for the Time Series DataMap proposal.
> >>>>>> Please check the points below.
> >>>>>> 
> >>>>>> 1. During query planning, change Union to Union All; otherwise we will lose
> >>>>>> rows if the same value appears.
> >>>>>> 2. Does the system start the load for the next granularity-level table as
> >>>>>> soon as it matches the data condition, or does the next granularity-level
> >>>>>> table have to wait till the current granularity-level table is finished?
> >>>>>> Please handle this if possible.
> >>>>>> 3. Add a configuration to load multiple ranges at a time (across granularity
> >>>>>> tables).
> >>>>>> 4. Please check if the current data load's min/max is enough to find the
> >>>>>> current load. There is no need to refer to the DataMap's min/max, because
> >>>>>> data loading range preparation can go wrong if loading happens from multiple
> >>>>>> drivers. I think the rules below are enough for loading.
> >>>>>>  4.a. Create MV should sync data. On any failure, Rebuild should sync
> >>>>>> again; till then the MV will be disabled.
> >>>>>>  4.b. Each load has independent ranges and should load only those ranges.
> >>>>>> On a failure the MV may go into the disabled state (only if an intermediate
> >>>>>> range's load failed; the last load's failure will NOT disable the MV).
> >>>>>> 5. We can make data loading synchronous, because queries can anyway be
> >>>>>> served from the fact table if any segment is in progress in the Datamap.
> >>>>>> 6. On failures in an intermediate timeseries datamap in the data loading
> >>>>>> pipeline, we can still continue loading the next level's data (ignore if
> >>>>>> already handled). For example:
> >>>>>>  DataMaps: hour, day, month level
> >>>>>>  Load data (10 days): 2018-01-01 01:00:00 to 2018-01-10 01:00:00
> >>>>>>    Failure in the hour level during the below range:
> >>>>>>      2018-01-06 01:00:00 to 2018-01-06 01:00:00
> >>>>>>   At this point the hour level has 5 days of data, so start loading at the
> >>>>>> day level.
> >>>>>> 7. Add a subtask to support loading of in-between missing time (incremental
> >>>>>> but old records, if the timeseries device stopped working for some time).
> >>>>>> On Tue, Oct 1, 2019 at 10:41 AM Akash Nilugal <ak...@gmail.com>
> >>>>>> wrote:
> >>>>>> 
> >>>>>>> Hi Vishal,
> >>>>>>> 
> >>>>>>> In the design document, in the impact analysis section, there is a topic on
> >>>>>>> compatibility/legacy stores: basically, for old tables, when the datamap
> >>>>>>> is created we load all the timeseries datamaps with the different
> >>>>>>> granularities. I think this should do fine; please let me know if you have
> >>>>>>> further suggestions/comments.
> >>>>>>> 
> >>>>>>> Regards,
> >>>>>>> Akash R Nilugal
> >>>>>>> 
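
On the Union vs. Union All point above, a quick self-contained
illustration (made-up literals) of why the rewritten plan must use UNION
ALL: UNION silently de-duplicates, so an aggregated row from the datamap
that happens to equal a row computed from the main table would be
dropped.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("union-sketch").getOrCreate()

    spark.sql("SELECT 1 AS hr, 10 AS cnt UNION SELECT 1 AS hr, 10 AS cnt").show()      // 1 row
    spark.sql("SELECT 1 AS hr, 10 AS cnt UNION ALL SELECT 1 AS hr, 10 AS cnt").show()  // 2 rows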

Re: [DISCUSSION] Support Time Series for MV datamap and autodatamap loading of timeseries datamaps

Posted by Ravindra Pesala <ra...@gmail.com>.
Hi Akash,

1. It is better to keep it simple and let the user provide the UDF he wants in the query. Then there is no need to rewrite the query and no need to provide an extra granularity property.

3. I got your point about why you want to use the accumulator to get the min/max. What worries me is that it should not add complexity to generate the min/max, as we already have this information available. I don't think we should be so bothered about reading min/max in the data loading phase, as it is already a heavy-duty job and adding a few more millis does not do any harm. But as you mentioned it is easier to do, so we can go ahead your way.


Regards,
Ravindra.

Re: [DISCUSSION] Support Time Series for MV datamap and autodatamap loading of timeseries datamaps

Posted by Akash Nilugal <ak...@gmail.com>.
Hi Ravi,

1. i) During create datamap, in the CTAS query the user does not mention the UDF; so if the granularity is present in the DM properties, then internally we rewrite the CTAS query with the UDF and then load the data into the datamap, according to the current design.
   ii) But if we ask the user to give the CTAS query with the UDF only, then internally there is no need to rewrite the query; we can just load data into it and avoid giving the granularity in the DM properties.
	Currently I'm planning to do the first one. Please give your input on this.

2. OK, we will not put the RP management in the DM properties; we will use a separate command and do proper decoupling.

3. I think you are referring to the cache pre-priming in the index server. The problem with this is that we will not be sure whether the cache was loaded for the segment or not, because as per the pre-priming design, if loading into the cache fails after the data load to the main table, we ignore it, as the query takes care of it. So we cannot completely rely on that feature for min/max.
So with the accumulator I'm not calculating anything again; I just take the min/max before writing the index file in the dataload and use that in the driver to prepare the dataload ranges for the datamaps (a sketch follows below).

The reason to keep the segment min/max in the datamap's table status is that it will be helpful in RP scenarios; second, we will not miss any data when loading the datamap from the main table [if the first load brought data from 1 to 4:15, and the next brings data from 5:10 to 6, there is a chance that we miss the 15 minutes of data from 4 to 4:15]. It will be helpful in querying also, so we can avoid the problem I mentioned above with datamaps loaded in the cache.

4. I agree, your point is a valid one. I will do more analysis on this based on the user use cases and then we can decide finally. That would be better.

Please give your inputs/suggestions on the above points.
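
A minimal Scala sketch of the accumulator idea from point 3 (the job
structure and all names are assumptions for illustration; the real load
path differs): each load task records the (min, max) of the timeseries
column just before its index file is written, and the driver reads
accumulator.value afterwards to build the dataload ranges for the
datamaps.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.util.CollectionAccumulator
    import scala.collection.JavaConverters._

    val spark = SparkSession.builder().appName("minmax-accumulator-sketch").getOrCreate()

    // One (min, max) pair per load task, collected on the driver.
    val minMax: CollectionAccumulator[(Long, Long)] =
      spark.sparkContext.collectionAccumulator[(Long, Long)]("tsMinMax")

    // Stand-in for the per-task write path: scan the task's rows, track the
    // min/max of the timeseries column, and record them before the index
    // file for that task would be written.
    spark.range(0, 1000000).rdd.foreachPartition { rows =>
      var mn = Long.MaxValue
      var mx = Long.MinValue
      rows.foreach { v => mn = math.min(mn, v); mx = math.max(mx, v) }
      if (mn <= mx) minMax.add((mn, mx))
    }

    // Driver side, after the load finishes: prepare the dataload ranges for
    // the timeseries datamaps from the collected pairs.
    val ranges = minMax.value.asScala.toList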

regards,
Akash R Nilugal

On 2019/10/07 03:03:35, Ravindra Pesala <ra...@gmail.com> wrote: 
> HI Akash,
> 
> 1. I feel user providing granularity is redundant, he can just provide respective udf in select query should be enough.
> 
> 2. I think it is better to add the RP management now itself, otherwise if you start adding to DM properties as temporary then it will never be moved. Better put little more effort to decouple it from datamaps.
> 
> 3. I feel accumulator is a added cost, we already have feature in development to load datamap immediately after load happens, why not use that? If the datamap is already in memory why we need min/max at segment level?
> 
> 4. I feel there must be some reason why other timeseries db does not support union of data.  Consider a scenario that we have data from 1pm to 4.30 pm , it means 4 to 5pm data is still loading.  when user asks the data at hour level I feel it is safe to give data for 1,2,3 hours data, because providing 4pm is actually not a complete data. So atleast user comes to know that 4 pm data is not available and starts querying the low level data if he needs it.
> I think better get some real uses how user wants this time series data.
> 
> Regards,
> Ravindra.
> 
> > On 4 Oct 2019, at 9:39 PM, Akash Nilugal <ak...@gmail.com> wrote:
> > 
> > Hi Ravi,
> > 
> > 1. I forgot to mention the CTAS query in the create datamap statement, i have updated the document, during create datamap user can give granularity, during query just the UDF. That should be fine right.
> > 2. I think may be we can mention the RP policy in DM properties also, and then may be we provide add RP, drop RP, alter RP for existing and older datamaps. RP will be taken as a separate subtask and will be handled in later part. That should be fine i tink.
> > 3. Actually consider a scenario when datamap is already created, then load happened to main table, then i use accumulator to get all the min max to driver, so that i can avoid reading index file in driver in order to load to datamap. 
> >         other scenario is when main table already has segments and then datamap is created, the we will read index files from each segments to decide the min max of timestamp column.
> > 4. We are not storing min max in main table  table status. We are storing in datamap table's table status file, so that it will be used to prepare the plan during the query phase.
> > 
> > 5. Other timeseries db supports only getting the data present in hour or day .. aggregated data. Since we cannot miss the data, plan is to get the data like higher to lower. May be it does not make much difference when its from minute to second, but it makes difference from year to month , so that we cannot avoid aggregations from main table.
> > 
> > 
> > Regards,
> > Akash R Nilugal
> > 

Re: [DISCUSSION] Support Time Series for MV datamap and autodatamap loading of timeseries datamaps

Posted by Ravindra Pesala <ra...@gmail.com>.
Hi Akash,

1. I feel that the user providing granularity is redundant; he can just provide the respective granularity UDF in his select query, which should be enough.

2. I think it is better to add the RP management now itself; otherwise, if you start adding it to the DM properties as a temporary measure, it will never be moved. Better to put a little more effort into decoupling it from datamaps.

3. I feel the accumulator is an added cost. We already have a feature in development to load the datamap immediately after the main table load happens, so why not use that? And if the datamap is already in memory, why do we need min/max at the segment level?

4. I feel there must be some reason why other timeseries DBs do not support the union of data. Consider a scenario where we have data from 1 pm to 4.30 pm; it means the 4 to 5 pm data is still loading. When the user asks for data at the hour level, I feel it is safe to give the data for the 1, 2 and 3 o'clock hours, because the 4 pm bucket is actually not complete data. That way the user at least comes to know that the 4 pm data is not available yet and can start querying the lower-level data if he needs it; a small sketch of this check follows below.
I think it is better to gather some real use cases for how users want this time series data.
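
To make the completeness check concrete, here is a minimal sketch; isCompleteHour and the sample timestamps are only illustrative assumptions, not an existing API:

import java.sql.Timestamp

// An hour bucket [start, start + 1h) is complete only once data has been
// loaded past its end.
def isCompleteHour(bucketStart: Timestamp, loadedMax: Timestamp): Boolean =
  loadedMax.toLocalDateTime.isAfter(bucketStart.toLocalDateTime.plusHours(1))

// Data is loaded up to 16:30, so the 15:00 bucket is served from the hour
// level but the 16:00 bucket is withheld, telling the user to query the
// lower-level data for it.
val loadedMax = Timestamp.valueOf("2018-01-01 16:30:00")
println(isCompleteHour(Timestamp.valueOf("2018-01-01 15:00:00"), loadedMax)) // true
println(isCompleteHour(Timestamp.valueOf("2018-01-01 16:00:00"), loadedMax)) // false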

Regards,
Ravindra.


Re: [DISCUSSION] Support Time Series for MV datamap and autodatamap loading of timeseries datamaps

Posted by Akash Nilugal <ak...@gmail.com>.
Hi Ravi,

1. I forgot to mention the CTAS query in the create datamap statement; I have updated the document. During create datamap the user can give the granularity, and during the query just the UDF. That should be fine, right?
2. I think maybe we can mention the RP policy in the DM properties also, and then maybe provide add RP, drop RP and alter RP for existing and older datamaps. RP will be taken up as a separate subtask and handled in a later part. That should be fine, I think.
3. Actually, consider a scenario where the datamap is already created and then a load happens on the main table: there I use the accumulator to get all the min/max values to the driver, so that I can avoid reading the index files in the driver in order to load the datamap (a rough sketch follows at the end of this list).
         The other scenario is when the main table already has segments and then the datamap is created: there we will read the index files from each segment to decide the min/max of the timestamp column.
4. We are not storing min/max in the main table's table status. We are storing it in the datamap table's table status file, so that it can be used to prepare the plan during the query phase.

5. Other timeseries DBs support getting only the data already present at the hour or day level, i.e. aggregated data. Since we cannot miss any data, the plan is to fetch the data from higher to lower granularity. Maybe it does not make much difference going from minute to second, but it does make a difference from year to month, so we cannot avoid aggregations from the main table.
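
To make point 3 concrete, here is a rough sketch of the accumulator idea; the accumulator name, the sample epoch-millis values and the write path are only illustrative, not the actual CarbonData loading code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.util.CollectionAccumulator

val spark = SparkSession.builder().appName("minmax-sketch").getOrCreate()

// One accumulator collecting a (min, max) pair per load task.
val minMaxAcc: CollectionAccumulator[(Long, Long)] =
  spark.sparkContext.collectionAccumulator[(Long, Long)]("timeseriesMinMax")

// Stand-in for the per-partition write path; the values are the timeseries
// column as epoch millis.
val data = spark.sparkContext.parallelize(Seq(1L, 5L, 3L, 9L), numSlices = 2)
data.foreachPartition { rows =>
  val ts = rows.toSeq
  if (ts.nonEmpty) minMaxAcc.add((ts.min, ts.max)) // before index files are written
}

// Driver side after the load: overall min/max without re-reading index files.
import scala.collection.JavaConverters._
val entries = minMaxAcc.value.asScala
println(s"load range: ${entries.map(_._1).min} .. ${entries.map(_._2).max}")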


Regards,
Akash R Nilugal


Re: [DISCUSSION] Support Time Series for MV datamap and autodatamap loading of timeseries datamaps

Posted by Ravindra Pesala <ra...@gmail.com>.
Hi Akash,

I have the following suggestions.

1. I think it is redundant to use granularity inside create datamap; the user can use the respective granularity UDF in his query, like time(1h) or time(1d) (a sketch follows at the end of this list).

2. Better to create separate RP commands and let the user add the RP on the datamap, or even on the main table. It would be more manageable if you make RP an independent feature instead of including it in the datamap.

3. I am not getting why exactly we need the accumulator instead of using the index min/max. Can you explain with some scenario?

4. Why store min/max at the segment level? We can get it from the datamap also, right?

5. Is the union of high granularity tables with low granularity tables really needed? Is any other time series DB doing it? Or do we have any known use case?
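
For illustration, point 1 would look roughly like this; timeseries() is a placeholder UDF name and the table/column names are made up:

import org.apache.spark.sql.SparkSession

val carbon = SparkSession.builder().appName("udf-sketch").getOrCreate()

// Hour-level rollup driven purely by the UDF in the select query:
carbon.sql(
  """SELECT timeseries(event_time, '1h') AS ts, SUM(metric_value)
    |FROM sensor_events
    |GROUP BY timeseries(event_time, '1h')""".stripMargin).show()

// Day level: same table and query shape, only the UDF argument changes.
carbon.sql(
  """SELECT timeseries(event_time, '1d') AS ts, SUM(metric_value)
    |FROM sensor_events
    |GROUP BY timeseries(event_time, '1d')""".stripMargin).show()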

Regards,
Ravindra.


Re: [DISCUSSION] Support Time Series for MV datamap and autodatamap loading of timeseries datamaps

Posted by Akash Nilugal <ak...@gmail.com>.
Hi Babu,

Thanks for the inputs. Please find the comments below.
1. I will change from Union to Union All.
2. For auto datamap loading, once the data is loaded to the lower granularity datamap, we load the higher level datamap from that lower level datamap (see the sketch after this list). But as per your point, I think you are suggesting to load from the main table itself.
3. Similar to the 2nd point; I think we can decide whether a configuration is needed or not.
4. a. I think the max of the datamap is required to decide the range for the load, because we may need it in failure cases.
b. This point will be taken care of.
5. Yes, data load is synchronous in the current design; as it is non-lazy, it will happen with the main table load only.
6. Yes, this will be handled.
7. Already added a task in jira.
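
For point 2, here is a minimal sketch of the rollup idea; the table names dm_sales_hour/dm_sales_day and the date_trunc-based query are only illustrative, not the actual generated datamap tables:

import org.apache.spark.sql.SparkSession

val carbon = SparkSession.builder().appName("rollup-sketch").getOrCreate()

// Roll the hour-level rows of one completed day up into the day-level
// table, instead of rescanning the main table.
carbon.sql(
  """INSERT INTO dm_sales_day
    |SELECT date_trunc('day', ts_hour) AS ts_day, SUM(agg_value)
    |FROM dm_sales_hour
    |WHERE ts_hour >= '2018-01-01 00:00:00' AND ts_hour < '2018-01-02 00:00:00'
    |GROUP BY date_trunc('day', ts_hour)""".stripMargin)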

Re: [DISCUSSION] Support Time Series for MV datamap and autodatamap loading of timeseries datamaps

Posted by babu lal jangir <ba...@gmail.com>.
Hi Akash, thanks for the Time Series DataMap proposal.
Please check the points below.

1. During query planning, change Union to Union All; otherwise rows will be
lost if the same value appears.
2. Does the system start the load for the next granularity level table as soon
as the data condition is matched, or does the next granularity level table have
to wait until the current granularity level table is finished? Please handle
this if possible.
3. Add a configuration to load multiple ranges at a time (across granularity
tables).
4. Please check whether the current data load's min/max is enough to determine
the current load. There should be no need to refer to the DataMap's min/max,
because data loading range preparation can go wrong if loading happens from
multiple drivers. I think the rules below are enough for loading.
    4.a. Create MV should sync the data. On any failure, Rebuild should sync
it again; until then the MV will be disabled.
    4.b. Each load has independent ranges and should load only those ranges.
On a failure the MV may go into the disabled state (only if an intermediate
range's load fails; the last load's failure will NOT disable the MV).
5. We can make data loading synchronous, because queries can anyway be served
from the fact table while any segment is in progress in the DataMap.
6. On failures in an intermediate time series datamap in the data loading
pipeline, we can still continue loading the next level's data (ignore if
already handled); a sketch follows after this list.
   For example:
    DataMaps: hour, day and month level
    Load data (10 days): 2018-01-01 01:00:00 to 2018-01-10 01:00:00
    Failure at the hour level during the range
      2018-01-06 01:00:00 to 2018-01-06 01:00:00
    At this point the hour level has 5 days of data, so start loading the day
level.
7. Add a subtask to support loading of in-between missing time (incremental
but old records, e.g. if the timeseries device stopped working for some time).
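
A rough sketch of the point 6 idea; the LoadedSpan type and the dates are only illustrative:

import java.time.LocalDate

// Hour-level loads that succeeded, expressed as inclusive day spans.
case class LoadedSpan(from: LocalDate, to: LocalDate)

// Every day fully covered by a successful span can be rolled up to the day
// level, even if another range failed at the hour level.
def loadableDays(spans: Seq[LoadedSpan]): Seq[LocalDate] =
  spans.flatMap { s =>
    Iterator.iterate(s.from)(_.plusDays(1)).takeWhile(!_.isAfter(s.to)).toSeq
  }.distinct.sortBy(_.toEpochDay)

// Hour level succeeded for Jan 1-5 and Jan 7-10; Jan 6 failed.
val ok = Seq(
  LoadedSpan(LocalDate.parse("2018-01-01"), LocalDate.parse("2018-01-05")),
  LoadedSpan(LocalDate.parse("2018-01-07"), LocalDate.parse("2018-01-10")))

// Day-level loading proceeds for these 9 days without waiting for Jan 6.
println(loadableDays(ok))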


Re: [DISCUSSION] Support Time Series for MV datamap and autodatamap loading of timeseries datamaps

Posted by Akash Nilugal <ak...@gmail.com>.
Hi Vishal,

In the design document, in the impact analysis section, there is a topic on compatibility/legacy stores. Basically, for old tables, when the datamap is created, we load all the timeseries datamaps with the different granularities. I think this should do fine; please let me know if you have further suggestions/comments. A rough sketch of the backfill idea follows.
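For illustration only, the backfill could look roughly like the sketch below (the helper functions are hypothetical stubs, not actual CarbonData APIs):

    // Scala sketch: replay every committed segment of an existing table into
    // each granularity level when the timeseries datamap is created.
    def getCommittedSegments(table: String): Seq[String] =
      Seq("0", "1", "2") // stub: would be read from the table status file
    def loadTimeseriesDatamap(table: String, segment: String, granularity: String): Unit =
      println(s"loading $granularity datamap for $table, segment $segment") // stub

    val granularities = Seq("hour", "day", "month", "year")
    for (g <- granularities; seg <- getCommittedSegments("maintable")) {
      loadTimeseriesDatamap("maintable", seg, g)
    }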

Regards,
Akash R Nilugal

On 2019/09/30 17:09:44, Kumar Vishal <ku...@gmail.com> wrote: 
> Hi Akash,
> 
> In this design document you haven't mentioned how to handle data loading
> of the timeseries datamap for older segments [existing table].
> If the customer's main table data is also stored based on time [increasing
> time] in different segments, he can use this feature as well.
> 
> We can discuss and finalize the solution.
> 
> -Regards
> Kumar Vishal
> 
> On Mon, Sep 30, 2019 at 2:42 PM Akash Nilugal <ak...@gmail.com>
> wrote:
> 
> > Hi Ajantha,
> >
> > Thanks for the queries and suggestions
> >
> > 1. Yes, this is a good suggestion; I'll include this change. Both date and
> > timestamp columns are supported; this will be updated in the document.
> > 2. Yes, you are right.
> > 3. You are right: if the day level is not available, we will try to get the
> > whole day's data from the hour level; if that is not available, as explained
> > in the design document, we will get the data from the datamap UNION the data
> > from the main table, based on the user query.
> >
> > Regards,
> > Akash R Nilugal
> >
> >
> > On 2019/09/30 06:56:45, Ajantha Bhat <aj...@gmail.com> wrote:
> > > +1,
> > >
> > > I have some suggestions and questions.
> > >
> > > 1. In DMPROPERTIES, instead of 'timestamp_column' I suggest using
> > > 'timeseries_column', so that it won't give the impression that only the
> > > timestamp datatype is supported; also, please update the document with
> > > all the supported datatypes.
> > >
> > > 2. Querying on this datamap table is also supported, right? Is rewriting
> > > the main-table plan to refer to the datamap table meant only to spare the
> > > user from changing his query, or is there any other reason?
> > >
> > > 3. If the user has not created a day-granularity datamap, but only an
> > > hour-granularity datamap: when a query has day granularity, will data be
> > > fetched from the hour-granularity datamap and aggregated, or will it be
> > > fetched from the main table?
> > >
> > > Thanks,
> > > Ajantha
> > >
> > > On Mon, Sep 30, 2019 at 11:46 AM Akash Nilugal <ak...@gmail.com>
> > > wrote:
> > >
> > > > Hi xuchuanyin,
> > > >
> > > > Thanks for the comments/Suggestions
> > > >
> > > > 1. Preaggregate is productized, but not timeseries with preaggregate;
> > > > I think you got confused with that, if I'm right.
> > > > 2. Limitations like auto sampling/rollup (which we will be supporting
> > > > now), retention policies, etc.
> > > > 3. segmentTimestampMin: I will consider this in the design.
> > > > 4. RP is added as a separate task; I thought that instead of maintaining
> > > > two variables, it is better to maintain one and parse it. But I will
> > > > consider your point based on feasibility during implementation.
> > > > 5. We use an accumulator which takes a list, so before writing the index
> > > > files we take the min and max of the timestamp column, fill them into the
> > > > accumulator, and then we can access accumulator.value in the driver after
> > > > the load is finished.
> > > >
> > > > Regards,
> > > > Akash R Nilugal
> > > >
> > > > On 2019/09/28 10:46:31, xuchuanyin <xu...@apache.org> wrote:
> > > > > Hi Akash, glad to see the feature proposed; I have some questions
> > > > > about this. Please note that each of the following items quotes the
> > > > > design document attached in the corresponding jira, with my comments
> > > > > after '==='.
> > > > >
> > > > > 1.
> > > > > "Currently carbondata supports timeseries on preaggregate datamap, but
> > > > > its an alpha feature"
> > > > > ===
> > > > > It has been some time since the preaggregate datamap was introduced,
> > > > > and it is still **alpha**; why is it still not product-ready? Will the
> > > > > new feature also end up in a similar situation?
> > > > >
> > > > > 2.
> > > > > "there are so many limitations when we compare and analyze the existing
> > > > > timeseries database or projects which supports time series like apache
> > > > > druid or influxdb"
> > > > > ===
> > > > > What are the actual limitations? Besides, please give an example of this.
> > > > >
> > > > > 3.
> > > > > "Segment_Timestamp_Min"
> > > > > ===
> > > > > I suggest using a camel-case style, like 'segmentTimestampMin'.
> > > > >
> > > > > 4.
> > > > > "RP is way of telling the system, for how long the data should be kept"
> > > > > ===
> > > > > Since the function is simple, I'd suggest using 'retentionTime'=15 and
> > > > > 'timeUnit'='day' instead of 'RP'='15_days'.
> > > > >
> > > > > 5.
> > > > > "When the data load is called for main table, use an spark accumulator
> > > > > to get the maximum value of timestamp in that load and return to the
> > > > > load."
> > > > > ===
> > > > > How can you get the Spark accumulator? The load is launched using
> > > > > loading-by-dataframe, not using global-sort-by-spark.
> > > > >
> > > > > 6.
> > > > > For the rest of the content, still reading.
> > > > >
> > > > > --
> > > > > Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> > > > >
> > > >
> > >
> >
> 

Re: [DISCUSSION] Support Time Series for MV datamap and autodatamap loading of timeseries datamaps

Posted by Kumar Vishal <ku...@gmail.com>.
Hi Akash,

In this design document you haven't mentioned how to handle data loading
of the timeseries datamap for older segments [existing table].
If the customer's main table data is also stored based on time [increasing
time] in different segments, he can use this feature as well.

We can discuss and finalize the solution.

-Regards
Kumar Vishal

On Mon, Sep 30, 2019 at 2:42 PM Akash Nilugal <ak...@gmail.com>
wrote:

> Hi Ajantha,
>
> Thanks for the queries and suggestions
>
> 1. Yes, this is a good suggestion; I'll include this change. Both date and
> timestamp columns are supported; this will be updated in the document.
> 2. Yes, you are right.
> 3. You are right: if the day level is not available, we will try to get the
> whole day's data from the hour level; if that is not available, as explained
> in the design document, we will get the data from the datamap UNION the data
> from the main table, based on the user query.
>
> Regards,
> Akash R Nilugal
>
>
> On 2019/09/30 06:56:45, Ajantha Bhat <aj...@gmail.com> wrote:
> > +1,
> >
> > I have some suggestions and questions.
> >
> > 1. In DMPROPERTIES, instead of 'timestamp_column' I suggest using
> > 'timeseries_column', so that it won't give the impression that only the
> > timestamp datatype is supported; also, please update the document with
> > all the supported datatypes.
> >
> > 2. Querying on this datamap table is also supported, right? Is rewriting
> > the main-table plan to refer to the datamap table meant only to spare the
> > user from changing his query, or is there any other reason?
> >
> > 3. If the user has not created a day-granularity datamap, but only an
> > hour-granularity datamap: when a query has day granularity, will data be
> > fetched from the hour-granularity datamap and aggregated, or will it be
> > fetched from the main table?
> >
> > Thanks,
> > Ajantha
> >
> > On Mon, Sep 30, 2019 at 11:46 AM Akash Nilugal <ak...@gmail.com>
> > wrote:
> >
> > > Hi xuchuanyin,
> > >
> > > Thanks for the comments/Suggestions
> > >
> > > 1. Preaggregate is productized, but not timeseries with preaggregate;
> > > I think you got confused with that, if I'm right.
> > > 2. Limitations like auto sampling/rollup (which we will be supporting
> > > now), retention policies, etc.
> > > 3. segmentTimestampMin: I will consider this in the design.
> > > 4. RP is added as a separate task; I thought that instead of maintaining
> > > two variables, it is better to maintain one and parse it. But I will
> > > consider your point based on feasibility during implementation.
> > > 5. We use an accumulator which takes a list, so before writing the index
> > > files we take the min and max of the timestamp column, fill them into the
> > > accumulator, and then we can access accumulator.value in the driver after
> > > the load is finished.
> > >
> > > Regards,
> > > Akash R Nilugal
> > >
> > > On 2019/09/28 10:46:31, xuchuanyin <xu...@apache.org> wrote:
> > > > Hi Akash, glad to see the feature proposed; I have some questions
> > > > about this. Please note that each of the following items quotes the
> > > > design document attached in the corresponding jira, with my comments
> > > > after '==='.
> > > >
> > > > 1.
> > > > "Currently carbondata supports timeseries on preaggregate datamap, but
> > > > its an alpha feature"
> > > > ===
> > > > It has been some time since the preaggregate datamap was introduced,
> > > > and it is still **alpha**; why is it still not product-ready? Will the
> > > > new feature also end up in a similar situation?
> > > >
> > > > 2.
> > > > "there are so many limitations when we compare and analyze the existing
> > > > timeseries database or projects which supports time series like apache
> > > > druid or influxdb"
> > > > ===
> > > > What are the actual limitations? Besides, please give an example of this.
> > > >
> > > > 3.
> > > > "Segment_Timestamp_Min"
> > > > ===
> > > > I suggest using a camel-case style, like 'segmentTimestampMin'.
> > > >
> > > > 4.
> > > > "RP is way of telling the system, for how long the data should be kept"
> > > > ===
> > > > Since the function is simple, I'd suggest using 'retentionTime'=15 and
> > > > 'timeUnit'='day' instead of 'RP'='15_days'.
> > > >
> > > > 5.
> > > > "When the data load is called for main table, use an spark accumulator
> > > > to get the maximum value of timestamp in that load and return to the
> > > > load."
> > > > ===
> > > > How can you get the Spark accumulator? The load is launched using
> > > > loading-by-dataframe, not using global-sort-by-spark.
> > > >
> > > > 6.
> > > > For the rest of the content, still reading.
> > > >
> > > > --
> > > > Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> > > >
> > >
> >
>

Re: [DISCUSSION] Support Time Series for MV datamap and autodatamap loading of timeseries datamaps

Posted by Akash Nilugal <ak...@gmail.com>.
Hi Ajantha,

Thanks for the queries and suggestions

1. Yes, this is a good suggestion; I'll include this change. Both date and timestamp columns are supported; this will be updated in the document.
2. Yes, you are right.
3. You are right: if the day level is not available, we will try to get the whole day's data from the hour level; if that is not available, as explained in the design document, we will get the data from the datamap UNION the data from the main table, based on the user query. A small sketch of this rewrite follows.
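To make this concrete, here is an illustrative sketch of the intended rewrite, written as plain Spark SQL strings in Scala. The table, column and datamap names, and the '<last-synced-ts>' placeholder, are made up for illustration; this is not final syntax.

    // Assuming a CarbonData-enabled SparkSession `spark`.
    // The user's day-level query on the main table:
    val userQuery =
      """SELECT timeseries(event_time, 'day') AS d, SUM(amount)
        |FROM sales
        |GROUP BY timeseries(event_time, 'day')""".stripMargin

    // Conceptual rewrite when only an hour-level datamap exists: roll up the
    // hour-level aggregates, UNIONed with the main-table rows the datamap
    // does not cover yet.
    val rewritten =
      """SELECT timeseries(ts_hour, 'day') AS d, SUM(agg_amount)
        |FROM (
        |  SELECT ts_hour, agg_amount FROM datamap_sales_hour
        |  UNION ALL
        |  SELECT event_time AS ts_hour, amount AS agg_amount
        |  FROM sales WHERE event_time > '<last-synced-ts>'
        |) t
        |GROUP BY timeseries(ts_hour, 'day')""".stripMargin
    // `rewritten` is what the planner would execute in place of `userQuery`.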

Regards,
Akash R Nilugal


On 2019/09/30 06:56:45, Ajantha Bhat <aj...@gmail.com> wrote: 
> +1,
>
> I have some suggestions and questions.
>
> 1. In DMPROPERTIES, instead of 'timestamp_column' I suggest using
> 'timeseries_column', so that it won't give the impression that only the
> timestamp datatype is supported; also, please update the document with
> all the supported datatypes.
>
> 2. Querying on this datamap table is also supported, right? Is rewriting
> the main-table plan to refer to the datamap table meant only to spare the
> user from changing his query, or is there any other reason?
>
> 3. If the user has not created a day-granularity datamap, but only an
> hour-granularity datamap: when a query has day granularity, will data be
> fetched from the hour-granularity datamap and aggregated, or will it be
> fetched from the main table?
> 
> Thanks,
> Ajantha
> 
> On Mon, Sep 30, 2019 at 11:46 AM Akash Nilugal <ak...@gmail.com>
> wrote:
> 
> > Hi xuchuanyin,
> >
> > Thanks for the comments/Suggestions
> >
> > 1. Preaggregate is productized, but not timeseries with preaggregate;
> > I think you got confused with that, if I'm right.
> > 2. Limitations like auto sampling/rollup (which we will be supporting
> > now), retention policies, etc.
> > 3. segmentTimestampMin: I will consider this in the design.
> > 4. RP is added as a separate task; I thought that instead of maintaining
> > two variables, it is better to maintain one and parse it. But I will
> > consider your point based on feasibility during implementation.
> > 5. We use an accumulator which takes a list, so before writing the index
> > files we take the min and max of the timestamp column, fill them into the
> > accumulator, and then we can access accumulator.value in the driver after
> > the load is finished.
> >
> > Regards,
> > Akash R Nilugal
> >
> > On 2019/09/28 10:46:31, xuchuanyin <xu...@apache.org> wrote:
> > > Hi Akash, glad to see the feature proposed; I have some questions
> > > about this. Please note that each of the following items quotes the
> > > design document attached in the corresponding jira, with my comments
> > > after '==='.
> > >
> > > 1.
> > > "Currently carbondata supports timeseries on preaggregate datamap, but
> > > its an alpha feature"
> > > ===
> > > It has been some time since the preaggregate datamap was introduced,
> > > and it is still **alpha**; why is it still not product-ready? Will the
> > > new feature also end up in a similar situation?
> > >
> > > 2.
> > > "there are so many limitations when we compare and analyze the existing
> > > timeseries database or projects which supports time series like apache
> > > druid or influxdb"
> > > ===
> > > What are the actual limitations? Besides, please give an example of this.
> > >
> > > 3.
> > > "Segment_Timestamp_Min"
> > > ===
> > > I suggest using a camel-case style, like 'segmentTimestampMin'.
> > >
> > > 4.
> > > "RP is way of telling the system, for how long the data should be kept"
> > > ===
> > > Since the function is simple, I'd suggest using 'retentionTime'=15 and
> > > 'timeUnit'='day' instead of 'RP'='15_days'.
> > >
> > > 5.
> > > "When the data load is called for main table, use an spark accumulator
> > > to get the maximum value of timestamp in that load and return to the
> > > load."
> > > ===
> > > How can you get the Spark accumulator? The load is launched using
> > > loading-by-dataframe, not using global-sort-by-spark.
> > >
> > > 6.
> > > For the rest of the content, still reading.
> > >
> > > --
> > > Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> > >
> >
> 

Re: [DISCUSSION] Support Time Series for MV datamap and autodatamap loading of timeseries datamaps

Posted by Ajantha Bhat <aj...@gmail.com>.
+1,

I have some suggestions and questions.

1. In DMPROPERTIES, instead of 'timestamp_column' I suggest using
'timeseries_column', so that it won't give the impression that only the
timestamp datatype is supported; also, please update the document with all
the supported datatypes. A possible shape is sketched below.
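For illustration, a DDL shape with the renamed property could look like this sketch; the 'granularity' key and the table/column names are my assumptions, not final syntax:

    // Scala sketch, assuming a CarbonData-enabled SparkSession `spark`.
    spark.sql(
      """CREATE DATAMAP sales_agg_hour ON TABLE sales
        |USING 'mv'
        |DMPROPERTIES (
        |  'timeseries_column' = 'event_time',
        |  'granularity' = 'hour')
        |AS SELECT timeseries(event_time, 'hour'), SUM(amount)
        |   FROM sales
        |   GROUP BY timeseries(event_time, 'hour')""".stripMargin)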

2. Querying on this datamap table is also supported, right? Is rewriting the
main-table plan to refer to the datamap table meant only to spare the user
from changing his query, or is there any other reason?

3. If the user has not created a day-granularity datamap, but only an
hour-granularity datamap: when a query has day granularity, will data be
fetched from the hour-granularity datamap and aggregated, or will it be
fetched from the main table?

Thanks,
Ajantha

On Mon, Sep 30, 2019 at 11:46 AM Akash Nilugal <ak...@gmail.com>
wrote:

> Hi xuchuanyin,
>
> Thanks for the comments/Suggestions
>
> 1. Preaggregate is productized, but not timeseries with preaggregate;
> I think you got confused with that, if I'm right.
> 2. Limitations like auto sampling/rollup (which we will be supporting
> now), retention policies, etc.
> 3. segmentTimestampMin: I will consider this in the design.
> 4. RP is added as a separate task; I thought that instead of maintaining
> two variables, it is better to maintain one and parse it. But I will
> consider your point based on feasibility during implementation.
> 5. We use an accumulator which takes a list, so before writing the index
> files we take the min and max of the timestamp column, fill them into the
> accumulator, and then we can access accumulator.value in the driver after
> the load is finished.
>
> Regards,
> Akash R Nilugal
>
> On 2019/09/28 10:46:31, xuchuanyin <xu...@apache.org> wrote:
> > Hi Akash, glad to see the feature proposed; I have some questions
> > about this. Please note that each of the following items quotes the
> > design document attached in the corresponding jira, with my comments
> > after '==='.
> >
> > 1.
> > "Currently carbondata supports timeseries on preaggregate datamap, but
> > its an alpha feature"
> > ===
> > It has been some time since the preaggregate datamap was introduced,
> > and it is still **alpha**; why is it still not product-ready? Will the
> > new feature also end up in a similar situation?
> >
> > 2.
> > "there are so many limitations when we compare and analyze the existing
> > timeseries database or projects which supports time series like apache
> > druid or influxdb"
> > ===
> > What are the actual limitations? Besides, please give an example of this.
> >
> > 3.
> > "Segment_Timestamp_Min"
> > ===
> > I suggest using a camel-case style, like 'segmentTimestampMin'.
> >
> > 4.
> > "RP is way of telling the system, for how long the data should be kept"
> > ===
> > Since the function is simple, I'd suggest using 'retentionTime'=15 and
> > 'timeUnit'='day' instead of 'RP'='15_days'.
> >
> > 5.
> > "When the data load is called for main table, use an spark accumulator
> > to get the maximum value of timestamp in that load and return to the
> > load."
> > ===
> > How can you get the Spark accumulator? The load is launched using
> > loading-by-dataframe, not using global-sort-by-spark.
> >
> > 6.
> > For the rest of the content, still reading.
> >
> > --
> > Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> >
>

Re: [DISCUSSION] Support Time Series for MV datamap and autodatamap loading of timeseries datamaps

Posted by Akash Nilugal <ak...@gmail.com>.
Hi xuchuanyin,

Thanks for the comments/Suggestions

1. Preaggregate is productized, but not timeseries with preaggregate; I think you got confused with that, if I'm right.
2. Limitations like auto sampling/rollup (which we will be supporting now), retention policies, etc.
3. segmentTimestampMin: I will consider this in the design.
4. RP is added as a separate task; I thought that instead of maintaining two variables, it is better to maintain one and parse it. But I will consider your point based on feasibility during implementation.
5. We use an accumulator which takes a list, so before writing the index files we take the min and max of the timestamp column, fill them into the accumulator, and then we can access accumulator.value in the driver after the load is finished. A minimal sketch follows.
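For clarity, here is a minimal, self-contained sketch of that idea in plain Spark (not the actual CarbonData load code; the values are dummy data):

    import org.apache.spark.sql.SparkSession
    import scala.collection.JavaConverters._

    val spark = SparkSession.builder().master("local[2]").appName("minmax-sketch").getOrCreate()
    // Accumulator that collects one (min, max) pair per task.
    val minMaxAcc = spark.sparkContext.collectionAccumulator[(Long, Long)]("timeseriesMinMax")

    spark.sparkContext.parallelize(Seq(5L, 9L, 1L, 7L), numSlices = 2)
      .foreachPartition { it =>
        val values = it.toList
        // Filled just before the index files would be written in a real load.
        if (values.nonEmpty) minMaxAcc.add((values.min, values.max))
      }

    // Accessible on the driver once the load has finished.
    val pairs = minMaxAcc.value.asScala
    val loadMin = pairs.map(_._1).min
    val loadMax = pairs.map(_._2).max
    println(s"load min=$loadMin, max=$loadMax")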

Regards,
Akash R Nilugal 

On 2019/09/28 10:46:31, xuchuanyin <xu...@apache.org> wrote: 
> Hi Akash, glad to see the feature proposed; I have some questions about
> this. Please note that each of the following items quotes the design
> document attached in the corresponding jira, with my comments after '==='.
>
> 1.
> "Currently carbondata supports timeseries on preaggregate datamap, but its
> an alpha feature"
> ===
> It has been some time since the preaggregate datamap was introduced, and it
> is still **alpha**; why is it still not product-ready? Will the new feature
> also end up in a similar situation?
>
> 2.
> "there are so many limitations when we compare and analyze the existing
> timeseries database or projects which supports time series like apache druid
> or influxdb"
> ===
> What are the actual limitations? Besides, please give an example of this.
>
> 3.
> "Segment_Timestamp_Min"
> ===
> I suggest using a camel-case style, like 'segmentTimestampMin'.
>
> 4.
> "RP is way of telling the system, for how long the data should be kept"
> ===
> Since the function is simple, I'd suggest using 'retentionTime'=15 and
> 'timeUnit'='day' instead of 'RP'='15_days'.
>
> 5.
> "When the data load is called for main table, use an spark accumulator to
> get the maximum value of timestamp in that load and return to the load."
> ===
> How can you get the Spark accumulator? The load is launched using
> loading-by-dataframe, not using global-sort-by-spark.
>
> 6.
> For the rest of the content, still reading.
>
> --
> Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> 

Re: [DISCUSSION] Support Time Series for MV datamap and autodatamap loading of timeseries datamaps

Posted by xuchuanyin <xu...@apache.org>.
Hi Akash, glad to see the feature proposed; I have some questions about
this. Please note that each of the following items quotes the design
document attached in the corresponding jira, with my comments after '==='.

1.
"Currently carbondata supports timeseries on preaggregate datamap, but its
an alpha feature"
===
It has been some time since the preaggregate datamap was introduced, and it
is still **alpha**; why is it still not product-ready? Will the new feature
also end up in a similar situation?

2.
"there are so many limitations when we compare and analyze the existing
timeseries database or projects which supports time series like apache druid
or influxdb"
===
What are the actual limitations? Besides, please give an example of this.

3.
"Segment_Timestamp_Min"
===
I suggest using a camel-case style, like 'segmentTimestampMin'.

4.
"RP is way of telling the system, for how long the data should be kept"
===
Since the function is simple, I'd suggest using 'retentionTime'=15 and
'timeUnit'='day' instead of 'RP'='15_days'; a small sketch of the difference
follows.
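For illustration, the difference is just where the parsing happens (plain Scala, hypothetical property values):

    // A single combined property has to be split and validated:
    val rp = "15_days"
    val Array(amountStr, unit) = rp.split("_") // "15", "days"
    val retentionFromRp = amountStr.toInt

    // Two separate properties can be read directly:
    val props = Map("retentionTime" -> "15", "timeUnit" -> "day")
    val retention = props("retentionTime").toInt
    val timeUnit = props("timeUnit")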

5.
"When the data load is called for main table, use an spark accumulator to
get the maximum value of timestamp in that load and return to the load."
===
How can you get the Spark accumulator? The load is launched using
loading-by-dataframe, not using global-sort-by-spark.

6.
For the rest of the content, still reading.




--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [DISCUSSION] Support Time Series for MV datamap and autodatamap loading of timeseries datamaps

Posted by Akash Nilugal <ak...@gmail.com>.
Hi Chetan,

1. Deprecated means not recommended for use, so I think no further details are required for that.
2. Complex datatypes and alter table add partition are not supported; this will be updated in the document.


On 2019/09/23 14:43:51, chetan bhat <ch...@gmail.com> wrote: 
> Hi Akash,
> 
> 1. For the preaggregate table deprecation as part of subtask 7, can specific details be provided in the design doc?
> 2. Will alter table add partition be supported now for a table that has a timeseries MV datamap? If not supported, it can be updated in the design doc.
> 3. Will complex datatypes be supported for the timeseries MV datamap? If not supported, it can be updated in the design doc.
> 
> Regards
> Chetan
> 
> On 2019/09/23 13:42:48, Akash Nilugal <ak...@gmail.com> wrote: 
> > Hi Community,
> > 
> > Timeseries data are simply measurements or events that are
> > tracked, monitored, downsampled, and aggregated over time.
> > Basically, timeseries data analysis helps in analyzing or monitoring
> > the aggregated data over a period of time to take better decisions for business.
> > So, since carbondata supports OLAP datamaps like preaggregate and MV, and since
> > time series is of utmost importance,
> > we can support timeseries for carbondata over the MV datamap model.
> > 
> > Currently carbondata supports timeseries on the preaggregate datamap, but it is
> > an alpha feature, and there are many limitations when we compare and
> > analyze the existing timeseries databases or projects which support time
> > series, like apache druid or influxdb. So, in this feature we can support
> > timeseries while avoiding the limitations in the current system. After doing
> > the analysis of the existing timeseries databases like influxdb and apache
> > druid,
> > I have prepared a solution/design document. Any inputs, improvements or
> > suggestions are most welcome.
> > 
> > I have created jira https://issues.apache.org/jira/browse/CARBONDATA-3525 for
> > this. Later I will create sub-jiras for tracking.
> > 
> > 
> > Regards,
> > Akash R Nilugal
> > 
> 

Re: [DISCUSSION] Support Time Series for MV datamap and autodatamap loading of timeseries datamaps

Posted by chetan bhat <ch...@gmail.com>.
Hi Akash,

1. For the preaggregate table deprecation as part of subtask 7, can specific details be provided in the design doc?
2. Will alter table add partition be supported now for a table that has a timeseries MV datamap? If not supported, it can be updated in the design doc.
3. Will complex datatypes be supported for the timeseries MV datamap? If not supported, it can be updated in the design doc.

Regards
Chetan

On 2019/09/23 13:42:48, Akash Nilugal <ak...@gmail.com> wrote: 
> Hi Community,
> 
> Timeseries data are simply measurements or events that are
> tracked, monitored, downsampled, and aggregated over time.
> Basically, timeseries data analysis helps in analyzing or monitoring
> the aggregated data over a period of time to take better decisions for business.
> So, since carbondata supports OLAP datamaps like preaggregate and MV, and since
> time series is of utmost importance,
> we can support timeseries for carbondata over the MV datamap model.
> 
> Currently carbondata supports timeseries on the preaggregate datamap, but it is
> an alpha feature, and there are many limitations when we compare and
> analyze the existing timeseries databases or projects which support time
> series, like apache druid or influxdb. So, in this feature we can support
> timeseries while avoiding the limitations in the current system. After doing
> the analysis of the existing timeseries databases like influxdb and apache
> druid,
> I have prepared a solution/design document. Any inputs, improvements or
> suggestions are most welcome.
> 
> I have created jira https://issues.apache.org/jira/browse/CARBONDATA-3525 for
> this. Later I will create sub-jiras for tracking.
> 
> 
> Regards,
> Akash R Nilugal
>