Posted to user@kylin.apache.org by Edward Zhang <yo...@apache.org> on 2016/02/03 02:45:55 UTC

[Discuss] KYLIN-1351 Kylin to support RDBMS as data source

Hi Kylin Community,

I discussed JIRA KYLIN-1351 with Shaofeng (@Shaofengshi): making Kylin
support RDBMS as a data source. We would like more input from the
community on how important and urgent this feature is. Please respond with
your suggestions if you need this feature or are interested in developing
it.

Though Kylin today supports pluggable data sources, this RDBMS feature is
not trivial, because we need to take care of the following problems.

1. Independent dictionary, especially for data type mapping.
Hive's data type system differs from that of an RDBMS. Today the Kylin
dictionary presumably infers column types from the Hive schema, but we need
to make sure the dictionary is independent of the data source so that an
RDBMS schema can be stored in the Kylin dictionary.

2. Pipeline
Do we import data from the RDBMS into Hive, or read data directly from the
RDBMS?
If the destination is Hive, we may reuse the current Hive MR cubing job, but
we need to take care of the RDBMS-to-Hive conversion.
If Kylin reads data directly from the RDBMS, we need to write a new MR or
Spark job.

3. Consistency
Normally an RDBMS supports data insert/update/delete. How does Kylin handle
that?

4. Read continuously
Do we require that the RDBMS fact table always has a timestamp field that
Kylin uses to read records continuously?

5. Cube modeling
Is the current cube modeling feature independent enough to support RDBMS
modeling?

6. Sharding
Normally an RDBMS can support complicated join queries across multiple
tables. The reason to use Kylin here is probably that the source table is
sharded into many child tables, and Kylin can query across all the shards at
once after the data is imported into Kylin.
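
For item 1, a data-source-independent type mapping might start from the JDBC
type codes that any RDBMS driver reports. The following is only a minimal
sketch, not Kylin code: the class JdbcToHiveTypes and its method are
hypothetical names, and the chosen target types (for example, widening all
character types to string) are assumptions.

```java
import java.sql.Types;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of Kylin): maps JDBC type codes to Hive type names. */
final class JdbcToHiveTypes {
    private static final Map<Integer, String> SIMPLE = new HashMap<>();
    static {
        SIMPLE.put(Types.TINYINT, "tinyint");
        SIMPLE.put(Types.SMALLINT, "smallint");
        SIMPLE.put(Types.INTEGER, "int");
        SIMPLE.put(Types.BIGINT, "bigint");
        SIMPLE.put(Types.REAL, "float");
        SIMPLE.put(Types.FLOAT, "double");   // JDBC FLOAT is a double-precision type
        SIMPLE.put(Types.DOUBLE, "double");
        SIMPLE.put(Types.BOOLEAN, "boolean");
        SIMPLE.put(Types.DATE, "date");
        SIMPLE.put(Types.TIMESTAMP, "timestamp");
        // Assumption: all character types are widened to Hive's unbounded string.
        SIMPLE.put(Types.CHAR, "string");
        SIMPLE.put(Types.VARCHAR, "string");
        SIMPLE.put(Types.LONGVARCHAR, "string");
    }

    private JdbcToHiveTypes() {}

    /** Returns the Hive type name for a JDBC type code, or throws if unmapped. */
    static String toHiveType(int jdbcType, int precision, int scale) {
        if (jdbcType == Types.DECIMAL || jdbcType == Types.NUMERIC) {
            // Hive decimals allow a precision of 1..38; clamp to stay valid.
            int p = Math.min(Math.max(precision, 1), 38);
            int s = Math.min(Math.max(scale, 0), p);
            return "decimal(" + p + "," + s + ")";
        }
        String hiveType = SIMPLE.get(jdbcType);
        if (hiveType == null) {
            throw new IllegalArgumentException("No mapping for JDBC type code " + jdbcType);
        }
        return hiveType;
    }

    public static void main(String[] args) {
        System.out.println(toHiveType(Types.INTEGER, 0, 0));
        System.out.println(toHiveType(Types.DECIMAL, 10, 2));
    }
}
```

In practice the type code, precision, and scale would come from
java.sql.ResultSetMetaData (getColumnType, getPrecision, getScale), so the
same mapping would work for any RDBMS with a JDBC driver.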

Thanks
Edward
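
Item 4 above (reading continuously) could work along the following lines if
the fact table does carry a timestamp column. This is a sketch under
assumptions, not a design decision from the thread: IncrementalWindow is a
hypothetical class, and half-open windows [start, end) are chosen so that
consecutive reads neither overlap nor leave gaps.

```java
import java.time.Instant;

/** Hypothetical sketch: a half-open time window [start, end) for one incremental read. */
final class IncrementalWindow {
    final Instant startInclusive;
    final Instant endExclusive;

    IncrementalWindow(Instant startInclusive, Instant endExclusive) {
        if (!startInclusive.isBefore(endExclusive)) {
            throw new IllegalArgumentException("start must be before end");
        }
        this.startInclusive = startInclusive;
        this.endExclusive = endExclusive;
    }

    /** The next window starts exactly where this one ended. */
    IncrementalWindow next(Instant newEnd) {
        return new IncrementalWindow(endExclusive, newEnd);
    }

    /**
     * Builds a parameterized query; bind the window start and end as the two
     * parameters. Table and column names are concatenated, so they must come
     * from trusted configuration, never from user input.
     */
    static String selectSql(String table, String tsColumn) {
        return "SELECT * FROM " + table
            + " WHERE " + tsColumn + " >= ? AND " + tsColumn + " < ?";
    }

    public static void main(String[] args) {
        IncrementalWindow first = new IncrementalWindow(
            Instant.parse("2016-02-03T00:00:00Z"), Instant.parse("2016-02-04T00:00:00Z"));
        IncrementalWindow second = first.next(Instant.parse("2016-02-05T00:00:00Z"));
        System.out.println(selectSql("sales", "updated_at"));
        System.out.println(second.startInclusive + " .. " + second.endExclusive);
    }
}
```

Note that a window like this only picks up rows whose timestamp falls inside
it. Rows updated or deleted later in the RDBMS (item 3) would not be seen
unless the timestamp column is also refreshed on update, which is itself an
assumption about the source schema.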

Re: [Discuss] KYLIN-1351 Kylin to support RDBMS as data source

Posted by Luke Han <lu...@gmail.com>.

That's right, support for different input sources has been requested many
times. Kylin 1.x uses Hive as the protocol, and this has been extended to
support "SQL on Hadoop" as the protocol in Kylin 2.

I think we could leverage SparkSQL and Apache Drill (or others); they both
offer an abstraction layer over the underlying data sources. With the
pluggable input source architecture, Kylin 2 just needs to depend on (or
include) one library and leave RDBMS and even other sources to it.

@Edward, what do you think? Maybe we could consult the Drill and Spark
communities about this idea.

Thanks.
Luke

Best Regards!
---------------------

Luke Han

On Thu, Feb 4, 2016 at 12:23 AM, KylinPOC <sa...@gmail.com>
wrote:

> As Luke said:
> "But reading from RDBMS is a valid way to extend the input sources beyond
> Hive: not only RDBMS but also SparkSQL, Impala, Drill, and other SQL on
> Hadoop."
>
> This can be a great addition. Many organizations now have Hadoop and other
> databases as a data lake/data fabric.
>
> --
> View this message in context:
> http://apache-kylin.74782.x6.nabble.com/Discuss-KYLIN-1351-Kylin-to-support-RDBMS-as-data-source-tp3564p3596.html
> Sent from the Apache Kylin mailing list archive at Nabble.com.
>

Re: [Discuss] KYLIN-1351 Kylin to support RDBMS as data source

Posted by KylinPOC <sa...@gmail.com>.

As Luke said:
"But reading from RDBMS is a valid way to extend the input sources beyond
Hive: not only RDBMS but also SparkSQL, Impala, Drill, and other SQL on
Hadoop."

This can be a great addition. Many organizations now have Hadoop and other
databases as a data lake/data fabric. 

Re: [Discuss] KYLIN-1351 Kylin to support RDBMS as data source

Posted by "Sudhir.Kumar" <Su...@target.com>.

Hello Edward,

One of the big advantages of Kylin talking to an RDBMS would be in building a unified data architecture. But how would data blending from multiple sources be done in Kylin? The advantage of RDBMS support would come if data blending is enabled in Kylin. Also, as Luke mentions, there are tools available that can handle ETL to Hive. And as a data strategy, organizations would eventually like to keep their data in HDFS.

In my opinion, reading from an RDBMS would be a good-to-have feature, but it does not seem urgent.

Thanks,

Sudhir

"We must accept finite disappointment, but never lose infinite hope." - Martin Luther King Jr.


From: Luke Han <lu...@gmail.com>
Reply-To: "user@kylin.apache.org" <us...@kylin.apache.org>
Date: Wednesday, February 3, 2016 at 8:32 AM
To: "dev@kylin.apache.org" <de...@kylin.apache.org>
Cc: "user@kylin.apache.org" <us...@kylin.apache.org>
Subject: Re: [Discuss] KYLIN-1351 Kylin to support RDBMS as data source

Hi Edward,
     Thanks for raising this discussion. Reading data from RDBMS is tricky, and we have to come up with a very clear design and architecture before implementing it.

     There was one thread/JIRA about reading data from Oracle directly, but it was finally dropped since there are already many tools that can extract data from Oracle and load it into Hive.

     The concern here is that most RDBMS are not yet optimized for a distributed system to read from directly, for example hundreds of Hadoop nodes reading data from MySQL, Oracle, or others at the same time. The network is also a concern.

     From the beginning, we decided to use Hive as the protocol between upstream and Kylin. This has been a good model so far, since users can leverage any ETL tool to land source data in Hive and then build cubes on it. Even if Kylin supported reading data from RDBMS, what about transform? What about load? It would bring ETL into Kylin's scope, which is not a good idea, I think.

     But reading from RDBMS is a valid way to extend the input sources beyond Hive: not only RDBMS but also SparkSQL, Impala, Drill, and other SQL on Hadoop.
     How about building a light tool for this requirement? It could be an extension tool for users to leverage.

     Thanks.
Luke

Best Regards!
---------------------

Luke Han

Re: [Discuss] KYLIN-1351 Kylin to support RDBMS as data source

Posted by Luke Han <lu...@gmail.com>.

Hi Edward,
     Thanks for raising this discussion. Reading data from RDBMS is tricky,
and we have to come up with a very clear design and architecture before
implementing it.

     There was one thread/JIRA about reading data from Oracle directly, but
it was finally dropped since there are already many tools that can extract
data from Oracle and load it into Hive.

     The concern here is that most RDBMS are not yet optimized for a
distributed system to read from directly, for example hundreds of Hadoop
nodes reading data from MySQL, Oracle, or others at the same time. The
network is also a concern.
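
     One common way to keep such reads bounded is to split the source
table's key range into a fixed number of slices, so that the number of
concurrent connections is capped no matter how many Hadoop nodes are
available. The following is a minimal sketch that assumes a numeric primary
key; KeyRangeSplitter is a hypothetical name, not Kylin code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: caps read parallelism by splitting a numeric key range. */
final class KeyRangeSplitter {
    private KeyRangeSplitter() {}

    /**
     * Splits the inclusive range [minKey, maxKey] into at most maxSplits
     * contiguous, non-overlapping inclusive ranges, each returned as {lo, hi}.
     */
    static List<long[]> split(long minKey, long maxKey, int maxSplits) {
        if (minKey > maxKey || maxSplits < 1) {
            throw new IllegalArgumentException("need minKey <= maxKey and maxSplits >= 1");
        }
        // addExact/subtractExact fail loudly instead of overflowing silently.
        long total = Math.addExact(Math.subtractExact(maxKey, minKey), 1L);
        long base = total / maxSplits;
        long remainder = total % maxSplits;

        List<long[]> ranges = new ArrayList<>();
        long lo = minKey;
        for (int i = 0; i < maxSplits; i++) {
            // The first `remainder` slices take one extra key each.
            long size = base + (i < remainder ? 1 : 0);
            if (size == 0) {
                break; // fewer keys than requested splits
            }
            long hi = lo + size - 1;
            ranges.add(new long[] {lo, hi});
            lo = hi + 1;
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (long[] r : split(1, 10, 3)) {
            System.out.println("WHERE id BETWEEN " + r[0] + " AND " + r[1]);
        }
    }
}
```

Each slice would then be read by one task over its own connection. Even
splitting assumes the keys are roughly uniformly distributed; skewed keys
would need a different strategy.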

     From the beginning, we decided to use Hive as the protocol between
upstream and Kylin. This has been a good model so far, since users can
leverage any ETL tool to land source data in Hive and then build cubes on
it. Even if Kylin supported reading data from RDBMS, what about transform?
What about load? It would bring ETL into Kylin's scope, which is not a good
idea, I think.

     But reading from RDBMS is a valid way to extend the input sources
beyond Hive: not only RDBMS but also SparkSQL, Impala, Drill, and other SQL
on Hadoop.
     How about building a light tool for this requirement? It could be an
extension tool for users to leverage.

     Thanks.
Luke

Best Regards!
---------------------

Luke Han
