Posted to user@kylin.apache.org by George Ni <ni...@apache.org> on 2020/01/19 14:22:30 UTC

Kylin Building Engine With SparkSql & Parquet

Hi Kylin users & developers,

By-layer Spark Cubing was introduced in Apache Kylin v2.0 to improve build
performance, and it does run much faster than the MR engine. HBase has
been Kylin’s trusted storage engine since Kylin was born, and it has proved
successful at serving highly concurrent queries over extremely large data
sets with low latency. But HBase also has limitations: filtering is not
flexible, since we can only filter by RowKey, and measures are usually
stored together, which causes more data to be scanned than requested.



So, to optimize both the building strategy and the storage engine, the
Kyligence development team is introducing a new cube building engine that
uses Spark SQL to construct cuboids with a new strategy and stores cube
results in Parquet files. The new strategy lets Kylin build cuboids in a
smarter way by choosing the optimal source cuboid for each one and building
from it. Parquet, a columnar storage format available to any project in the
Hadoop ecosystem, will improve filtering through its page-level column
index and reduce I/O by storing measures in separate columns. Also, by
storing cuboids in Parquet instead of HBase, we can run Kylin in a
cloud-native way. More information on the design and technical details will
come soon.
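
To make "choosing the optimal cuboid source" concrete, here is a minimal
sketch in plain Python. It assumes a cuboid is identified by its set of
dimensions and that the best source is the already-built cuboid with the
fewest rows that still contains every dimension of the target. The function
name, the dimension names and the row counts are all hypothetical; the
actual selection rule of the new engine is not described in this post.

```python
from typing import Dict, FrozenSet, Optional

Cuboid = FrozenSet[str]  # a cuboid is identified by its set of dimensions


def pick_source(target: Cuboid, built: Dict[Cuboid, int]) -> Optional[Cuboid]:
    """Return the cheapest already-built cuboid that can produce `target`.

    A cuboid can serve as a source only if it contains every dimension of
    the target, so that the target is a plain GROUP BY over it. Among the
    candidates, the one with the fewest rows is assumed to be cheapest.
    """
    candidates = [c for c in built if target <= c]
    if not candidates:
        return None
    # Ties on row count are broken deterministically by size, then by name.
    return min(candidates, key=lambda c: (built[c], len(c), sorted(c)))


# Hypothetical row counts for three cuboids that have already been built.
built = {
    frozenset({"year", "region", "category", "brand"}): 1_000_000,  # base
    frozenset({"year", "region", "category"}): 120_000,
    frozenset({"year", "region"}): 4_000,
}

source = pick_source(frozenset({"year"}), built)
print(sorted(source))  # ['region', 'year']
```

Under these assumptions the "year" cuboid is aggregated from the 4,000-row
(year, region) cuboid rather than from the base cuboid, which is the kind of
saving the strategy is after.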



Below is a comparison of build duration and result size between by-layer
Spark Cubing and the new cubing strategy.



Environment

4-node Hadoop cluster

YARN has 400GB RAM and 128 cores in total;

CDH 5.1, Apache Kylin 3.0.



Spark

Spark 2.4.1-kylin-r17



Test Data

SSB data

Cube: 15 dimensions, 3 measures (SUM)
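
For a sense of scale, the snippet below counts the cuboids of a
15-dimension cube. It assumes a full cube with no aggregation groups or
other pruning, which this post does not state, so the real test cube may
well have fewer cuboids.

```python
from math import comb

N_DIMENSIONS = 15  # from the test setup above

# Layer k holds the cuboids with exactly k dimensions. By-layer cubing
# starts from the base cuboid (all dimensions) and works down one layer
# at a time. Assumption: full cube, no aggregation groups or pruning.
layers = {k: comb(N_DIMENSIONS, k) for k in range(N_DIMENSIONS, -1, -1)}
total = sum(layers.values())

print(total)                 # 32768, i.e. 2 ** 15
print(layers[15])            # 1 (the base cuboid)
print(layers[14])            # 15
print(max(layers.values()))  # 6435 (layers 7 and 8)
```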



Test Scenarios

Build the cube at different source size levels: 30 million and 60 million
source rows. Compare the build time of Spark (by layer) + HBase with that
of Spark SQL + Parquet.


Besides, we are attempting to resolve many drawbacks of the current query
engine, which relies heavily on Apache Calcite, such as the performance
bottleneck in aggregating large query results, which currently can only be
done by a single worker. By embracing Spark SQL, this kind of expensive
computation can be distributed. Combined with the Parquet format, plenty of
filtering optimizations can also be applied, which will boost Kylin’s query
performance significantly. These features will be open-sourced, along with
the technical details, in the near future.
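
The idea behind distributing the aggregation can be sketched as follows.
This is plain Python rather than Spark code, and the partition contents and
function names are made up for illustration. It relies only on the fact
that SUM is decomposable: each worker can sum its own partition, and the
partial sums can be merged afterwards.

```python
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int]  # (group key, measure value)


def partial_sum(partition: Iterable[Row]) -> Dict[str, int]:
    """Work done independently by each worker on its own partition."""
    acc: Dict[str, int] = {}
    for key, value in partition:
        acc[key] = acc.get(key, 0) + value
    return acc


def merge(partials: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Final step: combine the small per-partition results."""
    total: Dict[str, int] = {}
    for partial in partials:
        for key, value in partial.items():
            total[key] = total.get(key, 0) + value
    return total


# Three hypothetical partitions of a query result.
partitions: List[List[Row]] = [
    [("ASIA", 10), ("EUROPE", 5)],
    [("ASIA", 7)],
    [("EUROPE", 1), ("AMERICA", 3)],
]

result = merge(partial_sum(p) for p in partitions)
print(sorted(result.items()))  # [('AMERICA', 3), ('ASIA', 17), ('EUROPE', 6)]
```

With a single worker, all rows pass through one accumulator. Here only the
per-partition summaries reach the final merge step.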



   - https://issues.apache.org/jira/browse/KYLIN-4188


-- 

---------------------

Best regards,



Ni Chunen / George

Re: Kylin Building Engine With SparkSql & Parquet

Posted by Luke Han <lu...@gmail.com>.

I agree, one storage engine for next-generation Kylin is good enough.
But I would like to keep the interface in line with today's best practices,
so that people can easily extend it to other storage options.
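
As an illustration only, keeping storage behind an interface could look
like the sketch below. The class and method names are hypothetical and are
not Kylin's actual storage API; the in-memory backend merely stands in for
a Parquet- or HBase-backed one.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Tuple[object, ...]  # dimension values followed by measure values


class CubeStorage(ABC):
    """Hypothetical storage contract; not Kylin's actual interface."""

    @abstractmethod
    def write_cuboid(self, cuboid_id: int, rows: Iterable[Row]) -> None:
        """Persist all rows of one cuboid."""

    @abstractmethod
    def scan_cuboid(self, cuboid_id: int) -> Iterator[Row]:
        """Yield the rows of one cuboid."""


class InMemoryStorage(CubeStorage):
    """Toy backend used here in place of a real storage engine."""

    def __init__(self) -> None:
        self._cuboids: Dict[int, List[Row]] = {}

    def write_cuboid(self, cuboid_id: int, rows: Iterable[Row]) -> None:
        self._cuboids[cuboid_id] = list(rows)

    def scan_cuboid(self, cuboid_id: int) -> Iterator[Row]:
        return iter(self._cuboids.get(cuboid_id, []))


def total_of_last_measure(storage: CubeStorage, cuboid_id: int) -> int:
    # Callers depend only on the interface, so backends are swappable.
    return sum(row[-1] for row in storage.scan_cuboid(cuboid_id))


storage = InMemoryStorage()
storage.write_cuboid(3, [("ASIA", 1992, 10), ("ASIA", 1993, 7)])
print(total_of_last_measure(storage, 3))  # 17
```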

Best Regards!
---------------------

Luke Han


On Sat, Feb 1, 2020 at 9:13 PM ShaoFeng Shi <sh...@apache.org> wrote:

> In my opinion, it is very hard to maintain HBase storage and Parquet
> storage together. So once Parquet storage is stable enough, Kylin 4.0 can
> drop its dependency on HBase.
>
> Best regards,
>
> Shaofeng Shi 史少锋
> Apache Kylin PMC
> Email: shaofengshi@apache.org
>
> Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
> Join Kylin user mail group: user-subscribe@kylin.apache.org
> Join Kylin dev mail group: dev-subscribe@kylin.apache.org

Re: Kylin Building Engine With SparkSql & Parquet

Posted by ShaoFeng Shi <sh...@apache.org>.

In my opinion, it is very hard to maintain HBase storage and Parquet
storage together. So once Parquet storage is stable enough, Kylin 4.0 can
drop its dependency on HBase.

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Email: shaofengshi@apache.org

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org




On Thu, Jan 30, 2020 at 11:04 PM nichunen <ni...@apache.org> wrote:

> Hi Shaofeng,
>
>
> For your questions:
>
>
> 1) When the Parquet storage is released (say in Kylin 4.0), will the HBase
> storage still be kept (co-exist), or totally be replaced?
> I think we will keep an active branch with releases for HBase storage; it
> won’t be totally replaced in the near future.
>
> 2) Is there a migration tool for migrating HBase cubes to the new storage?
>
> The tool is in the development plan. What’s more, the metadata will be
> compatible.
>
>
>
> Best regards,
>
>
>
> Ni Chunen / George

Re: Kylin Building Engine With SparkSql & Parquet

Posted by nichunen <ni...@apache.org>.

Hi Shaofeng,

For your questions:

1) When the Parquet storage is released (say in Kylin 4.0), will the HBase storage still be kept (co-exist), or totally be replaced?
I think we will keep an active branch with releases for HBase storage; it won’t be totally replaced in the near future.

2) Is there a migration tool for migrating HBase cubes to the new storage?

The tool is in the development plan. What’s more, the metadata will be compatible.

Best regards,

Ni Chunen / George

On 2020/1/21, 4:10 AM, "ShaoFeng Shi" <sh...@apache.org> wrote:

Chun en,

Thanks for the info. I think we need to discuss more in the community, for
example:

1) When the Parquet storage is released (say in Kylin 4.0), will the HBase
storage still be kept (co-exist), or totally be replaced?
2) Is there a migration tool for migrating HBase cubes to the new storage?

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Email: shaofengshi@apache.org

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org


Re: Kylin Building Engine With SparkSql & Parquet

Posted by Liu ehter <et...@gmail.com>.

Sounds exciting. All great features!



On 2020/1/21, 4:10 AM, "ShaoFeng Shi" <sh...@apache.org> wrote:

    Chun en,
    
    Thanks for the info. I think we need to discuss more in the community, for
    example:
    
    1) When the Parquet storage is released (say in Kylin 4.0), will the HBase
    storage still be kept (co-exist), or totally be replaced?
    2) Is there a migration tool for migrating HBase cubes to the new storage?
    
    Best regards,
    
    Shaofeng Shi 史少锋
    Apache Kylin PMC
    Email: shaofengshi@apache.org
    
    Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
    Join Kylin user mail group: user-subscribe@kylin.apache.org
    Join Kylin dev mail group: dev-subscribe@kylin.apache.org
    
    
    
    

Re: Kylin Building Engine With SparkSql & Parquet

Posted by ShaoFeng Shi <sh...@apache.org>.

Chun en,

Thanks for the info. I think we need to discuss more in the community, for
example:

1) When the Parquet storage is released (say in Kylin 4.0), will the HBase
storage still be kept (co-exist), or totally be replaced?
2) Is there a migration tool for migrating HBase cubes to the new storage?

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Email: shaofengshi@apache.org

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org




On Mon, Jan 20, 2020 at 9:38 PM nichunen <ni...@apache.org> wrote:

> Hi Shaofeng,
>
>
> Below is our plan for this project; any suggestions are very welcome.
>
>
> 1. In mid-February 2020, open-source the prototype code of this feature
> to the branch "kylin-on-parquet-v2"; cubes can be built with the new
> building engine and stored in Parquet format.
>
>
> 2. In late April 2020, the query module for the new storage type is
> scheduled to be ready; a happy path for cube creation, building, and
> query will be available then.
>
>
> 3. In May or June 2020, a beta version (Kylin 4.0?) will be released.
>
>
>
> Best regards,
>
>
>
> Ni Chunen / George

Re: Kylin Building Engine With SparkSql & Parquet

Posted by nichunen <ni...@apache.org>.

Hi Shaofeng,

Below is our plan for this project; any suggestions are very welcome.

1. In mid-February 2020, open-source the prototype code of this feature to the branch "kylin-on-parquet-v2"; cubes can be built with the new building engine and stored in Parquet format.

2. In late April 2020, the query module for the new storage type is scheduled to be ready; a happy path for cube creation, building, and query will be available then.

3. In May or June 2020, a beta version (Kylin 4.0?) will be released.

Best regards,

Ni Chunen / George

On 01/20/2020 16:00, ShaoFeng Shi <sh...@apache.org> wrote:
Hi, Chun en,

Thanks for the information. What's the detailed release plan of this
feature to the community?

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Email: shaofengshi@apache.org

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org

Xiaoxiang Yu <xx...@apache.org> wrote on Monday, January 20, 2020 at 1:59 PM:

Great news!
I can foresee Kylin becoming more cloud-native once the Parquet storage
matures, and I hope the development team will share more details about
its design.

Best wishes to you!
From: Xiaoxiang Yu

At 2020-01-19 22:22:30, "George Ni" <ni...@apache.org> wrote:
Hi Kylin users & developers,

Below is the comparison in building duration and size of results between
By-layer Spark Cubing and the new cubing strategy.

Environment

4-node Hadoop cluster

YARN has 400GB RAM and 128 cores in total;

CDH 5.1, Apache Kylin 3.0.

Spark

Spark 2.4.1-kylin-r17

Test Data

SSB data

Cube: 15 dimensions, 3 measures (SUM)

Test Scenarios

Build the cube at different source size level: 30 million, 60 million
source rows; Compare the build time with Spark (by layer) + Hbase and
SparkSql + Parquet.

- https://issues.apache.org/jira/browse/KYLIN-4188

---------------------

Best regards,

Ni Chunen / George

Re: Kylin Building Engine With SparkSql & Parquet

Posted by ShaoFeng Shi <sh...@apache.org>.

Hi, Chun en,

Thanks for the information. What's the detailed release plan of this
feature to the community?

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Email: shaofengshi@apache.org

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org




Xiaoxiang Yu <xx...@apache.org> wrote on Monday, January 20, 2020 at 1:59 PM:

> Great news!
> I can foresee Kylin becoming more cloud-native once the Parquet storage
> matures, and I hope the development team will share more details about
> its design.
>
>
>
>
> --
>
> Best wishes to you!
> From: Xiaoxiang Yu
>
>
>
> At 2020-01-19 22:22:30, "George Ni" <ni...@apache.org> wrote:
> >Hi Kylin users & developers,
> >
> >By-layer Spark Cubing has been introduced into Apache Kylin since v2.0 to
> >achieve better performance and it does run much faster compared to MR
> >engine. Also Hbase has been Kylin’s trustful storage engine since Kylin was
> >born and it has been proved to be a success for providing the ability to
> >handle high concurrency queries in extremely large data scale with low
> >latency. But there are also limitations for HBase, such as filtering is not
> >flexible as we could only filter by RowKey, measures are usually combined
> >together which causes more data to be scanned than requested.
> >
> >
> >
> >So in order to optimize Kylin in both building strategy and storage engine,
> >development team of Kyligence is introducing a new cube building engine
> >which uses Spark Sql to construct cuboids with a new strategy and stores
> >cube results in Parquet files. The building strategy allows Kylin to build
> >cuboids in a smarter way by choosing and building on the optimal cuboid
> >source. And Parquet, a columnar storage format available to any project in
> >the Hadoop ecosystem, will power the filtering ability with the page-level
> >column index and reduce I/O by saving measures in different columns. Also
> >with Storing cuboid in Parquet instead of Hbase, we can utilize Kylin in
> >Cloud Native way. More information on design and technique details will
> >come soon.
> >
> >
> >
> >Below is the comparison in building duration and size of results between
> >By-layer Spark Cubing and the new cubing strategy.
> >
> >
> >
> >Environment
> >
> >4-node Hadoop cluster
> >
> >YARN has 400GB RAM and 128 cores in total;
> >
> >CDH 5.1, Apache Kylin 3.0.
> >
> >
> >
> >Spark
> >
> >Spark 2.4.1-kylin-r17
> >
> >
> >
> >Test Data
> >
> >SSB data
> >
> >Cube: 15 dimensions, 3 measures (SUM)
> >
> >
> >
> >Test Scenarios
> >
> >Build the cube at different source size level: 30 million, 60 million
> >source rows; Compare the build time with Spark (by layer) + Hbase and
> >SparkSql + Parquet.
> >
> >
> >Besides, we attempt to resolve many drawbacks in current query engine,
> >which relies heavily on Apache Calcite, such as the performance bottleneck
> >in aggregating large query results which currently can only be operated by
> >a single worker. By embracing SparkSql, this kind of expensive computing
> >can be done distributedly. Also combined with Parquet format, plenty of
> >filtering optimizations could be applied, which will boost Kylin’s query
> >performance significantly. The features will be open source along with
> >technique details in the near future.
> >
> >
> >
> >   - https://issues.apache.org/jira/browse/KYLIN-4188
> >
> >
> >--
> >
> >---------------------
> >
> >Best regards,
> >
> >
> >
> >Ni Chunen / George
>

Re:Kylin Building Engine With SparkSql & Parquet

Posted by Xiaoxiang Yu <xx...@apache.org>.

Great news!
I can foresee Kylin becoming more cloud-native once the Parquet storage matures, and I hope the development team will share more details about its design.




--

Best wishes to you!
From: Xiaoxiang Yu



At 2020-01-19 22:22:30, "George Ni" <ni...@apache.org> wrote:
>Hi Kylin users & developers,
>
>By-layer Spark Cubing has been introduced into Apache Kylin since v2.0 to
>achieve better performance and it does run much faster compared to MR
>engine. Also Hbase has been Kylin’s trustful storage engine since Kylin was
>born and it has been proved to be a success for providing the ability to
>handle high concurrency queries in extremely large data scale with low
>latency. But there are also limitations for HBase, such as filtering is not
>flexible as we could only filter by RowKey, measures are usually combined
>together which causes more data to be scanned than requested.
>
>
>
>So in order to optimize Kylin in both building strategy and storage engine,
>development team of Kyligence is introducing a new cube building engine
>which uses Spark Sql to construct cuboids with a new strategy and stores
>cube results in Parquet files. The building strategy allows Kylin to build
>cuboids in a smarter way by choosing and building on the optimal cuboid
>source. And Parquet, a columnar storage format available to any project in
>the Hadoop ecosystem, will power the filtering ability with the page-level
>column index and reduce I/O by saving measures in different columns. Also
>with Storing cuboid in Parquet instead of Hbase, we can utilize Kylin in
>Cloud Native way. More information on design and technique details will
>come soon.
>
>
>
>Below is the comparison in building duration and size of results between
>By-layer Spark Cubing and the new cubing strategy.
>
>
>
>Environment
>
>4-node Hadoop cluster
>
>YARN has 400GB RAM and 128 cores in total;
>
>CDH 5.1, Apache Kylin 3.0.
>
>
>
>Spark
>
>Spark 2.4.1-kylin-r17
>
>
>
>Test Data
>
>SSB data
>
>Cube: 15 dimensions, 3 measures (SUM)
>
>
>
>Test Scenarios
>
>Build the cube at different source size level: 30 million, 60 million
>source rows; Compare the build time with Spark (by layer) + Hbase and
>SparkSql + Parquet.
>
>
>Besides, we attempt to resolve many drawbacks in current query engine,
>which relies heavily on Apache Calcite, such as the performance bottleneck
>in aggregating large query results which currently can only be operated by
>a single worker. By embracing SparkSql, this kind of expensive computing
>can be done distributedly. Also combined with Parquet format, plenty of
>filtering optimizations could be applied, which will boost Kylin’s query
>performance significantly. The features will be open source along with
>technique details in the near future.
>
>
>
>   - https://issues.apache.org/jira/browse/KYLIN-4188
>
>
>-- 
>
>---------------------
>
>Best regards,
>
>
>
>Ni Chunen / George

Re:Kylin Building Engine With SparkSql & Parquet

Posted by 朱卫斌 <co...@126.com>.

Looking forward to it; I believe it will bring a great performance improvement.


weibin0516
codingforfun@126.com
Best wishes!


On 01/19/2020 22:22, George Ni <ni...@apache.org> wrote:
Hi Kylin users & developers,

By-layer Spark Cubing has been introduced into Apache Kylin since v2.0 to
achieve better performance and it does run much faster compared to MR
engine. Also Hbase has been Kylin’s trustful storage engine since Kylin was
born and it has been proved to be a success for providing the ability to
handle high concurrency queries in extremely large data scale with low
latency. But there are also limitations for HBase, such as filtering is not
flexible as we could only filter by RowKey, measures are usually combined
together which causes more data to be scanned than requested.



So in order to optimize Kylin in both building strategy and storage engine,
development team of Kyligence is introducing a new cube building engine
which uses Spark Sql to construct cuboids with a new strategy and stores
cube results in Parquet files. The building strategy allows Kylin to build
cuboids in a smarter way by choosing and building on the optimal cuboid
source. And Parquet, a columnar storage format available to any project in
the Hadoop ecosystem, will power the filtering ability with the page-level
column index and reduce I/O by saving measures in different columns. Also
with Storing cuboid in Parquet instead of Hbase, we can utilize Kylin in
Cloud Native way. More information on design and technique details will
come soon.



Below is the comparison in building duration and size of results between
By-layer Spark Cubing and the new cubing strategy.



Environment

4-node Hadoop cluster

YARN has 400GB RAM and 128 cores in total;

CDH 5.1, Apache Kylin 3.0.



Spark

Spark 2.4.1-kylin-r17



Test Data

SSB data

Cube: 15 dimensions, 3 measures (SUM)



Test Scenarios

Build the cube at different source size level: 30 million, 60 million
source rows; Compare the build time with Spark (by layer) + Hbase and
SparkSql + Parquet.


Besides, we attempt to resolve many drawbacks in current query engine,
which relies heavily on Apache Calcite, such as the performance bottleneck
in aggregating large query results which currently can only be operated by
a single worker. By embracing SparkSql, this kind of expensive computing
can be done distributedly. Also combined with Parquet format, plenty of
filtering optimizations could be applied, which will boost Kylin’s query
performance significantly. The features will be open source along with
technique details in the near future.



- https://issues.apache.org/jira/browse/KYLIN-4188


--

---------------------

Best regards,



Ni Chunen / George
