Posted to dev@carbondata.apache.org by Jacky Li <ja...@apache.org> on 2021/09/01 06:24:47 UTC

Re: [DISCUSSION] Carbondata Streamer tool and Schema change capture in CDC merge

+1
It is a really good feature, looking forward to it.

I suggest breaking it down into small tasks so that it is easy to review.

Regards,
Jacky

On 2021/08/31 17:47:35, Akash Nilugal <ak...@gmail.com> wrote: 
> Hi Community,
> 
> OLTP systems like MySQL are used heavily for storing transactional data in
> real time, and the same data is later used for fraud detection and for
> making various data-driven business decisions. Since OLTP systems are not
> suited to analytical queries due to their row-based storage, there is a
> need to store this primary data in big data storage in such a way that the
> data on DFS is an exact replica of the data present in MySQL. Traditional
> tools for capturing data from primary databases, like Apache Sqoop, use
> pull-based CDC approaches, which put additional load on the primary
> databases. Hence, log-based CDC solutions have become increasingly popular.
> However, there are two aspects to this problem: we should be able to
> incrementally capture the data changes from primary databases, and we
> should be able to incrementally ingest those changes into the data lake so
> that the overall latency decreases. The former is handled by log-based CDC
> systems like Maxwell and Debezium. Here we are proposing a solution for the
> second aspect using Apache Carbondata.
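> 
> To make the second aspect concrete, here is a minimal sketch (in Scala, on
> Spark Structured Streaming) of the kind of pipeline such a tool automates.
> The event schema, broker, topic, and table names are hypothetical, and it
> assumes the target table's format supports Spark's MERGE INTO; this is an
> illustration, not the tool's actual implementation.
> 
>   import org.apache.spark.sql.{DataFrame, SparkSession}
>   import org.apache.spark.sql.functions.{col, from_json}
>   import org.apache.spark.sql.types.{LongType, StringType, StructType}
> 
>   object CdcIngestSketch {
>     def main(args: Array[String]): Unit = {
>       val spark = SparkSession.builder().appName("cdc-ingest-sketch").getOrCreate()
> 
>       // Simplified Debezium-style change event: the row image plus an
>       // operation flag (c = insert, u = update, d = delete).
>       val eventSchema = new StructType()
>         .add("id", LongType)
>         .add("name", StringType)
>         .add("op", StringType)
> 
>       val events = spark.readStream
>         .format("kafka")
>         .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
>         .option("subscribe", "mysql.inventory.customers") // hypothetical topic
>         .load()
>         .select(from_json(col("value").cast("string"), eventSchema).as("e"))
>         .select("e.*")
> 
>       // Replay each micro-batch onto the target table; the MERGE keeps
>       // the copy on DFS an exact replica of the upstream MySQL table.
>       events.writeStream
>         .foreachBatch { (batch: DataFrame, _: Long) =>
>           batch.createOrReplaceTempView("changes")
>           batch.sparkSession.sql(
>             """MERGE INTO customers t USING changes s ON t.id = s.id
>               |WHEN MATCHED AND s.op = 'd' THEN DELETE
>               |WHEN MATCHED THEN UPDATE SET *
>               |WHEN NOT MATCHED AND s.op != 'd' THEN INSERT *""".stripMargin)
>         }
>         .start()
>         .awaitTermination()
>     }
>   }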
> 
> The Carbondata streamer tool enables users to incrementally ingest data
> from various sources, such as Kafka and DFS, into their data lakes. The
> tool comes with out-of-the-box support for almost all types of schema
> evolution use cases. Currently, the tool can be launched as a Spark
> application, either in continuous mode or as a one-time job.
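> 
> The two launch modes map naturally onto Spark's streaming triggers.
> Continuing the sketch above (the trigger choices here are illustrative,
> not the tool's actual options):
> 
>   import org.apache.spark.sql.streaming.Trigger
> 
>   // Continuous mode: keep the query alive and poll the source on a fixed
>   // interval (the 30-second interval is an arbitrary example).
>   val continuous = events.writeStream.trigger(Trigger.ProcessingTime("30 seconds"))
> 
>   // One-time job: process whatever input is available, then stop.
>   val oneTime = events.writeStream.trigger(Trigger.Once())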
> 
> Further details are present in the design document. Please review the
> design and help improve it. I'm attaching the link to the Google doc;
> you can comment on it directly. Any suggestions and improvements are most
> welcome.
> 
> https://docs.google.com/document/d/1x66X5LU5silp4wLzjxx2Hxmt78gFRLF_8IocapoXxJk/edit?usp=sharing
> 
> Thanks
> 
> Regards,
> Akash R Nilugal
> 

Re: [DISCUSSION] Carbondata Streamer tool and Schema change capture in CDC merge

Posted by Akash r <ak...@gmail.com>.
Hi Likun,

Thanks for the approval.
As you can see in the last section of the design doc, I have divided the
work into scoped tasks and listed the points; we will proceed in that
manner.

Regards,
Akash

On Wed, Sep 1, 2021 at 11:54 AM Jacky Li <ja...@apache.org> wrote:

> +1
> It is a really good feature, looking forward to it.
>
> I suggest breaking it down into small tasks so that it is easy to review.
>
> Regards,
> Jacky