Posted to dev@carbondata.apache.org by Akash Nilugal <ak...@gmail.com> on 2021/08/31 17:47:35 UTC

[DISCUSSION]Carbondata Streamer tool and Schema change capture in CDC merge

Hi Community,

OLTP systems like MySQL are used heavily for storing transactional data in
real time, and the same data is later used for fraud detection and various
data-driven business decisions. Since OLTP systems are not suited for
analytical queries due to their row-based storage, there is a need to store
this primary data in big data storage in such a way that the data on DFS is
an exact replica of the data present in MySQL. Traditional ways of capturing
data from primary databases, like Apache Sqoop, use pull-based CDC
approaches which put additional load on the primary databases. Hence
log-based CDC solutions have become increasingly popular. However, there
are two aspects to this problem: we should be able to incrementally capture
the data changes from primary databases, and we should be able to
incrementally ingest them into the data lake so that the overall latency
decreases. The former is taken care of by log-based CDC systems like
Maxwell and Debezium. Here we are proposing a solution for the second
aspect using Apache CarbonData.

The CarbonData streamer tool enables users to incrementally ingest data
from various sources, like Kafka and DFS, into their data lakes. The tool
comes with out-of-the-box support for almost all types of schema evolution
use cases. Currently, this tool can be launched as a Spark application,
either in continuous mode or as a one-time job.
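
As a rough sketch (not the final tool interface), and assuming the tool is
built on Spark Structured Streaming, the two launch modes map naturally
onto streaming triggers; the class name, Kafka options, topic name and
checkpoint path below are placeholders only:

  // spark-shell style sketch; names are illustrative, not the tool's CLI
  import org.apache.spark.sql.{DataFrame, SparkSession}
  import org.apache.spark.sql.streaming.Trigger

  val spark = SparkSession.builder().appName("CarbonDataStreamer").getOrCreate()

  val source = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "cdc_events")
    .load()

  // Each micro-batch would be merged into the target CarbonData table here.
  def processBatch(batch: DataFrame, batchId: Long): Unit = {
    // dedup + merge logic goes here
  }

  val query = source.writeStream
    .foreachBatch(processBatch _)
    // continuous mode: keep polling the source at a fixed interval
    .trigger(Trigger.ProcessingTime("10 seconds"))
    // one-time job: process whatever is available once and stop
    // .trigger(Trigger.Once())
    .option("checkpointLocation", "/tmp/carbondata-streamer-checkpoint")
    .start()

  query.awaitTermination()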

Further details are present in the design document. Please review the
design and help improve it. I'm attaching the link to the Google doc;
you can comment on it directly. Any suggestions and improvements are most
welcome.

https://docs.google.com/document/d/1x66X5LU5silp4wLzjxx2Hxmt78gFRLF_8IocapoXxJk/edit?usp=sharing

Thanks

Regards,
Akash R Nilugal

Re: [DISCUSSION]Carbondata Streamer tool and Schema change capture in CDC merge

Posted by Akash r <ak...@gmail.com>.
Hi Ravi,

Thanks for the approval and the questions. Please find my comments below.

1. Generally CDC includes IUD operations, so how are you planning to handle
them? Are you planning to use the merge command? If yes, how frequently do
you want to merge?

You are right, it mainly includes the IUD operations. I will prepare the
data frames by reading data from the source (Kafka or DFS) and then call
the merge APIs introduced in the CarbonData 2.2.0 release.
A configuration or CLI argument will be exposed so that the user can decide
the batch time interval. If it is not given, we will use a default that we
still need to decide, around 5 or 10 seconds, based on the delay, the
processing time and the data rate.
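
To make that concrete, a minimal sketch of applying one such micro-batch is
below. The table name and key column are placeholders, and the upsert(...)
line only shows the shape I recall for the 2.2.0 SCD/CDC merge APIs, so the
exact import and signature should be verified against the release docs:

  import org.apache.spark.sql.{DataFrame, SparkSession}

  // Sketch: apply one micro-batch read from Kafka/DFS to the target table.
  def mergeBatch(spark: SparkSession, incoming: DataFrame): Unit = {
    // target CarbonData table, assumed to be registered as default.customer
    val target = spark.table("default.customer")

    // Assumed shape of the CarbonData 2.2.0 upsert API, keyed on "id";
    // it needs the Dataset extension shipped with the carbondata-spark
    // integration (see the 2.2.0 SCD/CDC guide for the exact import):
    // target.upsert(incoming, "id").execute()
  }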

2. How can you make sure of Kafka exactly-once semantics (how can you
ensure data is written once without duplication)?

We have exposed two properties for deduplication, one for insert and one
for upsert.
During insert, if the property is enabled, incoming records that are
already present in the target table are dropped.
In the upsert case, duplicates within the incoming dataset are removed,
keeping only the latest record per key.
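
A minimal sketch of that keep-latest deduplication, assuming a record key
column "id" and an ordering column "event_time" (both names are just
examples for illustration):

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.{col, row_number}

  // Keep only the newest record per key within the incoming batch.
  def latestPerKey(incoming: DataFrame): DataFrame = {
    val byKeyNewestFirst = Window.partitionBy(col("id")).orderBy(col("event_time").desc)
    incoming
      .withColumn("_rn", row_number().over(byKeyNewestFirst))
      .filter(col("_rn") === 1)
      .drop("_rn")
  }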

In the scenario where the same batch comes for upsert again, Carbon does
not write any duplicate data; it simply re-updates the same rows, which in
turn does not affect the store. Here it is better to perform one more
update operation than to maintain extra metadata to identify whether a
batch has already been processed.

Please give your suggestion on this.

Thanks

Regards,
Akash





Re: [DISCUSSION]Carbondata Streamer tool and Schema change capture in CDC merge

Posted by Ravindra Pesala <ra...@gmail.com>.
+1

I would like to understand a few things regarding the design.
1. Generally CDC includes IUD operations, so how are you planning to handle
them? Are you planning to use the merge command? If yes, how frequently do
you want to merge?
2. How can you ensure Kafka exactly-once semantics (i.e., that data is
written once without duplication)?


Regards,
Ravindra.

-- 
Thanks & Regards,
Ravi

Re: [DISCUSSION]Carbondata Streamer tool and Schema change capture in CDC merge

Posted by Shreelekhya Gampa <sh...@gmail.com>.
+1 for the feature.


Re: [DISCUSSION]Carbondata Streamer tool and Schema change capture in CDC merge

Posted by Mahesh Raju Somalaraju <ma...@gmail.com>.
+1 for the streamer tool

Thanks & Regards
-Mahesh Raju S
Githubid: maheshrajus


Re: [DISCUSSION]Carbondata Streamer tool and Schema change capture in CDC merge

Posted by Pratyaksh Sharma <pr...@gmail.com>.
Hi Indhu,

Apologies for the late reply. Please find the inline answers below.

1. For the multi-table merge scenario, does it support concurrent CDC or
sequential CDC to the target table?

> In phase 1, we are supporting a scenario where multiple tables (all with
the same schema) are pushed to the same topic, and the events from that
topic are ingested in a single iteration.

In the next phases, when we support ingesting tables with different
schemas, we plan to have concurrent ingestion only.

2. In failure scenarios (like the streamer tool being killed or crashing),
how can we ensure data is not duplicated when the Kafka ingestion restarts?

> For this scenario, we have a few configurable options like `--deduplicate`
and `--combine-before-upsert`. If they are set to true, the following
operations are done -
a) The incoming batch of records is deduplicated for events with the same
record key.
b) In the case of the INSERT operation type, records that already exist in
the target table are removed from the incoming batch.

Hence the use of Spark Streaming checkpointing, together with these
options, helps ensure there are no duplicates.
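
As a small illustration of the checkpointing part (assuming Structured
Streaming; broker, topic and path values are placeholders): with a fixed
checkpoint location, a restarted query resumes from the last committed
Kafka offsets instead of re-reading data that was already processed.

  // spark-shell style sketch of a restart-safe ingestion query
  import org.apache.spark.sql.{DataFrame, SparkSession}

  val spark = SparkSession.builder().appName("streamer-restart-sketch").getOrCreate()

  val events = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "cdc_events")
    .load()

  def processBatch(batch: DataFrame, batchId: Long): Unit = {
    // deduplicate the batch and merge it into the target table here
  }

  events.writeStream
    .foreachBatch(processBatch _)
    // On restart, Spark resumes from the offsets recorded in this location,
    // so batches that were already committed are not consumed again.
    .option("checkpointLocation", "hdfs:///checkpoints/carbondata-streamer")
    .start()
    .awaitTermination()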

Hope that answers your queries.


Re: [DISCUSSION]Carbondata Streamer tool and Schema change capture in CDC merge

Posted by Nihal ojha <ni...@gmail.com>.
+1, good idea to implement the streamer tool.

Regards
Nihal


Re: [DISCUSSION]Carbondata Streamer tool and Schema change capture in CDC merge

Posted by Indhumathi M <in...@gmail.com>.
+1 for the streamer tool.

I have some questions listed below.
1. For the multi-table merge scenario, does it support concurrent CDC or
sequential CDC to the target table?
2. In failure scenarios (like the streamer tool being killed or crashing),
how can we ensure data is not duplicated when the Kafka ingestion restarts?

Regards,
Indhumathi M


Re: [DISCUSSION]Carbondata Streamer tool and Schema change capture in CDC merge

Posted by Akash r <ak...@gmail.com>.
Hi Likun,

Thanks for the approval.
As you can see in the last section of the design doc, I have divided the
work into scoped phases and added the corresponding points; we will
proceed in that manner.

Regards,
Akash


Re: [DISCUSSION]Carbondata Streamer tool and Schema change capture in CDC merge

Posted by Jacky Li <ja...@apache.org>.
+1
It is a really good feature, looking forward to it.

I suggest breaking it down into small tasks so that it is easy to review.

Regards,
Jacky
