Posted to dev@carbondata.apache.org by vincent gromakowski <vi...@gmail.com> on 2016/09/23 14:32:47 UTC

carbondata and idempotence

Hi Carbondata community,
I am evaluating various file formats right now and found CarbonData to be
interesting, especially the multiple indexes used to avoid full scans, but
I am wondering if there is any way to achieve idempotence when writing to
CarbonData from Spark (or an alternative)?
A strong requirement is that a Spark worker crash must not produce
duplicated entries in Carbon...
Tx

Vincent

Re: carbondata and idempotence

Posted by chenliang613 <ch...@gmail.com>.
Hi Vincent

Happy to hear you are interested in Apache CarbonData.

To write CarbonData files from Spark, please refer to the example:
DataFrameAPIExample.
Can you explain more about this requirement, i.e. avoiding duplicated
entries in Carbon when a Spark worker crashes?

Regards
Liang



--
View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/carbondata-and-idempotence-tp1416p1417.html
Sent from the Apache CarbonData Mailing List archive at Nabble.com.

RE: carbondata and idempotence

Posted by Jihong Ma <Ji...@huawei.com>.
Hi Vincent,

In batch mode, with the overwrite save mode, we can achieve exactly-once semantics, since we simply overwrite any existing files. Other than that, there is no guarantee, because DF/DS/RDD does not maintain any checkpoint/WAL to know where it left off before a crash.

In streaming mode, we will consider going further to guarantee exactly-once semantics with the help of checkpointing the offsets/WAL, introducing a 'transactional' state to uniquely identify the current batch of data, and writing it out only once (ignoring it if it already exists).
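The 'transactional' batch-id idea above can be sketched in a few lines of plain Python (a conceptual illustration only, not CarbonData or Spark code; the class and method names are invented for the sketch): the sink remembers which batch ids it has already committed, and a replayed batch is simply ignored.

```python
# Minimal sketch of an idempotent sink keyed by a unique batch id.
# The 'committed' set stands in for the checkpointed transactional state;
# 'storage' stands in for the files written out.

class IdempotentSink:
    def __init__(self):
        self.committed = set()   # batch ids already written
        self.storage = []        # stands in for the data files

    def write_batch(self, batch_id, rows):
        if batch_id in self.committed:
            return False         # replay after a crash: ignore, already written
        self.storage.extend(rows)
        self.committed.add(batch_id)
        return True

sink = IdempotentSink()
sink.write_batch(1, ["a", "b"])
sink.write_batch(1, ["a", "b"])  # retried batch with the same id is a no-op
assert sink.storage == ["a", "b"]
```

Because the decision to skip is keyed on the batch id rather than on the data, replaying the same batch any number of times leaves the storage unchanged.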

Jihong

-----Original Message-----
From: vincent [mailto:vincent.gromakowski@gmail.com] 
Sent: Tuesday, September 27, 2016 7:11 AM
To: dev@carbondata.incubator.apache.org
Subject: RE: carbondata and idempotence

Hi
thanks for your answer. My question is about both streaming and batch. Even
in batch, if a worker crashes or if speculation is activated, the failed
worker's task will be relaunched on another worker. For example, if the
worker crashed after having ingested 20 000 of the task's 100 000 lines,
the new worker will write the entire 100 000 lines again, resulting in
20 000 duplicated entries in the storage layer.
This issue is generally managed with a primary key or transactions, so that
the new task overwrites the 20 000 lines, or the transaction for the first
20 000 lines is rolled back.



--
View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/carbondata-and-idempotence-tp1416p1518.html
Sent from the Apache CarbonData Mailing List archive at Nabble.com.

RE: carbondata and idempotence

Posted by vincent <vi...@gmail.com>.
Hi
thanks for your answer. My question is about both streaming and batch. Even
in batch, if a worker crashes or if speculation is activated, the failed
worker's task will be relaunched on another worker. For example, if the
worker crashed after having ingested 20 000 of the task's 100 000 lines,
the new worker will write the entire 100 000 lines again, resulting in
20 000 duplicated entries in the storage layer.
This issue is generally managed with a primary key or transactions, so that
the new task overwrites the 20 000 lines, or the transaction for the first
20 000 lines is rolled back.
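The failure mode described above can be sketched in plain Python (a conceptual illustration, not Spark or CarbonData code; the helper names are invented): a task that writes 20 000 of its 100 000 rows, crashes, and is retried in full duplicates rows under a blind append, but not under a primary-key upsert.

```python
# Simulate a task retry after a mid-task crash, comparing a blind append
# (duplicates the partial write) with a primary-key upsert (idempotent).

def run_task(rows, sink, crash_after=None):
    for i, (key, value) in enumerate(rows):
        if crash_after is not None and i == crash_after:
            raise RuntimeError("worker crashed")
        sink(key, value)

rows = [(i, f"v{i}") for i in range(100_000)]

# Blind append: the retry rewrites everything, keeping the partial write.
appended = []
try:
    run_task(rows, lambda k, v: appended.append((k, v)), crash_after=20_000)
except RuntimeError:
    pass
run_task(rows, lambda k, v: appended.append((k, v)))   # relaunched task
assert len(appended) == 120_000                        # 20 000 duplicates

# Primary-key upsert: the retry overwrites the same keys, no duplicates.
table = {}
try:
    run_task(rows, table.__setitem__, crash_after=20_000)
except RuntimeError:
    pass
run_task(rows, table.__setitem__)                      # relaunched task
assert len(table) == 100_000
```

The dict stands in for any storage layer that deduplicates on a primary key; a transactional rollback of the first 20 000 lines would reach the same end state by a different route.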




RE: carbondata and idempotence

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi

I second Jenny here. It's not yet supported, but it would definitely be a good feature.

Regards
JB



On Sep 23, 2016, at 14:03, Jihong Ma <Ji...@huawei.com> wrote:
>Hi Vincent,
>
>Are you referring to writing out Spark streaming data to Carbon files?
>We don't support that yet, but it is in our near-term plan to add the
>integration. We will start the discussion on the dev list soon and
>would love to hear your input. We will take into account the old
>DStream interface as well as Spark 2.0 structured streaming; we would
>like to ensure exactly-once semantics and design Carbon as an
>idempotent sink.
>
>At the moment, we are fully integrated with Spark SQL, with both SQL
>and API interfaces. With the help of multi-level indexes, we have seen
>a dramatic performance boost compared to other columnar file formats
>in the Hadoop ecosystem. You are welcome to try it out for your batch
>processing workload; the streaming ingest will come out a little
>later.
>
>
>Regards.
>
>Jenny   
>
>-----Original Message-----
>From: vincent gromakowski [mailto:vincent.gromakowski@gmail.com] 
>Sent: Friday, September 23, 2016 7:33 AM
>To: dev@carbondata.incubator.apache.org
>Subject: carbondata and idempotence
>
>Hi Carbondata community,
>I am evaluating various file formats right now and found CarbonData to
>be interesting, especially the multiple indexes used to avoid full
>scans, but I am wondering if there is any way to achieve idempotence
>when writing to CarbonData from Spark (or an alternative)?
>A strong requirement is that a Spark worker crash must not produce
>duplicated entries in Carbon...
>Tx
>
>Vincent

RE: carbondata and idempotence

Posted by Jihong Ma <Ji...@huawei.com>.
Hi Vincent,

Are you referring to writing out Spark streaming data to Carbon files? We don't support that yet, but it is in our near-term plan to add the integration. We will start the discussion on the dev list soon and would love to hear your input. We will take into account the old DStream interface as well as Spark 2.0 structured streaming; we would like to ensure exactly-once semantics and design Carbon as an idempotent sink.

At the moment, we are fully integrated with Spark SQL, with both SQL and API interfaces. With the help of multi-level indexes, we have seen a dramatic performance boost compared to other columnar file formats in the Hadoop ecosystem. You are welcome to try it out for your batch processing workload; the streaming ingest will come out a little later.
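The way a multi-level index avoids full scans can be sketched in plain Python (a conceptual illustration, not CarbonData's actual index implementation; the names and layout are invented): keeping a (min, max) summary per data block lets a range query skip whole blocks without reading them.

```python
# Conceptual min/max block index: prune blocks whose value range cannot
# intersect the query range, reading only the blocks that may match.

blocks = [[1, 5, 9], [12, 15, 18], [21, 25, 30]]
index = [(min(b), max(b)) for b in blocks]   # one (min, max) entry per block

def scan(lo, hi):
    hits, blocks_read = [], 0
    for (bmin, bmax), block in zip(index, blocks):
        if bmax < lo or bmin > hi:
            continue                          # pruned without touching data
        blocks_read += 1
        hits.extend(v for v in block if lo <= v <= hi)
    return hits, blocks_read

hits, read = scan(13, 17)
assert hits == [15] and read == 1             # two of three blocks skipped
```

Real formats layer several such structures (block, blocklet, column level), but the principle is the same: the narrower the query range, the fewer blocks ever leave disk.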

Regards.

Jenny   

-----Original Message-----
From: vincent gromakowski [mailto:vincent.gromakowski@gmail.com] 
Sent: Friday, September 23, 2016 7:33 AM
To: dev@carbondata.incubator.apache.org
Subject: carbondata and idempotence

Hi Carbondata community,
I am evaluating various file formats right now and found CarbonData to be
interesting, especially the multiple indexes used to avoid full scans, but
I am wondering if there is any way to achieve idempotence when writing to
CarbonData from Spark (or an alternative)?
A strong requirement is that a Spark worker crash must not produce
duplicated entries in Carbon...
Tx

Vincent