Posted to dev@carbondata.apache.org by GitBox <gi...@apache.org> on 2021/12/10 14:00:46 UTC
[GitHub] [carbondata] pratyakshsharma opened a new pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
pratyakshsharma opened a new pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243
### Why is this PR needed?
### What changes were proposed in this PR?
### Does this PR introduce any user interface change?
- No
- Yes. (please explain the change and update document)
### Is any new testcase added?
- No
- Yes
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@carbondata.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-1001940835
Build Success with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/579/
[GitHub] [carbondata] asfgit closed pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
asfgit closed pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-991114414
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/6159/
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-991614577
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-997387132
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/6170/
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-998513376
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/6176/
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-991645706
Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4417/
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-994092913
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/6165/
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-993988013
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4422/
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-995222533
Build Success with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/557/
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-998706543
Build Failed with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/569/
[GitHub] [carbondata] akashrn5 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
akashrn5 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-1000079551
retest this please
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-1000311466
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4440/
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-1000657172
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4441/
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-1001109541
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4442/
[GitHub] [carbondata] pratyakshsharma commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-991566753
retest this please
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-997377955
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/6169/
[GitHub] [carbondata] pratyakshsharma commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-990997380
@akashrn5 please take a look.
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-998512515
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4432/
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-996597337
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/6167/
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-1002070646
Build Failed with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/581/
[GitHub] [carbondata] akashrn5 commented on a change in pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
akashrn5 commented on a change in pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#discussion_r769692957
##########
File path: docs/scd-and-cdc-guide.md
##########
@@ -131,4 +131,88 @@ clauses can have at most one UPDATE and one DELETE action, These clauses have th
* Please refer example class [MergeTestCase](https://github.com/apache/carbondata/blob/master/integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/merge/MergeTestCase.scala) to understand and implement scd and cdc scenarios using APIs.
* Please refer example class [DataMergeIntoExample](https://github.com/apache/carbondata/blob/master/examples/spark/src/main/scala/org/apache/carbondata/examples/DataMergeIntoExample.scala) to understand and implement scd and cdc scenarios using sql.
-* Please refer example class [DataUPSERTExample](https://github.com/apache/carbondata/blob/master/examples/spark/src/main/scala/org/apache/carbondata/examples/DataUPSERTExample.scala) to understand and implement cdc using UPSERT APIs.
\ No newline at end of file
+* Please refer example class [DataUPSERTExample](https://github.com/apache/carbondata/blob/master/examples/spark/src/main/scala/org/apache/carbondata/examples/DataUPSERTExample.scala) to understand and implement cdc using UPSERT APIs.
+
+### Streamer Tool
+
+Carbondata streamer tool is a very powerful tool for incrementally capturing change events from varied sources like kafka or DFS and merging them into target carbondata table. This essentially means one needs to integrate with external solutions like Debezium or Maxwell for moving the change events to kafka, if one wishes to capture changes from primary databases like mysql. The tool currently requires incoming data to be present in avro format and incoming schema to evolve in backwards compatible way.
+
+Below is a high level architecture of how the overall pipeline looks like -
+
+![Carbondata streamer tool pipeline](../docs/images/carbondata-streamer-tool-pipeline.png?raw=true)
+
+#### Configs
+
+Streamer tool exposes below configs for users to cater to their CDC use cases -
+
+| Parameter | Default Value | Description |
+|-----------------------------------|------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| carbon.streamer.target.database | (none) | The database name where the target table is present to merge the incoming data. If not given by user, system will take the current database in the spark session. |
Review comment:
In the Default Value column, please mention that it takes the current database from the Spark session.
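The fallback this comment refers to can be sketched as follows. This is a minimal illustration and not the streamer tool's code: `resolve_target_database` and the plain-dict `props` are hypothetical, and `current_database` stands in for whatever the Spark session reports as its current database.

```python
from typing import Mapping, Optional

TARGET_DATABASE_KEY = "carbon.streamer.target.database"


def resolve_target_database(props: Mapping[str, str], current_database: str) -> str:
    """Return the configured target database, or the session's current one.

    Hypothetical helper: `current_database` is assumed to come from the
    Spark session (for example `spark.catalog.currentDatabase()` in PySpark).
    """
    configured: Optional[str] = props.get(TARGET_DATABASE_KEY)
    # Treat a missing or blank value as "not given by user".
    if configured is None or not configured.strip():
        return current_database
    return configured.strip()


if __name__ == "__main__":
    print(resolve_target_database({}, "default"))
    print(resolve_target_database({TARGET_DATABASE_KEY: "sales"}, "default"))
```

Documenting the effective default in the table, as requested, would spare readers from having to infer this behaviour from the description column.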
##########
File path: docs/scd-and-cdc-guide.md
##########
@@ -131,4 +131,88 @@ clauses can have at most one UPDATE and one DELETE action, These clauses have th
* Please refer example class [MergeTestCase](https://github.com/apache/carbondata/blob/master/integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/merge/MergeTestCase.scala) to understand and implement scd and cdc scenarios using APIs.
* Please refer example class [DataMergeIntoExample](https://github.com/apache/carbondata/blob/master/examples/spark/src/main/scala/org/apache/carbondata/examples/DataMergeIntoExample.scala) to understand and implement scd and cdc scenarios using sql.
-* Please refer example class [DataUPSERTExample](https://github.com/apache/carbondata/blob/master/examples/spark/src/main/scala/org/apache/carbondata/examples/DataUPSERTExample.scala) to understand and implement cdc using UPSERT APIs.
\ No newline at end of file
+* Please refer example class [DataUPSERTExample](https://github.com/apache/carbondata/blob/master/examples/spark/src/main/scala/org/apache/carbondata/examples/DataUPSERTExample.scala) to understand and implement cdc using UPSERT APIs.
+
+### Streamer Tool
+
+Carbondata streamer tool is a very powerful tool for incrementally capturing change events from varied sources like kafka or DFS and merging them into target carbondata table. This essentially means one needs to integrate with external solutions like Debezium or Maxwell for moving the change events to kafka, if one wishes to capture changes from primary databases like mysql. The tool currently requires incoming data to be present in avro format and incoming schema to evolve in backwards compatible way.
+
+Below is a high level architecture of how the overall pipeline looks like -
+
+![Carbondata streamer tool pipeline](../docs/images/carbondata-streamer-tool-pipeline.png?raw=true)
+
+#### Configs
+
+Streamer tool exposes below configs for users to cater to their CDC use cases -
+
+| Parameter | Default Value | Description |
+|-----------------------------------|------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| carbon.streamer.target.database | (none) | The database name where the target table is present to merge the incoming data. If not given by user, system will take the current database in the spark session. |
+| carbon.streamer.target.table | (none) | The target carbondata table where the data has to be merged. If this is not configured by user, the operation will fail. |
+| carbon.streamer.source.type | kafka | Streamer tool currently supports two types of data sources. One can ingest data from either kafka or DFS into target carbondata table using streamer tool. |
+| carbon.streamer.dfs.input.path | (none) | An absolute path on a given file system from where data needs to be read to ingest into the target carbondata table. Mandatory if the ingestion source type is DFS. |
+| schema.registry.url | (none) | Streamer tool supports 2 different ways to supply schema of incoming data. Schemas can be supplied using avro files (file based schema provider) or using schema registry. This property defines the url to connect to in case schema registry is used as the schema source. |
+| carbon.streamer.input.kafka.topic | (none) | This is a mandatory property to be set in case kafka is chosen as the source of data. This property defines the topics from where streamer tool will consume the data. |
+| bootstrap.servers | (none) | This is another mandatory property in case kafka is chosen as the source of data. This defines the end points for kafka brokers. |
+| auto.offset.reset | earliest | Streamer tool maintains checkpoints to keep a track of the incoming messages which are already consumed. In case of first ingestion using kafka source, this property defines the offset from where ingestion will start. This property can take only 2 valid values - `latest` and `earliest` |
+| key.deserializer | `org.apache.kafka.common.serialization.StringDeserializer` | Any message in kafka is ultimately a key value pair in the form of serialized bytes. This property defines the deserializer to deserialize the key of a message. |
+| value.deserializer | `io.confluent.kafka.serializers.KafkaAvroDeserializer` | This property defines the class which will be used for deserializing the values present in kafka topic. |
Review comment:
Actually, neither `key.deserializer` nor `value.deserializer` is used in the code; we use Spark's built-in Avro deserializer instead. Can we remove these two from the code and also from the doc?
##########
File path: docs/configuration-parameters.md
##########
@@ -179,7 +179,6 @@ This section provides the details of all the configurations required for the Car
| carbon.update.storage.level | MEMORY_AND_DISK | Storage level to persist dataset of a RDD/dataframe. Applicable when ***carbon.update.persist.enable*** is **true**, if user's executor has less memory, set this parameter to 'MEMORY_AND_DISK_SER' or other storage level to correspond to different environment. [See detail](http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence). |
| carbon.update.check.unique.value | true | By default this property is true, so update will validate key value mapping. This validation might have slight degrade in performance of update query. If user knows that key value mapping is correct, can disable this validation for better update performance by setting this property to false. |
-
Review comment:
Please revert this change if it is not needed.
##########
File path: docs/scd-and-cdc-guide.md
##########
@@ -131,4 +131,88 @@ clauses can have at most one UPDATE and one DELETE action, These clauses have th
* Please refer example class [MergeTestCase](https://github.com/apache/carbondata/blob/master/integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/merge/MergeTestCase.scala) to understand and implement scd and cdc scenarios using APIs.
* Please refer example class [DataMergeIntoExample](https://github.com/apache/carbondata/blob/master/examples/spark/src/main/scala/org/apache/carbondata/examples/DataMergeIntoExample.scala) to understand and implement scd and cdc scenarios using sql.
-* Please refer example class [DataUPSERTExample](https://github.com/apache/carbondata/blob/master/examples/spark/src/main/scala/org/apache/carbondata/examples/DataUPSERTExample.scala) to understand and implement cdc using UPSERT APIs.
\ No newline at end of file
+* Please refer example class [DataUPSERTExample](https://github.com/apache/carbondata/blob/master/examples/spark/src/main/scala/org/apache/carbondata/examples/DataUPSERTExample.scala) to understand and implement cdc using UPSERT APIs.
+
+### Streamer Tool
+
+Carbondata streamer tool is a very powerful tool for incrementally capturing change events from varied sources like kafka or DFS and merging them into target carbondata table. This essentially means one needs to integrate with external solutions like Debezium or Maxwell for moving the change events to kafka, if one wishes to capture changes from primary databases like mysql. The tool currently requires incoming data to be present in avro format and incoming schema to evolve in backwards compatible way.
+
+Below is a high level architecture of how the overall pipeline looks like -
+
+![Carbondata streamer tool pipeline](../docs/images/carbondata-streamer-tool-pipeline.png?raw=true)
+
+#### Configs
+
+Streamer tool exposes below configs for users to cater to their CDC use cases -
+
+| Parameter | Default Value | Description |
+|-----------------------------------|------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| carbon.streamer.target.database | (none) | The database name where the target table is present to merge the incoming data. If not given by user, system will take the current database in the spark session. |
+| carbon.streamer.target.table | (none) | The target carbondata table where the data has to be merged. If this is not configured by user, the operation will fail. |
+| carbon.streamer.source.type | kafka | Streamer tool currently supports two types of data sources. One can ingest data from either kafka or DFS into target carbondata table using streamer tool. |
+| carbon.streamer.dfs.input.path | (none) | An absolute path on a given file system from where data needs to be read to ingest into the target carbondata table. Mandatory if the ingestion source type is DFS. |
+| schema.registry.url | (none) | Streamer tool supports 2 different ways to supply schema of incoming data. Schemas can be supplied using avro files (file based schema provider) or using schema registry. This property defines the url to connect to in case schema registry is used as the schema source. |
+| carbon.streamer.input.kafka.topic | (none) | This is a mandatory property to be set in case kafka is chosen as the source of data. This property defines the topics from where streamer tool will consume the data. |
+| bootstrap.servers | (none) | This is another mandatory property in case kafka is chosen as the source of data. This defines the end points for kafka brokers. |
+| auto.offset.reset | earliest | Streamer tool maintains checkpoints to keep a track of the incoming messages which are already consumed. In case of first ingestion using kafka source, this property defines the offset from where ingestion will start. This property can take only 2 valid values - `latest` and `earliest` |
+| key.deserializer | `org.apache.kafka.common.serialization.StringDeserializer` | Any message in kafka is ultimately a key value pair in the form of serialized bytes. This property defines the deserializer to deserialize the key of a message. |
+| value.deserializer | `io.confluent.kafka.serializers.KafkaAvroDeserializer` | This property defines the class which will be used for deserializing the values present in kafka topic. |
+| enable.auto.commit | false | Kafka maintains an internal topic for storing offsets corresponding to the consumer groups. This property determines if kafka should actually go forward and commit the offsets consumed in this internal topic. We recommend to keep it as false since we use spark streaming checkpointing to take care of the same. |
+| group.id | (none) | Streamer tool is ultimately a consumer for kafka. This property determines the consumer group id streamer tool belongs to. |
+| carbon.streamer.input.payload.format | avro | This determines the format of the incoming messages from source. Currently only avro is supported. We have plans to extend this support to json as well in near future. Avro is the most preferred format for CDC use cases since it helps in making the message size very compact and has good support for schema evolution use cases as well. |
+| carbon.streamer.schema.provider | SchemaRegistry | As discussed earlier, streamer tool supports 2 ways of supplying schema for incoming messages - schema registry and avro files. Confluent schema registry is the preferred way when using avro as the input format. |
+| carbon.streamer.source.schema.path | (none) | This property defines the absolute path where files containing schemas for incoming messages are present. |
+| carbon.streamer.merge.operation.type | upsert | This defines the operation that needs to be performed on the incoming batch of data while writing it to target data set. |
+| carbon.streamer.merge.operation.field | (none) | This property defines the field in incoming schema which contains the type of operation performed at source. For example, Debezium includes a field called `op` when reading change events from primary database. Do not confuse this property with `carbon.streamer.merge.operation.type` which defines the operation to be performed on the incoming batch of data. However this property is needed so that streamer tool is able to identify rows deleted at source when the operation type is `upsert`. |
+| carbon.streamer.record.key.field | (none) | This defines the record key for a particular incoming record. This is used by the streamer tool for performing deduplication. In case this is not defined, operation will fail. |
+| carbon.streamer.batch.interval | 10 | Minimum batch interval time between 2 continuous ingestion in continuous mode. Should be specified in seconds. |
+| carbon.streamer.source.ordering.field | (none) | Name of the field from source schema whose value can be used for picking the latest updates for a particular record in the incoming batch in case of multiple updates for the same record key. Useful if the write operation type is UPDATE or UPSERT. This will be used only if `carbon.streamer.upsert.deduplicate` is enabled. |
+| carbon.streamer.insert.deduplicate | false | This property specifies if the incoming batch needs to be deduplicated in case of INSERT operation type. If set to true, the incoming batch will be deduplicated against the existing data in the target carbondata table. |
+| carbon.streamer.upsert.deduplicate | true | This property specifies if the incoming batch needs to be deduplicated (when multiple updates for the same record key are present in the incoming batch) in case of UPSERT/UPDATE operation type. If set to true, the user needs to provide proper value for the source ordering field as well. |
+| carbon.streamer.meta.columns | (none) | Generally when performing CDC operations on primary databases, few metadata columns are added along with the actual columns for book keeping purposes. This property enables users to list down all such metadata fields (comma separated) which should not be merged with the target carboondata table. |
+| carbon.enable.schema.enforcement | true | This flag decides if table schema needs to change as per the incoming batch schema. If set to true, incoming schema will be validated with existing table schema. If the schema has evolved, the incoming batch cannot be ingested and job will simply fail. |
+
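The interaction between `carbon.streamer.record.key.field`, `carbon.streamer.source.ordering.field` and `carbon.streamer.upsert.deduplicate` described in the table above can be illustrated with a small sketch. This is not the streamer tool's code. The function name `deduplicate_batch` is hypothetical, and it assumes that "latest" means the row with the largest ordering value, with later rows winning ties.

```python
from typing import Any, Dict, Iterable, List


def deduplicate_batch(
    records: Iterable[Dict[str, Any]],
    key_field: str,
    ordering_field: str,
) -> List[Dict[str, Any]]:
    """Keep one row per record key: the one with the greatest ordering value.

    Hypothetical helper, only meant to illustrate the documented semantics.
    """
    latest: Dict[Any, Dict[str, Any]] = {}
    for row in records:
        key = row[key_field]
        current = latest.get(key)
        # Assumption: a larger ordering value means a more recent update;
        # on a tie the row seen later in the batch replaces the earlier one.
        if current is None or row[ordering_field] >= current[ordering_field]:
            latest[key] = row
    return list(latest.values())


batch = [
    {"id": 1, "ts": 10, "name": "a"},
    {"id": 2, "ts": 5, "name": "b"},
    {"id": 1, "ts": 12, "name": "c"},
]
result = deduplicate_batch(batch, key_field="id", ordering_field="ts")
```

Here `id` plays the role of the record key field and `ts` the role of the source ordering field. Only the `ts=12` row survives for `id=1`.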
+#### Commands
+
+1. For kafka source -
+
+```
+bin/spark-submit --class org.apache.carbondata.streamer.CarbonDataStreamer \
+--master spark://root1-ThinkPad-T490s:7077 \
+jars/apache-carbondata-2.3.0-SNAPSHOT-bin-spark2.4.5-hadoop2.7.2.jar \
Review comment:
For the master URL, remove the specific address and replace it with a more generalized one. Also, carbondata 2.3.0 is not yet released, so instead of giving the actual jar name, you can use placeholders like `<carbondata assembly jar path>`, `<spark master url>` and `<schema registry URL>`. This looks better, I guess.
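As a rough illustration of the placeholder idea, the launch command could be assembled with the environment-specific values passed in as parameters. This is only a sketch: `build_streamer_command` is a hypothetical helper, it uses just the class name and `--master` option visible in the quoted diff, and any further streamer arguments are left to the caller because they are not shown here.

```python
import shlex
from typing import List, Sequence


def build_streamer_command(
    spark_master_url: str,
    assembly_jar_path: str,
    extra_args: Sequence[str] = (),
) -> List[str]:
    """Build the spark-submit argument list for the streamer tool.

    Hypothetical helper: only the options visible in the quoted diff are
    hard-coded; everything else must be supplied through extra_args.
    """
    return [
        "bin/spark-submit",
        "--class", "org.apache.carbondata.streamer.CarbonDataStreamer",
        "--master", spark_master_url,
        assembly_jar_path,
        *extra_args,
    ]


# The placeholders suggested in the review stand in for real values.
command = build_streamer_command(
    "<spark master url>",
    "<carbondata assembly jar path>",
)
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(command))
```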
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-995231504
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/6166/
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-998719392
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/6178/
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-1000683667
Build Failed with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/576/
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-1000145766
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/6183/
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-1002057639
Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4446/
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-996609517
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4424/
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-991054672
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4416/
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-991630549
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/6160/
[GitHub] [carbondata] akashrn5 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
akashrn5 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-1002137006
LGTM, the CI failure is not related to this PR
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-1001933888
Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4444/
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-1001932080
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/6188/
[GitHub] [carbondata] ydvpankaj99 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
ydvpankaj99 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-1001373413
retest this please
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-1000676215
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/6185/
[GitHub] [carbondata] akashrn5 commented on a change in pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
akashrn5 commented on a change in pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#discussion_r770227892
##########
File path: docs/scd-and-cdc-guide.md
##########
@@ -131,4 +131,88 @@ clauses can have at most one UPDATE and one DELETE action, These clauses have th
* Please refer example class [MergeTestCase](https://github.com/apache/carbondata/blob/master/integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/merge/MergeTestCase.scala) to understand and implement scd and cdc scenarios using APIs.
* Please refer example class [DataMergeIntoExample](https://github.com/apache/carbondata/blob/master/examples/spark/src/main/scala/org/apache/carbondata/examples/DataMergeIntoExample.scala) to understand and implement scd and cdc scenarios using sql.
-* Please refer example class [DataUPSERTExample](https://github.com/apache/carbondata/blob/master/examples/spark/src/main/scala/org/apache/carbondata/examples/DataUPSERTExample.scala) to understand and implement cdc using UPSERT APIs.
\ No newline at end of file
+* Please refer example class [DataUPSERTExample](https://github.com/apache/carbondata/blob/master/examples/spark/src/main/scala/org/apache/carbondata/examples/DataUPSERTExample.scala) to understand and implement cdc using UPSERT APIs.
+
+### Streamer Tool
+
+Carbondata streamer tool is a tool for incrementally capturing change events from varied sources like kafka or DFS and merging them into a target carbondata table. To capture changes from primary databases like mysql, one needs to integrate with external solutions like Debezium or Maxwell to move the change events to kafka. The tool currently requires incoming data to be in avro format and the incoming schema to evolve in a backwards compatible way.
+
+Below is the high level architecture of the overall pipeline -
+
+![Carbondata streamer tool pipeline](../docs/images/carbondata-streamer-tool-pipeline.png?raw=true)
+
+#### Configs
+
+The streamer tool exposes the below configs for users to cater to their CDC use cases -
+
+| Parameter | Default Value | Description |
+|-----------------------------------|------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| carbon.streamer.target.database | (none) | The database containing the target table into which the incoming data is merged. If not given by the user, the system uses the current database of the spark session. |
+| carbon.streamer.target.table | (none) | The target carbondata table into which the data has to be merged. If this is not configured by the user, the operation will fail. |
+| carbon.streamer.source.type | kafka | The streamer tool currently supports two types of data sources: one can ingest data from either kafka or DFS into the target carbondata table. |
+| carbon.streamer.dfs.input.path | (none) | An absolute path on a given file system from which data is read for ingestion into the target carbondata table. Mandatory if the ingestion source type is DFS. |
+| schema.registry.url | (none) | The streamer tool supports 2 ways to supply the schema of incoming data: avro files (file based schema provider) or a schema registry. This property defines the url to connect to when a schema registry is used as the schema source. |
+| carbon.streamer.input.kafka.topic | (none) | Mandatory if kafka is chosen as the source of data. Defines the topics from which the streamer tool consumes data. |
+| bootstrap.servers | (none) | Also mandatory if kafka is chosen as the source of data. Defines the endpoints of the kafka brokers. |
+| auto.offset.reset | earliest | The streamer tool maintains checkpoints to keep track of the incoming messages which are already consumed. For the first ingestion from a kafka source, this property defines the offset from which ingestion starts. It can take only 2 valid values: `latest` and `earliest`. |
+| key.deserializer | `org.apache.kafka.common.serialization.StringDeserializer` | Any message in kafka is ultimately a key value pair in the form of serialized bytes. This property defines the deserializer used for the key of a message. |
+| value.deserializer | `io.confluent.kafka.serializers.KafkaAvroDeserializer` | This property defines the class which will be used for deserializing the values present in kafka topic. |
Review comment:
yes
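The mandatory-property rules stated in the quoted table (target table always required, topic and brokers required for kafka, input path required for DFS) can be sketched as a small pre-flight check. This is an illustrative sketch, not the tool's validation logic. The helper name `missing_properties` is hypothetical, and accepting the source type values `kafka` and `dfs` case-insensitively is an assumption.

```python
from typing import Dict, List


def missing_properties(props: Dict[str, str]) -> List[str]:
    """Return the mandatory properties that are absent or empty.

    Hypothetical helper based only on the rules stated in the quoted table.
    """
    # The target table is documented as mandatory for every source type.
    required = ["carbon.streamer.target.table"]
    # Assumption: source type values are "kafka" or "dfs", compared
    # case-insensitively; "kafka" is the documented default.
    source_type = props.get("carbon.streamer.source.type", "kafka").lower()
    if source_type == "kafka":
        required += ["carbon.streamer.input.kafka.topic", "bootstrap.servers"]
    elif source_type == "dfs":
        required.append("carbon.streamer.dfs.input.path")
    return [name for name in required if not props.get(name)]


kafka_missing = missing_properties({"carbon.streamer.target.table": "target"})
dfs_missing = missing_properties({
    "carbon.streamer.target.table": "target",
    "carbon.streamer.source.type": "DFS",
    "carbon.streamer.dfs.input.path": "/data/in",
})
```

Running such a check before submitting the job would surface a missing topic or path early instead of at ingestion time.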
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-991614577
Build Failed with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/551/
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-997388158
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4427/
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-997377506
Build Failed with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/560/
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-1001118920
Build Success with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/577/
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-1001380833
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/6187/
[GitHub] [carbondata] brijoobopanna commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
brijoobopanna commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-998691004
retest this please
[GitHub] [carbondata] pratyakshsharma commented on a change in pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#discussion_r769919929
##########
File path: docs/scd-and-cdc-guide.md
##########
@@ -131,4 +131,88 @@ clauses can have at most one UPDATE and one DELETE action, These clauses have th
+| value.deserializer | `io.confluent.kafka.serializers.KafkaAvroDeserializer` | This property defines the class which will be used for deserializing the values present in kafka topic. |
Review comment:
So we do not want to keep it configurable as of now from user's point of view?
[GitHub] [carbondata] pratyakshsharma commented on a change in pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#discussion_r769919929
##########
File path: docs/scd-and-cdc-guide.md
##########
@@ -131,4 +131,88 @@ clauses can have at most one UPDATE and one DELETE action, These clauses have th
+| value.deserializer | `io.confluent.kafka.serializers.KafkaAvroDeserializer` | This property defines the class which will be used for deserializing the values present in kafka topic. |
Review comment:
So we do not want to keep it configurable as of now?
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-997378756
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4426/
--
[GitHub] [carbondata] pratyakshsharma commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-997383843
@ydvpankaj99 can you please check why this error keeps occurring? Can we do something to make CI more stable?
--
[GitHub] [carbondata] pratyakshsharma commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-997383867
retest this please
--
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-1001135110
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/6186/
--
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-999842371
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4438/
--
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-1002045309
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/6190/
--
[GitHub] [carbondata] pratyakshsharma commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-991936752
retest this please
--
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-991967054
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/6161/
--
[GitHub] [carbondata] pratyakshsharma commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-991566753
retest this please
--
[GitHub] [carbondata] pratyakshsharma commented on a change in pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#discussion_r769867806
##########
File path: docs/scd-and-cdc-guide.md
##########
@@ -131,4 +131,88 @@ clauses can have at most one UPDATE and one DELETE action, These clauses have th
* Please refer example class [MergeTestCase](https://github.com/apache/carbondata/blob/master/integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/merge/MergeTestCase.scala) to understand and implement scd and cdc scenarios using APIs.
* Please refer example class [DataMergeIntoExample](https://github.com/apache/carbondata/blob/master/examples/spark/src/main/scala/org/apache/carbondata/examples/DataMergeIntoExample.scala) to understand and implement scd and cdc scenarios using sql.
-* Please refer example class [DataUPSERTExample](https://github.com/apache/carbondata/blob/master/examples/spark/src/main/scala/org/apache/carbondata/examples/DataUPSERTExample.scala) to understand and implement cdc using UPSERT APIs.
\ No newline at end of file
+* Please refer example class [DataUPSERTExample](https://github.com/apache/carbondata/blob/master/examples/spark/src/main/scala/org/apache/carbondata/examples/DataUPSERTExample.scala) to understand and implement cdc using UPSERT APIs.
+
+### Streamer Tool
+
+Carbondata streamer tool is a very powerful tool for incrementally capturing change events from varied sources like kafka or DFS and merging them into target carbondata table. This essentially means one needs to integrate with external solutions like Debezium or Maxwell for moving the change events to kafka, if one wishes to capture changes from primary databases like mysql. The tool currently requires incoming data to be present in avro format and incoming schema to evolve in backwards compatible way.
+
+Below is a high level architecture of how the overall pipeline looks like -
+
+![Carbondata streamer tool pipeline](../docs/images/carbondata-streamer-tool-pipeline.png?raw=true)
+
+#### Configs
+
+Streamer tool exposes below configs for users to cater to their CDC use cases -
+
+| Parameter | Default Value | Description |
+|-----------------------------------|------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| carbon.streamer.target.database | (none) | The database name where the target table is present to merge the incoming data. If not given by user, system will take the current database in the spark session. |
+| carbon.streamer.target.table | (none) | The target carbondata table where the data has to be merged. If this is not configured by user, the operation will fail. |
+| carbon.streamer.source.type | kafka | Streamer tool currently supports two types of data sources. One can ingest data from either kafka or DFS into target carbondata table using streamer tool. |
+| carbon.streamer.dfs.input.path | (none) | An absolute path on a given file system from where data needs to be read to ingest into the target carbondata table. Mandatory if the ingestion source type is DFS. |
+| schema.registry.url | (none) | Streamer tool supports 2 different ways to supply schema of incoming data. Schemas can be supplied using avro files (file based schema provider) or using schema registry. This property defines the url to connect to in case schema registry is used as the schema source. |
+| carbon.streamer.input.kafka.topic | (none) | This is a mandatory property to be set in case kafka is chosen as the source of data. This property defines the topics from where streamer tool will consume the data. |
+| bootstrap.servers | (none) | This is another mandatory property in case kafka is chosen as the source of data. This defines the end points for kafka brokers. |
+| auto.offset.reset | earliest | Streamer tool maintains checkpoints to keep a track of the incoming messages which are already consumed. In case of first ingestion using kafka source, this property defines the offset from where ingestion will start. This property can take only 2 valid values - `latest` and `earliest` |
+| key.deserializer | `org.apache.kafka.common.serialization.StringDeserializer` | Any message in kafka is ultimately a key value pair in the form of serialized bytes. This property defines the deserializer to deserialize the key of a message. |
+| value.deserializer | `io.confluent.kafka.serializers.KafkaAvroDeserializer` | This property defines the class which will be used for deserializing the values present in kafka topic. |
+| enable.auto.commit | false | Kafka maintains an internal topic for storing offsets corresponding to the consumer groups. This property determines if kafka should actually go forward and commit the offsets consumed in this internal topic. We recommend to keep it as false since we use spark streaming checkpointing to take care of the same. |
+| group.id | (none) | Streamer tool is ultimately a consumer for kafka. This property determines the consumer group id streamer tool belongs to. |
+| carbon.streamer.input.payload.format | avro | This determines the format of the incoming messages from source. Currently only avro is supported. We have plans to extend this support to json as well in near future. Avro is the most preferred format for CDC use cases since it helps in making the message size very compact and has good support for schema evolution use cases as well. |
+| carbon.streamer.schema.provider | SchemaRegistry | As discussed earlier, streamer tool supports 2 ways of supplying schema for incoming messages - schema registry and avro files. Confluent schema registry is the preferred way when using avro as the input format. |
+| carbon.streamer.source.schema.path | (none) | This property defines the absolute path where files containing schemas for incoming messages are present. |
+| carbon.streamer.merge.operation.type | upsert | This defines the operation that needs to be performed on the incoming batch of data while writing it to target data set. |
+| carbon.streamer.merge.operation.field | (none) | This property defines the field in incoming schema which contains the type of operation performed at source. For example, Debezium includes a field called `op` when reading change events from primary database. Do not confuse this property with `carbon.streamer.merge.operation.type` which defines the operation to be performed on the incoming batch of data. However this property is needed so that streamer tool is able to identify rows deleted at source when the operation type is `upsert`. |
+| carbon.streamer.record.key.field | (none) | This defines the record key for a particular incoming record. This is used by the streamer tool for performing deduplication. In case this is not defined, operation will fail. |
+| carbon.streamer.batch.interval | 10 | Minimum batch interval time between 2 continuous ingestion in continuous mode. Should be specified in seconds. |
+| carbon.streamer.source.ordering.field | (none) | Name of the field from source schema whose value can be used for picking the latest updates for a particular record in the incoming batch in case of multiple updates for the same record key. Useful if the write operation type is UPDATE or UPSERT. This will be used only if `carbon.streamer.upsert.deduplicate` is enabled. |
+| carbon.streamer.insert.deduplicate | false | This property specifies if the incoming batch needs to be deduplicated in case of INSERT operation type. If set to true, the incoming batch will be deduplicated against the existing data in the target carbondata table. |
+| carbon.streamer.upsert.deduplicate | true | This property specifies if the incoming batch needs to be deduplicated (when multiple updates for the same record key are present in the incoming batch) in case of UPSERT/UPDATE operation type. If set to true, the user needs to provide proper value for the source ordering field as well. |
+| carbon.streamer.meta.columns | (none) | Generally when performing CDC operations on primary databases, a few metadata columns are added along with the actual columns for bookkeeping purposes. This property enables users to list all such metadata fields (comma separated) which should not be merged with the target carbondata table. |
+| carbon.enable.schema.enforcement | true | This flag decides if table schema needs to change as per the incoming batch schema. If set to true, incoming schema will be validated with existing table schema. If the schema has evolved, the incoming batch cannot be ingested and job will simply fail. |
+
+#### Commands
+
+1. For kafka source -
+
+```
+bin/spark-submit --class org.apache.carbondata.streamer.CarbonDataStreamer \
+--master spark://root1-ThinkPad-T490s:7077 \
+jars/apache-carbondata-2.3.0-SNAPSHOT-bin-spark2.4.5-hadoop2.7.2.jar \
Review comment:
my bad, missed it completely. Thank you for pointing this out.
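As an aside for readers of the quoted table: the rows for `carbon.streamer.record.key.field`, `carbon.streamer.source.ordering.field` and `carbon.streamer.upsert.deduplicate` describe keeping only the latest update per record key within an incoming batch. A minimal sketch of that idea in plain Python follows. It is not the tool's implementation (the tool runs on Spark); the function name `deduplicate_batch`, the sample field names `id` and `ts`, and the rule that a later record wins on equal ordering values are all assumptions made for illustration.

```python
# Hypothetical sketch of per-key deduplication inside one incoming batch.
def deduplicate_batch(records, key_field, ordering_field):
    """Keep, for each record key, the record with the largest ordering value."""
    latest = {}
    for record in records:
        key = record[key_field]
        current = latest.get(key)
        # Assumption: on equal ordering values the later record wins.
        if current is None or record[ordering_field] >= current[ordering_field]:
            latest[key] = record
    return list(latest.values())


batch = [
    {"id": 1, "ts": 10, "name": "a"},
    {"id": 2, "ts": 5, "name": "b"},
    {"id": 1, "ts": 12, "name": "c"},  # newer update for record key 1
]
deduplicated = deduplicate_batch(batch, key_field="id", ordering_field="ts")
```

Here `deduplicated` holds one record per key, with the `ts` 12 version surviving for key 1, which is why the table asks for a source ordering field whenever upsert deduplication is enabled.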
--
[GitHub] [carbondata] kunal642 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
kunal642 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-997661243
retest this please
--
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-997767753
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4429/
--
[GitHub] [carbondata] akashrn5 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
akashrn5 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-1001108555
retest this please
--
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-1000141602
Build Success with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/574/
--
[GitHub] [carbondata] akashrn5 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
akashrn5 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-1000637520
retest this please
--
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-1001382980
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4443/
--
[GitHub] [carbondata] ydvpankaj99 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
ydvpankaj99 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-1001951104
retest this please
--
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-994057697
Build Success with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/556/
--
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-991968793
Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4418/
--
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-991959164
Build Failed with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/552/
--
[GitHub] [carbondata] Indhumathi27 commented on a change in pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
Indhumathi27 commented on a change in pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#discussion_r767598589
##########
File path: docs/configuration-parameters.md
##########
@@ -179,6 +179,33 @@ This section provides the details of all the configurations required for the Car
| carbon.update.storage.level | MEMORY_AND_DISK | Storage level to persist dataset of a RDD/dataframe. Applicable when ***carbon.update.persist.enable*** is **true**, if user's executor has less memory, set this parameter to 'MEMORY_AND_DISK_SER' or other storage level to correspond to different environment. [See detail](http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence). |
| carbon.update.check.unique.value | true | By default this property is true, so update will validate key value mapping. This validation might have slight degrade in performance of update query. If user knows that key value mapping is correct, can disable this validation for better update performance by setting this property to false. |
+## Streamer tool Configuration
+| Parameter | Default Value | Description |
+|-----------------------------------|---------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+ | carbon.streamer.target.database | <none> | The database name where the target table is present to merge the incoming data. If not given by user, system will take the current database in the spark session. |
Review comment:
Looks like `<none>` is not displayed in the rendered document. Can you change it to (none)? Please handle this in the other places as well.
##########
File path: docs/configuration-parameters.md
##########
@@ -179,6 +179,33 @@ This section provides the details of all the configurations required for the Car
| carbon.update.storage.level | MEMORY_AND_DISK | Storage level to persist dataset of a RDD/dataframe. Applicable when ***carbon.update.persist.enable*** is **true**, if user's executor has less memory, set this parameter to 'MEMORY_AND_DISK_SER' or other storage level to correspond to different environment. [See detail](http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence). |
| carbon.update.check.unique.value | true | By default this property is true, so update will validate key value mapping. This validation might have slight degrade in performance of update query. If user knows that key value mapping is correct, can disable this validation for better update performance by setting this property to false. |
+## Streamer tool Configuration
+| Parameter | Default Value | Description |
+|-----------------------------------|---------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+ | carbon.streamer.target.database | <none> | The database name where the target table is present to merge the incoming data. If not given by user, system will take the current database in the spark session. |
+ | carbon.streamer.target.table | <none> | The target carbondata table where the data has to be merged. If this is not configured by user, the operation will fail. |
+ | carbon.streamer.source.type | kafka | Streamer tool currently supports 2 different types of data sources. One can ingest data from either kafka or DFS into target carbondata table using streamer tool. |
Review comment:
```suggestion
| carbon.streamer.source.type | kafka | Streamer tool currently supports two types of data sources. One can ingest data from either kafka or DFS into target carbondata table using streamer tool. |
```
--
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-998717489
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4434/
--
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-998513625
Build Failed with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/567/
--
[GitHub] [carbondata] pratyakshsharma commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-997375302
retest this please
--
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-997785562
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/6172/
--
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-997809907
Build Success with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/563/
--
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-1000107617
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4439/
--
[GitHub] [carbondata] ydvpankaj99 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
ydvpankaj99 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-1001870287
retest this please
--
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-1000374128
Build Success with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/575/
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-999908520
Build Success with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/573/
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-995186200
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/4423/
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-996591621
Build Failed with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/558/
[GitHub] [carbondata] pratyakshsharma commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-996572571
retest this please
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-997388197
Build Failed with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/561/
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-1001383575
Build Failed with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/578/
[GitHub] [carbondata] akashrn5 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
akashrn5 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-999575634
@pratyakshsharma Can you please fix the compilation issues?
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-999891555
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/6182/
[GitHub] [carbondata] akashrn5 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
akashrn5 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-1000269198
retest this please
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-1000356245
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/6184/
[GitHub] [carbondata] pratyakshsharma commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-999805561
@akashrn5 Fixed.
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4243: [CARBONDATA-4308]: added docs for streamer tool configs
Posted by GitBox <gi...@apache.org>.
CarbonDataQA2 commented on pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#issuecomment-991136565
Build Success with Spark 3.1, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_3.1/550/