Posted to user@gobblin.apache.org by "roman.tarasov@mgid.com" <ro...@mgid.com> on 2019/02/12 10:48:33 UTC

Gobblin Avro to Json convert

Hello,

I'm a Hadoop cluster admin at MGID.

We are trying to use Gobblin to ingest data from Kafka to HDFS.

We have 4 Kafka clusters (not Confluent, but we use the Confluent Schema 
Registry) and our applications write to Kafka in Avro.

Our problem is converting Avro to Parquet before writing to HDFS.
For this we use the converter chain:
converter.classes=org.apache.gobblin.converter.avro.AvroToJsonStringConverter,org.apache.gobblin.converter.json.JsonStringToJsonIntermediateConverter,org.apache.gobblin.converter.parquet.JsonIntermediateToParquetGroupConverter

But we get an error at job startup.
For testing, we use standalone mode.

java.lang.IllegalStateException: This is not a JSON Array.
    at com.google.gson.JsonElement.getAsJsonArray(JsonElement.java:106)
    at org.apache.gobblin.converter.json.JsonStringToJsonIntermediateConverter.convertSchema(JsonStringToJsonIntermediateConverter.java:71)
    at org.apache.gobblin.converter.json.JsonStringToJsonIntermediateConverter.convertSchema(JsonStringToJsonIntermediateConverter.java:48)
    at ...
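
For context, a sketch of why this error can appear: the stack trace shows convertSchema() calling Gson's getAsJsonArray(), so JsonStringToJsonIntermediateConverter expects the schema string it receives to be a JSON array of column definitions in Gobblin's JSON-intermediate format, whereas an Avro record schema (for example, one fetched from the Confluent Schema Registry) is a JSON object. Roughly, the converter wants something shaped like the first snippet below, not the second; the "id" and "name" fields are made up for illustration, and the exact intermediate-schema keys may vary by Gobblin version.

    [{"columnName": "id", "dataType": {"type": "long"}},
     {"columnName": "name", "dataType": {"type": "string"}}]

    {"type": "record", "name": "test_hdfs", "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"}]}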

Maybe you know how to solve this problem?

Our test config

kafka.brokers=kafka-node:9092
#kafka.schema.registry.class=org.apache.gobblin.source.extractor.extract.kafka.ConfluentKafkaSchemaRegistry
kafka.deserializer.type=CONFLUENT_AVRO
kafka.schema.registry.url=http://kafka-node:8081
source.class=org.apache.gobblin.source.extractor.extract.kafka.KafkaDeserializerSource
extract.namespace=org.apache.gobblin.extract.kafka
converter.classes="org.apache.gobblin.converter.json.JsonStringToJsonIntermediateConverter,org.apache.gobblin.converter.parquet.JsonIntermediateToParquetGroupConverter"
extract.namespace=org.apache.gobblin.extract.converter
writer.builder.class=org.apache.gobblin.writer.ParquetDataWriterBuilder
writer.destination.type=HDFS
writer.output.format=PARQUET
writer.file.path.type=tablename
topic.name=test_hdfs
topic.whitelist=test_hdfs
data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher

Regards,
Roman Tarasov!


Re: Gobblin Avro to Json convert

Posted by Hung Tran <hu...@linkedin.com>.
Hi Roman,


There is a log line log.info("Schema: " + inputSchema) in JsonStringToJsonIntermediateConverter. What does this print?

Also, the converter config you mentioned is not the same as the one you listed in the `Our test config` section. Which config are you using?
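
For reference, the two converter chains that appear in this thread differ only in whether the Avro-to-JSON step is at the front (both lines are copied from the messages above):

converter.classes=org.apache.gobblin.converter.avro.AvroToJsonStringConverter,org.apache.gobblin.converter.json.JsonStringToJsonIntermediateConverter,org.apache.gobblin.converter.parquet.JsonIntermediateToParquetGroupConverter

converter.classes="org.apache.gobblin.converter.json.JsonStringToJsonIntermediateConverter,org.apache.gobblin.converter.parquet.JsonIntermediateToParquetGroupConverter"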


Hung.

________________________________
From: roman.tarasov@mgid.com <ro...@mgid.com>
Sent: Tuesday, February 12, 2019 2:48:33 AM
To: user@gobblin.incubator.apache.org
Subject: Gobblin Avro to Json convert


Hello,


I'm a Hadoop cluster admin at MGID.


We are trying to use Gobblin to ingest data from Kafka to HDFS.

We have 4 Kafka clusters (not Confluent, but we use the Confluent Schema Registry) and our applications write to Kafka in Avro.

Our problem is converting Avro to Parquet before writing to HDFS.
For this we use the converter chain converter.classes=org.apache.gobblin.converter.avro.AvroToJsonStringConverter,org.apache.gobblin.converter.json.JsonStringToJsonIntermediateConverter,org.apache.gobblin.converter.parquet.JsonIntermediateToParquetGroupConverter

But we get an error at job startup.
For testing, we use standalone mode.

java.lang.IllegalStateException: This is not a JSON Array.
    at com.google.gson.JsonElement.getAsJsonArray(JsonElement.java:106)
    at org.apache.gobblin.converter.json.JsonStringToJsonIntermediateConverter.convertSchema(JsonStringToJsonIntermediateConverter.java:71)
    at org.apache.gobblin.converter.json.JsonStringToJsonIntermediateConverter.convertSchema(JsonStringToJsonIntermediateConverter.java:48)
    at ...

Maybe you know how to solve this problem?

Our test config


kafka.brokers=kafka-node:9092
#kafka.schema.registry.class=org.apache.gobblin.source.extractor.extract.kafka.ConfluentKafkaSchemaRegistry
kafka.deserializer.type=CONFLUENT_AVRO
kafka.schema.registry.url=http://kafka-node:8081
source.class=org.apache.gobblin.source.extractor.extract.kafka.KafkaDeserializerSource
extract.namespace=org.apache.gobblin.extract.kafka
converter.classes="org.apache.gobblin.converter.json.JsonStringToJsonIntermediateConverter,org.apache.gobblin.converter.parquet.JsonIntermediateToParquetGroupConverter"
extract.namespace=org.apache.gobblin.extract.converter
writer.builder.class=org.apache.gobblin.writer.ParquetDataWriterBuilder
writer.destination.type=HDFS
writer.output.format=PARQUET
writer.file.path.type=tablename
topic.name=test_hdfs
topic.whitelist=test_hdfs
data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher



Regards,
Roman Tarasov!