Posted to user@gobblin.apache.org by Rodrigo Nicolas Garcia <ro...@despegar.com> on 2019/02/08 14:53:17 UTC
Error with partitioned data on S3.
Hello,
I have a cron job that runs every hour. When I configure
TimeBasedAvroWriterPartitioner, I get problems on the second and third
runs.
I also have a cron job that runs every 15 minutes. When I use S3N, the
first run publishes data to this location:
s3bucket/error-event/event_date=2019-01-25/avrofiles
But the second run publishes data to the wrong location:
s3bucket/error-event/event_date=2019-01-25/event_date=2019-01-25/avrofiles
(instead of publishing to: event/event_date=2019-01-25/new_avrofiles)
The third run fails with an exception about the MD5 of a folder. That's
the same problem I had with the S3 driver. S3A doesn't work either.
If I don't partition the data, the job works fine.
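To make the symptom easier to spot in a listing of published paths, here is a minimal sketch that flags a partition directory repeated back to back, as in the second run above. It is not Gobblin code: the class name PartitionPathCheck and the method repeatedSegments are hypothetical, and treating any path segment containing "=" as a partition directory is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionPathCheck {

    /**
     * Returns every partition segment (assumed to look like name=value)
     * that is immediately preceded by an identical segment in the path.
     */
    static List<String> repeatedSegments(String path) {
        List<String> repeated = new ArrayList<>();
        String[] parts = path.split("/");
        for (int i = 1; i < parts.length; i++) {
            boolean looksLikePartition = parts[i].contains("=");
            if (looksLikePartition && parts[i].equals(parts[i - 1])) {
                repeated.add(parts[i]);
            }
        }
        return repeated;
    }

    public static void main(String[] args) {
        // Paths copied from the two runs described above.
        String firstRun = "s3bucket/error-event/event_date=2019-01-25/avrofiles";
        String secondRun =
            "s3bucket/error-event/event_date=2019-01-25/event_date=2019-01-25/avrofiles";

        System.out.println(repeatedSegments(firstRun));  // prints []
        System.out.println(repeatedSegments(secondRun)); // prints [event_date=2019-01-25]
    }
}
```

This only detects the nested directory after the fact; it does not explain why the publisher nests it.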
JobConfig:
job.name=KafkaErrorEventsToS3
job.group=KafkaToS3
job.description=Job to structure data from Kafka to S3 in Avro format with history store.
job.lock.enabled=true
job.schedule=0 0/60 * * * ?
mr.job.max.mappers=1
task.execution.synchronousExecutionModel=false
kafka.brokers=rc-aws-kafka-00:9092
kafka.deserializer.type=GSON
topic.whitelist=error-event
bootstrap.with.offset=earliest
source.timezone=UTC
source.class=org.apache.gobblin.source.extractor.extract.kafka.KafkaSimpleSource
extract.namespace=com.despegar.sem.extract.kafka
extract.limit.enabled=true
extract.limit.type=time
extract.limit.timeLimit=15
extract.limit.timeLimitTimeunit=minutes
converter.classes=com.despegar.sem.gobblin.converters.KafkaJsonMessageAvroConverter
avro.schema.literal={"namespace":"com.sem.tracker.avro","type":"record","name":"KafkaMessageAvro","fields":[{"name":"referrer","type":["string","null"]},{"name":"redirecturl","type":["string","null"]},{"name":"url","type":["string","null"]},{"name":"trackeame_user_id","type":["string","null"]},{"name":"click_id","type":["string","null"]},{"name":"ip","type":["string","null"]},{"name":"date_time","type":["string","null"]},{"name":"type","type":["string","null"]},{"name":"client_id","type":["string","null"]},{"name":"source","type":["string","null"]},{"name":"traffic","type":["string","null"]},{"name":"other_fields_json","type":["string","null"]},{"name":"event_date","type":["string","null"]}]}
writer.builder.class=org.apache.gobblin.writer.AvroDataWriterBuilder
writer.file.path.type=tablename
writer.destination.type=HDFS
writer.output.format=AVRO
writer.fs.uri=file:///
writer.codec.type=snappy
#unpartitioned works!
#writer.partitioner.class=com.despegar.sem.gobblin.partitioner.AvroDateTimePartitioner
#writer.partition.granularity=day
#writer.partition.pattern='event_date='YYYY-MM-dd
#writer.partition.timezone=UTC
fs.s3a.buffer.dir=${env:GOBBLIN_WORK_DIR}/s3a
fs.s3.buffer.dir=${env:GOBBLIN_WORK_DIR}/s3
fs.s3n.buffer.dir=${env:GOBBLIN_WORK_DIR}/s3n
data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher
data.publisher.overwrite.enabled=true
data.publisher.fs.uri=s3a://bucket
data.publisher.replace.final.dir=false
#partitioned on root bucket
#data.publisher.final.dir=/
data.publisher.final.dir=/unpartitioned/
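For reference, this is a rough sketch of the directory name I expect the commented-out partition settings (day granularity, pattern 'event_date='YYYY-MM-dd, timezone UTC) to produce for a record timestamp. It uses plain java.time and is not the partitioner's actual implementation; the class PartitionDirSketch and the method partitionDir are hypothetical names. Note that java.time reads YYYY as the week-based year, so the sketch uses yyyy instead; that substitution is my assumption about the intended meaning.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class PartitionDirSketch {

    /**
     * Formats a record timestamp (epoch milliseconds) into a partition
     * directory name using the given pattern and timezone.
     */
    static String partitionDir(long epochMillis, String pattern, String timezone) {
        DateTimeFormatter formatter =
            DateTimeFormatter.ofPattern(pattern).withZone(ZoneId.of(timezone));
        return formatter.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        // Example timestamp only; any instant on 2019-01-25 UTC gives the same directory.
        long timestamp = Instant.parse("2019-01-25T14:53:17Z").toEpochMilli();

        // The quoted part of the pattern is emitted literally.
        System.out.println(partitionDir(timestamp, "'event_date='yyyy-MM-dd", "UTC"));
        // prints event_date=2019-01-25
    }
}
```

With day granularity, every run on the same UTC day maps to the same directory, so a second run should add files next to the first run's files rather than create a nested event_date directory.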
Greetings.
--
*Ing. Rodrigo Nicolás García*
Technical Lead - Trackeame - Marketing Online Tools
Juana Manso N° 999/1069 - piso 2° - C.A.B.A.
(C1107CBS)