Posted to user@gobblin.apache.org by Rodrigo Nicolas Garcia <ro...@despegar.com> on 2019/02/08 14:53:17 UTC

Error with partitioned data on S3.

Hello,

I have a cron job that runs every hour. When I configure
TimeBasedAvroWriterPartitioner, I have problems on the second and third
runs:

I also have a cron job that runs every 15 minutes. With S3N, the first run
publishes data to this location:

s3bucket/error-event/event_date=2019-01-25/avrofiles

But on the second run, it publishes data to the wrong location:

s3bucket/error-event/event_date=2019-01-25/event_date=2019-01-25/avrofiles

(instead of publishing to: event/event_date=2019-01-25/new_avrofiles)
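
To make the expected layout concrete, here is a minimal Python sketch of how
the publish directory should come out on every run. It does not use Gobblin;
partition_dir and publish_dir are hypothetical helper names, and the pattern
is assumed to be the 'event_date='YYYY-MM-dd one from the job config further
down, evaluated in UTC.

```python
import posixpath
from datetime import datetime, timezone


def partition_dir(epoch_millis: int) -> str:
    """Format a record timestamp as a day-level partition name in UTC."""
    day = datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc)
    return day.strftime("event_date=%Y-%m-%d")


def publish_dir(base: str, topic: str, epoch_millis: int) -> str:
    """Join base, topic and partition; the partition appears exactly once."""
    return posixpath.join(base, topic, partition_dir(epoch_millis))


# Two runs on the same UTC day, with timestamps 15 minutes apart.
first_run = publish_dir("s3bucket", "error-event", 1548417600000)
second_run = publish_dir("s3bucket", "error-event", 1548418500000)
```

Both runs should resolve to the same directory, so the second run would only
add new files next to the existing ones.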

The third run fails with an exception about md5 on a folder. That's the
same problem I had with the S3 driver. S3A doesn't work either.

If I don't partition the data, the job works fine.
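
The doubled event_date=... directory resembles what a move does when the
target directory already exists: the source directory is placed inside the
target instead of being merged with it. The Python sketch below reproduces
that layout on a local temporary directory only. It is an assumption about
the symptom, not a confirmed description of Gobblin's publisher or of the
S3 drivers, and publish and list_files are hypothetical helpers.

```python
import os
import shutil
import tempfile

PARTITION = "event_date=2019-01-25"


def publish(staging_root: str, final_root: str, file_name: str) -> None:
    """Write one file into a staged partition directory, then move the
    directory to the final location the way `mv` would."""
    staged = os.path.join(staging_root, PARTITION)
    os.makedirs(staged)
    with open(os.path.join(staged, file_name), "w") as handle:
        handle.write("placeholder for avro data\n")
    # If final_root/PARTITION already exists, shutil.move places the staged
    # directory inside it instead of merging the two.
    shutil.move(staged, os.path.join(final_root, PARTITION))


def list_files(root: str) -> list:
    """Return all file paths under root, relative to root, sorted."""
    found = []
    for current, _dirs, files in os.walk(root):
        for name in files:
            rel = os.path.relpath(os.path.join(current, name), root)
            found.append(rel.replace(os.sep, "/"))
    return sorted(found)


work = tempfile.mkdtemp()
final_root = os.path.join(work, "final")
os.makedirs(final_root)

publish(os.path.join(work, "run1"), final_root, "first.avro")
publish(os.path.join(work, "run2"), final_root, "second.avro")

layout = list_files(final_root)
shutil.rmtree(work)
```

After the second call, second.avro sits under
event_date=2019-01-25/event_date=2019-01-25/, which matches the wrong
location shown above.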

JobConfig:

job.name=KafkaErrorEventsToS3

job.group=KafkaToS3

job.description=Job to structured data from Kafka to S3 format avro whit history store.

job.lock.enabled=true

job.schedule=0 0/60 * * * ?

mr.job.max.mappers=1

task.execution.synchronousExecutionModel=false

kafka.brokers=rc-aws-kafka-00:9092

kafka.deserializer.type=GSON

topic.whitelist=error-event

bootstrap.with.offset=earliest

source.timezone=UTC

source.class=org.apache.gobblin.source.extractor.extract.kafka.KafkaSimpleSource

extract.namespace=com.despegar.sem.extract.kafka

extract.limit.enabled=true

extract.limit.type=time

extract.limit.timeLimit=15

extract.limit.timeLimitTimeunit=minutes

converter.classes=com.despegar.sem.gobblin.converters.KafkaJsonMessageAvroConverter

avro.schema.literal={"namespace":"com.sem.tracker.avro","type":"record","name":"KafkaMessageAvro","fields":[{"name":"referrer","type":["string","null"]},{"name":"redirecturl","type":["string","null"]},{"name":"url","type":["string","null"]},{"name":"trackeame_user_id","type":["string","null"]},{"name":"click_id","type":["string","null"]},{"name":"ip","type":["string","null"]},{"name":"date_time","type":["string","null"]},{"name":"type","type":["string","null"]},{"name":"client_id","type":["string","null"]},{"name":"source","type":["string","null"]},{"name":"traffic","type":["string","null"]},{"name":"other_fields_json","type":["string","null"]},{"name":"event_date","type":["string","null"]}]}

writer.builder.class=org.apache.gobblin.writer.AvroDataWriterBuilder

writer.file.path.type=tablename

writer.destination.type=HDFS

writer.output.format=AVRO

writer.fs.uri=file:///

writer.codec.type=snappy

#unpartitioned works!

#writer.partitioner.class=com.despegar.sem.gobblin.partitioner.AvroDateTimePartitioner

#writer.partition.granularity=day

#writer.partition.pattern='event_date='YYYY-MM-dd

#writer.partition.timezone=UTC

fs.s3a.buffer.dir=${env:GOBBLIN_WORK_DIR}/s3a

fs.s3.buffer.dir=${env:GOBBLIN_WORK_DIR}/s3}

fs.s3n.buffer.dir=${env:GOBBLIN_WORK_DIR}/s3n

data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher

data.publisher.overwrite.enabled=true

data.publisher.fs.uri=s3a://bucket

data.publisher.replace.final.dir=false

#partitioned on root bucket

#data.publisher.final.dir=/

data.publisher.final.dir=/unpartitioned/

Greetings.

-- 

*Ing. Rodrigo Nicolás García*
Technical Lead - Trackeame - Marketing Online Tools
Juana Manso N° 999/1069 - piso 2° - C.A.B.A.
(C1107CBS)