Posted to user@gobblin.apache.org by Amith Prasanna <am...@sentienz.com> on 2019/07/23 18:42:56 UTC

Regarding scheduling job in distributed mode

Hi all,

I am working on scheduling a job that pulls records from Kafka to HDFS.
It works fine in standalone mode, but when I try MapReduce mode using the
gobblin-mapreduce.sh script I get class-not-found and method-not-found
errors. I'm using hadoop-2.8.1 and gobblin-0.13. I have also learned that
gobblin-mapreduce.sh triggers the job only once. Can anyone explain what
changes are needed to schedule the job in MapReduce or YARN mode?

This is what the job configuration (.pull) file looks like:

job.name=KafkatoHdfsJob1
job.group=GobblinKafka
job.description=Gobblin quick start job for Kafka
job.lock.enabled=false
kafka.brokers=<host>:9092
job.schedule=0/20 * * * * ?
topic.whitelist=test
source.class=org.apache.gobblin.source.extractor.extract.kafka.KafkaSimpleSource
extract.namespace=org.apache.gobblin.extract.kafka
writer.builder.class=org.apache.gobblin.writer.SimpleDataWriterBuilder
writer.file.path.type=tablename
writer.destination.type=HDFS
writer.output.format=json
simple.writer.delimiter=\n
data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher
launcher.type=MAPREDUCE
mr.job.max.mappers=1
mr.include.task.counters=100
mr.job.root.dir=/tmp/gobblin/mr-job
metrics.reporting.file.enabled=true
metrics.log.dir=/data/temp/gobblin-kafka/metrics
metrics.reporting.file.suffix=txt
bootstrap.with.offset=earliest
fs.uri=hdfs://<host>:8020/
writer.fs.uri=hdfs://<host>:8020/
state.store.fs.uri=hdfs://<host>:8020/
mr.job.root.dir=/data/temp/gobblin-kafka/working
writer.staging.dir=/data/temp/gobblin-kafka/writer-staging
writer.output.dir=/data/temp/gobblin-kafka/writer-output
state.store.dir=/data/temp/gobblin-kafka/state-store
task.data.root.dir=/data/temp/jobs/kafkaetl/gobblin/gobblin-kafka/task-data
data.publisher.final.dir=/data/temp/test
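
A note on the schedule: job.schedule is a Quartz cron expression with a
leading seconds field, so 0/20 * * * * ? asks the job to fire every 20
seconds. Purely as an illustration, an every-5-minutes schedule would
look like this:

job.schedule=0 0/5 * * * ?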


Regards,
Amith

Re: Regarding scheduling job in distributed mode

Posted by Shirshanka Das <sh...@apache.org>.
Hi Amith,
  Check out the Azkaban section of this guide:
https://gobblin.readthedocs.io/en/latest/user-guide/Gobblin-Schedulers/
  Let us know if that is not working for you.
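
For a concrete starting point, the simplest Azkaban setup is a job of
type "command" that just invokes the Gobblin launcher script. A minimal
sketch follows; the paths are placeholders, and the launcher flags should
be checked against your Gobblin version:

# kafka-to-hdfs.job -- minimal Azkaban job definition (paths assumed)
type=command
command=/opt/gobblin/bin/gobblin-mapreduce.sh --conf /opt/gobblin/job-conf/KafkatoHdfsJob1.pull

The guide linked above also describes a tighter integration through
Gobblin's AzkabanJobLauncher, which lets Azkaban own the scheduling.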





Re: Regarding scheduling job in distributed mode

Posted by Amith Prasanna <am...@sentienz.com>.
Hi,

Sorry for the confusion.
I am not scheduling the gobblin-mapreduce.sh script itself. I am trying
to run, as a distributed job, the job that is scheduled correctly in
standalone mode (based on the cron expression in the .pull file). I was
able to run it as a MapReduce job through gobblin-mapreduce.sh, but it
was triggered only once even though a cron expression is specified in
the .pull file. I also tried running in standalone mode with
launcher.type set to MAPREDUCE, but it ran in local mode only.
While running as a YARN service through gobblin-yarn.sh, the containers
are continuously stopped and recreated with the error "Container exited
with a non-zero exit code 1", and no data is pulled.
Does the Quartz scheduler not work in MapReduce mode?
If there is a way to schedule using the Azkaban scheduler, what steps do
I need to follow? (I have an azkaban-solo-server setup.)
What might be the reason for the YARN container failures?
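
For what it's worth, the generic exit-code-1 message usually means the
real exception is only visible in the container logs. Assuming YARN log
aggregation is enabled, they can be pulled with the standard command:

yarn logs -applicationId <application_id>

where <application_id> is the id of the failed Gobblin YARN application.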



Re: Regarding scheduling job in distributed mode

Posted by Shirshanka Das <sh...@apache.org>.
Cron should be good enough for scheduling this.
Essentially you're just kicking off a shell script on a schedule.

Is this not working for you?
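
For example, a crontab entry along these lines (paths are placeholders;
note that cron has only minute resolution, so the 20-second Quartz
schedule in the .pull file would need to be relaxed):

# run the Gobblin MapReduce launcher every 5 minutes
*/5 * * * * /opt/gobblin/bin/gobblin-mapreduce.sh --conf /opt/gobblin/job-conf/KafkatoHdfsJob1.pull >> /var/log/gobblin/kafka-to-hdfs.log 2>&1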



Re: Regarding scheduling job in distributed mode

Posted by Amith Prasanna <am...@sentienz.com>.
Hi,

I am using the cron scheduler. Is it the case that I cannot schedule a
MapReduce job using the cron scheduler?
I was able to run the job using gobblin-mapreduce.sh after changing
mapreduce.framework.name from tez to yarn in the Hadoop configuration,
but I couldn't schedule it.
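
For anyone following along, that change is the standard Hadoop setting in
mapred-site.xml:

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>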


Re: Regarding scheduling job in distributed mode

Posted by Shirshanka Das <sh...@apache.org>.
Hi,
  Are you following the instructions at
https://gobblin.readthedocs.io/en/latest/user-guide/Gobblin-Schedulers/
  Which scheduler are you using to launch the shell script?



