Posted to dev@gobblin.apache.org by Gurjinder Singh Rathore <gu...@gmail.com> on 2017/09/29 00:26:48 UTC

Problem scaling Extractors with data volume

Hi,

I'm using Gobblin (embedded) in my project to transfer loads of more than 10
million rows at a time (single table). But when I run the load, Gobblin
starts giving me errors, since it looks like the number of Extractor instances
grows almost linearly with the number of rows in my source table. This
results in too many connections being opened, and at some point no more
Extractor instances can be created because the DBMS has exhausted its
connection limit. This is becoming really painful.

I dug into the code and found that I could use the following settings to
limit the number of extractors created simultaneously:

extract.limit.enabled=true
extract.limit.type=pool
extract.limit.pool.size=10

However, TaskContext.java has this check:

        Limiter limiter = DefaultLimiterFactory.newLimiter(this.taskState);
        if (!(limiter instanceof NonRefillableLimiter)) {
          throw new IllegalArgumentException("The Limiter used with an Extractor should be an instance of "
              + NonRefillableLimiter.class.getSimpleName());
        }

This was the end of my hope. I absolutely need to use PoolBasedLimiter
which is not a NonRefillableLimiter. How can I get around this problem?

*A relevant thread I found on github*:
https://github.com/apache/incubator-gobblin/pull/132
I understand that the above-mentioned check was added since
the LimitingExtractorDecorator doesn't close the limiters. But why?

Regards,
Gurjinder

Re: Problem scaling Extractors with data volume

Posted by Zhixiong Chen <zh...@linkedin.com>.
Hi Gurjinder,


The `Limiter` used for an `Extractor` limits the number of records to be extracted. It doesn't limit the number of extractor instances.
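
To make the distinction concrete, here is a minimal sketch in plain Java. It is not Gobblin code: `CountLimiter` and `readWithLimit` are hypothetical stand-ins for a non-refillable, count-based limiter and a limiting decorator. It assumes one permit is taken per record and never handed back, so the limiter caps how many records a single extractor returns and has no say in how many extractors exist.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class RecordLimitSketch {

  /** Hypothetical stand-in for a non-refillable, count-based limiter. */
  public static final class CountLimiter {
    private long remaining;

    public CountLimiter(long permits) {
      this.remaining = permits;
    }

    /** Takes one permit; returns false once permits run out. Permits are never returned. */
    public boolean tryAcquire() {
      if (remaining <= 0) {
        return false;
      }
      remaining--;
      return true;
    }
  }

  /** Reads from ONE source until it is drained or the limiter refuses a permit. */
  public static <T> List<T> readWithLimit(Iterator<T> source, CountLimiter limiter) {
    List<T> out = new ArrayList<>();
    // Check the source first so no permit is spent when there is nothing left to read.
    while (source.hasNext() && limiter.tryAcquire()) {
      out.add(source.next());
    }
    return out;
  }

  public static void main(String[] args) {
    List<Integer> rows = Arrays.asList(1, 2, 3, 4, 5);
    // Five rows are available, but only three permits exist.
    System.out.println(readWithLimit(rows.iterator(), new CountLimiter(3))); // prints [1, 2, 3]
  }
}
```

A pool-style limiter that expects permits to be released again does not fit this per-record pattern, which is consistent with the check quoted from `TaskContext.java`.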


In our main task execution engine, an `Extractor` is created per `Task` and a `Task` is created per `WorkUnit`, meaning the number of extractor instances is bounded by the number of Tasks, which in turn is bounded by the number of WorkUnits.


There are a couple of configurations which can help, like `mr.job.max.mappers`. For a specific example, check `QueryBasedSource#getWorkUnits`: https://github.com/apache/incubator-gobblin/blob/312e768f564e7cb4619c7986cfdf9b0f828bbc7b/gobblin-core/src/main/java/org/apache/gobblin/source/extractor/extract/QueryBasedSource.java#L168
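
As an illustration of why the WorkUnit count, rather than the row count, should determine the number of extractors, here is a simplified sketch in plain Java. It is not Gobblin's partitioner: `partition` and `maxPartitions` are hypothetical names, and the even-width split is an assumption. Gobblin's configuration reference lists `source.max.number.of.partitions` for a similar purpose; please verify that key against your version.

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionSketch {

  /**
   * Splits the half-open range [low, high) into at most maxPartitions contiguous
   * sub-ranges of roughly equal width. Each element is {startInclusive, endExclusive}.
   * Hypothetical helper for illustration; overflow near Long.MAX_VALUE is not handled.
   */
  public static List<long[]> partition(long low, long high, int maxPartitions) {
    if (high <= low || maxPartitions < 1) {
      throw new IllegalArgumentException("need low < high and maxPartitions >= 1");
    }
    long span = high - low;
    long count = Math.min(span, maxPartitions);  // never more partitions than rows
    long width = (span + count - 1) / count;     // ceiling division
    List<long[]> ranges = new ArrayList<>();
    for (long start = low; start < high; start += width) {
      ranges.add(new long[] {start, Math.min(start + width, high)});
    }
    return ranges;
  }

  public static void main(String[] args) {
    // One work unit (and so one extractor) per range: the count stays at the cap
    // whether the table holds 10 million or 100 million rows.
    System.out.println(partition(0, 10_000_000L, 20).size());   // prints 20
    System.out.println(partition(0, 100_000_000L, 20).size());  // prints 20
  }
}
```

If extractor instances grow with the row count in your setup, the place to look would be how the source builds its work units, not the extract limiter.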


Zhixiong

________________________________
From: Abhishek Tiwari <ab...@apache.org>
Sent: Thursday, October 5, 2017 12:21:45 PM
To: dev@gobblin.incubator.apache.org; issac.buenrostro@gmail.com
Subject: Re: Problem scaling Extractors with data volume

Issac,

Do you have suggestions on this? I see you were engaged in the conversation on
the referenced PR.

Regards,
Abhishek


Re: Problem scaling Extractors with data volume

Posted by Abhishek Tiwari <ab...@apache.org>.
Issac,

Do you have suggestions on this? I see you were engaged in the conversation on
the referenced PR.

Regards,
Abhishek
