Posted to common-dev@hadoop.apache.org by Akira Ajisaka <aa...@apache.org> on 2021/08/12 07:48:07 UTC

Article: Cost-Efficient Open Source Big Data Platform at Uber

Hi folks,

I read Uber's article
https://eng.uber.com/cost-efficient-big-data-platform/. This article
is very interesting to me, and I have some questions about it.

> For example, we identified that the Capacity Scheduler has some complex logic that slows down task assignment. However, code changes to get rid of those won’t be able to merge into Apache Hadoop trunk, since those sophisticated features may be needed by other companies.

- What are those sophisticated features in the Capacity Scheduler?
- In the future, could we turn those features off with configuration flags in Apache Hadoop?
- Are there any other examples like this?

Thanks and regards,
Akira

---------------------------------------------------------------------
To unsubscribe, e-mail: common-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-dev-help@hadoop.apache.org

Re: Article: Cost-Efficient Open Source Big Data Platform at Uber

Posted by Akira Ajisaka <aa...@apache.org>.

Thank you for your comment, Peter.

> Since they didn't share profiling results or heat maps, we can only guess what part of Capacity Scheduler is deemed slow or a possible bottleneck.

Agreed. I would like to see the authors' profiling results or heat maps,
just out of curiosity.

-Akira

On Thu, Aug 19, 2021 at 12:05 AM Peter Bacsko <pb...@cloudera.com> wrote:
>
> Hi Akira,
>
> from the article, it's not clear to me what they mean by saying "sophisticated features". It is true that the container assignment code path is very complicated and understanding it takes quite a bit of time and effort. So in order to speed up container assignment in large clusters, it might be necessary to rewrite that, also losing certain features in the process - but what those might be is not elaborated. But they didn't take this path and instead opted for multiple Hadoop clusters.
>
> Since they didn't share profiling results or heat maps, we can only guess what part of Capacity Scheduler is deemed slow or a possible bottleneck.
>
> Peter


Re: Article: Cost-Efficient Open Source Big Data Platform at Uber

Posted by Peter Bacsko <pb...@cloudera.com.INVALID>.

Hi Akira,

From the article, it's not clear to me what they mean by "sophisticated
features". It is true that the container assignment code path is very
complicated, and understanding it takes quite a bit of time and effort. So to
speed up container assignment in large clusters, it might be necessary to
rewrite that path, losing certain features in the process, but the article
does not elaborate on which ones. In any case, they didn't take this path and
instead opted for multiple Hadoop clusters.

Since they didn't share profiling results or heat maps, we can only guess
what part of Capacity Scheduler is deemed slow or a possible bottleneck.

Peter
