Posted to dev@druid.apache.org by Will Lauer <wl...@verizonmedia.com.INVALID> on 2021/06/08 19:59:47 UTC

Re: [E] [DISCUSS] Hadoop 3, dropping support for Hadoop 2.x

Unfortunately, the migration to hadoop3 is a hard one (maybe not for
Druid, but certainly for big organizations running large hadoop2
workloads). If druid migrated to hadoop3 after 0.22, that would probably
prevent me from taking any new versions of Druid for at least the remainder
of the year, and possibly longer.

Will



Will Lauer

Senior Principal Architect, Audience & Advertising Reporting
Data Platforms & Systems Engineering

M 508 561 6427
1908 S. First St
Champaign, IL 61822




On Tue, Jun 8, 2021 at 3:08 AM Clint Wylie <cw...@apache.org> wrote:

> Hi all,
>
> I've been assisting with some experiments to see how we might want to
> migrate Druid to support Hadoop 3.x, and more importantly, see if maybe we
> can finally be free of some of the dependency issues it has been causing
> for as long as I can remember working with Druid.
>
> Hadoop 3 introduced shaded client jars,
> https://issues.apache.org/jira/browse/HADOOP-11804, with the purpose to
> allow applications to talk to the Hadoop cluster without drowning in its
> transitive dependencies. The experimental branch that I have been helping
> with, which is using these new shaded client jars, can be seen in this PR
> https://github.com/apache/druid/pull/11314, and is currently working with
> the HDFS integration tests as well as the Hadoop tutorial flow in the Druid
> docs (which is pretty much equivalent to the HDFS integration test).
>
> The cloud deep storages still need some further testing, and some minor
> cleanup still needs to be done for the docs and such. Additionally, we still
> need to figure out how to handle the Kerberos extension: because it extends
> some Hadoop classes, it isn't able to use the shaded client jars in a
> straightforward manner, so it still has heavy dependencies and hasn't
> been tested. However, the experiment has started to pan out enough that
> I think it is worth starting this discussion, because it does have some
> implications.
>
> I think making this change will allow us to update our dependencies with a
> lot more freedom (I'm looking at you, Guava), but the catch is that once we
> make this change and start updating these dependencies, it will become
> hard, nearing impossible, to support Hadoop 2.x, since as far as I know
> there isn't an equivalent set of shaded client jars. I am also not certain
> how far back the Hadoop job classpath isolation support goes
> (mapreduce.job.classloader = true), which I think is required to be set on
> Druid tasks for this shaded approach to work alongside updated Druid
> dependencies.
>
> Is anyone opposed to or worried about dropping Hadoop 2.x support after the
> Druid 0.22 release?
>
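
For context on the two mechanisms Clint describes above: HADOOP-11804
publishes the shaded clients as ordinary Maven artifacts, and the isolation
switch is a plain MapReduce job property that a Druid Hadoop ingestion spec
can pass through. A minimal sketch, assuming a Maven build and a Hadoop 3.3.x
line (version numbers are illustrative):

    <!-- compile against the shaded API jar only... -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client-api</artifactId>
      <version>3.3.0</version>
    </dependency>
    <!-- ...and pull the shaded implementation in at runtime -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client-runtime</artifactId>
      <version>3.3.0</version>
      <scope>runtime</scope>
    </dependency>

The classpath-isolation flag would then sit in the jobProperties of a Hadoop
task's tuningConfig, roughly:

    "tuningConfig": {
      "jobProperties": {
        "mapreduce.job.classloader": "true"
      }
    }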

Re: [E] [DISCUSS] Hadoop 3, dropping support for Hadoop 2.x for 24.0

Posted by Will Xu <wi...@imply.io>.
If there were a Spark ingestion option, would you be open to moving away from
hadoop, or are there other factors that might prevent a move?

Regards,
Will
Product @ Imply

On Mon, Aug 29, 2022 at 8:15 AM Will Lauer <wl...@yahooinc.com.invalid>
wrote:

> @Abhishek, I haven't spoken with our Hadoop team recently about Hadoop3
> stability, so I can't say for sure, but I understand the need to migrate
> and all the dependency headaches involved in NOT migrating. At this point,
> I expect druid moving to hadoop3 makes sense. I suspect that _we_ won't be
> ready to upgrade our clusters, which means we'll be stuck on an old druid
> release for a while.
>
> Will
>
> On Tue, Jul 26, 2022 at 5:20 AM Abhishek Agarwal <abhishek.agarwal@imply.io> wrote:
>
> > Reviving this conversation again.
> > @Will - Do you still have concerns about HDFS stability? Hadoop 3 has been
> > around for some time now and is very stable as far as I know.
> >
> > The dependencies coming from Hadoop 2 are also old enough that they cause
> > dependency scans to fail. E.g., Log4j 1.x dependencies coming from
> > Hadoop 2 get flagged during these scans. We have also seen issues when
> > customers try to use Hadoop ingestion with the latest log4j2 library.
> >
> > Exception in thread "main" java.lang.NoSuchMethodError:
> >     org.apache.log4j.helpers.OptionConverter.convertLevel(Ljava/lang/String;Lorg/apache/logging/log4j/Level;)Lorg/apache/logging/log4j/Level;
> >   at org.apache.log4j.config.PropertiesConfiguration.parseLogger(PropertiesConfiguration.java:393)
> >   at org.apache.log4j.config.PropertiesConfiguration.configureRoot(PropertiesConfiguration.java:326)
> >   at org.apache.log4j.config.PropertiesConfiguration.doConfigure(PropertiesConfiguration.java:303)
> >
> >
> > Instead of fixing these point issues, we would be better served by
> > moving to Hadoop 3 entirely. Hadoop 3 gets more frequent
> > releases, and its dependencies are well isolated.
> >
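
The NoSuchMethodError above is the classic symptom of mixing Hadoop 2's real
log4j 1.x jar with log4j2's log4j-1.2-api bridge: the bridge's
PropertiesConfiguration loads, but OptionConverter resolves to the old jar,
which has no convertLevel method. A hedged sketch of the usual workaround,
assuming a Maven build that pulls Hadoop 2 in via hadoop-client (artifact
names and versions vary by deployment):

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.8.5</version>
      <exclusions>
        <!-- keep the real log4j 1.x off the classpath so the
             log4j-1.2-api bridge can stand in for it -->
        <exclusion>
          <groupId>log4j</groupId>
          <artifactId>log4j</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.slf4j</groupId>
          <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
      </exclusions>
    </dependency>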
> > On Tue, Oct 12, 2021 at 12:05 PM Karan Kumar <ka...@gmail.com> wrote:
> >
> > > Hello
> > > We can also use maven profiles: keep hadoop2 support by default and add
> > > a new maven profile for hadoop3. This will allow the user to choose the
> > > profile best suited for their use case.
> > > Agreed, it will not help with the Hadoop dependency problems, but it does
> > > enable our users to use druid with multiple flavors.
> > > Also, with hadoop3, as Clint mentioned, the dependencies come pre-shaded,
> > > so we significantly reduce our effort in solving the dependency problems.
> > > I have the PR in its last phases, where I am able to run the entire test
> > > suite (unit + integration tests) on both the default (hadoop2) and the new
> > > hadoop3 profile.
> > >
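
A minimal sketch of the profile arrangement Karan describes, assuming the
Hadoop version is already factored into a Maven property (the profile ids,
property name, and versions here are illustrative, not necessarily those in
the PR):

    <profiles>
      <profile>
        <id>hadoop2</id>
        <activation>
          <activeByDefault>true</activeByDefault>
        </activation>
        <properties>
          <hadoop.compile.version>2.8.5</hadoop.compile.version>
        </properties>
      </profile>
      <profile>
        <id>hadoop3</id>
        <properties>
          <hadoop.compile.version>3.3.0</hadoop.compile.version>
        </properties>
      </profile>
    </profiles>

Builds would then select the flavor with something like: mvn -P hadoop3 package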
> > >
> > >
> > > On 2021/06/09 11:55:31, Will Lauer <wl...@verizonmedia.com.INVALID> wrote:
> > > > Clint,
> > > >
> > > > I fully understand what kind of headache dealing with these dependency
> > > > issues is. We deal with this all the time, and based on conversations I've
> > > > had with our internal hadoop development team, they are quite aware of them
> > > > and just as frustrated by them as you are. I'm certainly in favor of doing
> > > > something to improve this situation, as long as it doesn't abandon a large
> > > > section of the user base, which I think DROPPING hadoop2 would do.
> > > >
> > > > I think there are solutions that can help solve the conflicting
> > > > dependency problem. Refactoring Hadoop support into an independent
> > > > extension is certainly a start. But I think the dependency problem is
> > > > bigger than that. There are always going to be conflicts between
> > > > dependencies in the core system and in extensions as the system gets
> > > > bigger. We have one right now internally that prevents us from enabling SQL
> > > > in our instance of Druid, due to conflicts between the version of protobuf
> > > > used by Calcite and the one used by one of our critical extensions. Long
> > > > term, I think you are going to need to carefully think through a
> > > > ClassLoader-based strategy to truly separate the impact of various
> > > > dependencies.
> > > >
> > > > While I'm not seriously suggesting it for Druid, OSGi WOULD solve this
> > > > problem. It's a system that allows you to explicitly declare what each
> > > > bundle exposes to the system and what each bundle consumes from the
> > > > system, allowing multiple conflicting dependencies to co-exist without
> > > > impacting each other. OSGi is the big-hammer approach, but I bet a more
> > > > appropriate solution would be a simpler custom-ClassLoader-based solution
> > > > that hid all dependencies in extensions, keeping them from impacting the
> > > > core, and that only exposed "public" pieces of the core to extensions. If
> > > > Druid's core could be extended without impacting the various extensions,
> > > > and the extensions' dependencies could be modified without impacting the
> > > > core, this would go a long way towards solving the problem that you have
> > > > described.
> > > >
> > > > Will
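
For a flavor of what such a custom-ClassLoader strategy involves, here is a
minimal, hypothetical Java sketch of a child-first (parent-last) classloader,
not Druid's actual extension loader: it tries the extension's own jars before
delegating to the core classpath, so an extension's copy of a library wins
over the core's.

    import java.net.URL;
    import java.net.URLClassLoader;

    // Child-first classloader: classes in the extension's jars shadow
    // same-named classes on the core (parent) classpath.
    public class ChildFirstClassLoader extends URLClassLoader {
        public ChildFirstClassLoader(URL[] extensionJars, ClassLoader parent) {
            super(extensionJars, parent);
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve)
                throws ClassNotFoundException {
            synchronized (getClassLoadingLock(name)) {
                Class<?> c = findLoadedClass(name);
                if (c == null) {
                    try {
                        // Look in the extension's own jars first...
                        c = findClass(name);
                    } catch (ClassNotFoundException e) {
                        // ...then fall back to the normal parent-first path.
                        c = super.loadClass(name, resolve);
                    }
                }
                if (resolve) {
                    resolveClass(c);
                }
                return c;
            }
        }
    }

A production version would also force java.* (and any deliberately shared
"public" core packages) to always come from the parent, which is the part
Will's "exposed pieces of the core" point is really about.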
> > > >
> > > > On Wed, Jun 9, 2021 at 12:47 AM Clint Wylie <cw...@apache.org> wrote:
> > > >
> > > > > @itai, pending the outcome of this discussion, I think it makes sense
> > > > > to have a wider community thread to announce any decisions we make here;
> > > > > thanks for bringing that up.
> > > > >
> > > > > @rajiv, Minio support seems unrelated to this discussion. It seems like a
> > > > > reasonable request, but I recommend starting another thread to see if
> > > > > someone is interested in taking up this effort.
> > > > >
> > > > > @jihoon I definitely agree that Hadoop should be refactored to be an
> > > > > extension longer term. I don't think this upgrade would necessarily
> > > > > make doing such a refactor any easier, but not harder either. Just moving
> > > > > Hadoop to an extension unfortunately doesn't really do anything to
> > > > > help our dependency problem though, which is the thing that has agitated
> > > > > me enough to start this thread and start looking into solutions.
> > > > >
> > > > > @will/@frank I feel like the stranglehold Hadoop has on our dependencies
> > > > > has become especially painful in the last couple of
> > > > > years. Most painful to me is that we are stuck using a version of Apache
> > > > > Calcite from 2019 (six versions behind the latest), because newer versions
> > > > > require a newer version of Guava. This means we cannot get any bug fixes
> > > > > and improvements in our SQL parsing layer without doing something like
> > > > > packaging a shaded version of it ourselves or solving our Hadoop dependency
> > > > > problem.
> > > > >
> > > > > Many other dependencies have also proved problematic with Hadoop in
> > > > > the past, and since we aren't able to run the Hadoop integration tests in
> > > > > Travis, there is always the chance that sometimes we don't catch these when
> > > > > they go in. Now that we have turned on dependabot this week
> > > > > (https://github.com/apache/druid/pull/11079), I imagine we are going to
> > > > > have to proceed very carefully with it until we are able to resolve this
> > > > > dependency issue.
> > > > >
> > > > > Hadoop 3.3.0 is also the first release to support running on a Java
> > > > > version newer than Java 8, per
> > > > > https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions,
> > > > > which is another area we have been working towards - Druid officially
> > > > > supporting Java 11+ environments.
> > > > >
> > > > > I'm sort of at a loss for what else to do besides one of:
> > > > > - switching to these Hadoop 3 shaded jars and dropping 2.x support
> > > > > - figuring out how to custom-package our own Hadoop 2.x dependencies
> > > > >   that are shaded similarly to the Hadoop 3 client jars, and only
> > > > >   supporting Hadoop with application classpath isolation
> > > > >   (mapreduce.job.classloader = true)
> > > > > - just dropping support for Hadoop completely
> > > > >
> > > > > I would much rather devote all effort into making Druid's native batch
> > > > > ingestion better, to encourage people to migrate to that, than continue to
> > > > > fight with figuring out how to keep supporting Hadoop, so upgrading and
> > > > > switching to the shaded client jars at least seemed like a reasonable
> > > > > compromise to dropping it completely. Maybe making custom shaded Hadoop
> > > > > dependencies in the spirit of the Hadoop 3 shaded jars isn't as hard as I
> > > > > am imagining, but it does seem like the most work among the
> > > > > solutions I could think of to potentially resolve this problem.
> > > > >
> > > > > Does anyone have any other ideas for how we can isolate our dependencies
> > > > > from Hadoop? Solutions like shading Guava
> > > > > (https://github.com/apache/druid/pull/10964) would let Druid itself use
> > > > > newer Guava, but that doesn't help conflicts within our dependencies,
> > > > > which has always seemed to be the larger problem to me. Moving Hadoop
> > > > > support to an extension doesn't help anything unless we can ensure that we
> > > > > can run Druid ingestion tasks on Hadoop without having to match all of the
> > > > > Hadoop cluster's dependencies with some sort of classloader wizardry.
> > > > >
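
As a reference for the shading route Clint mentions, the usual mechanism is
the maven-shade-plugin's package relocation; a minimal sketch in the spirit
of that PR (the shaded package name here is illustrative):

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <relocations>
              <!-- rewrite Guava's packages (and all references to them)
                   so they cannot collide with Hadoop's copy -->
              <relocation>
                <pattern>com.google.common</pattern>
                <shadedPattern>org.apache.druid.shaded.com.google.common</shadedPattern>
              </relocation>
            </relocations>
          </configuration>
        </execution>
      </executions>
    </plugin>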
> > > > > Maybe we could consider keeping a 0.22.x release line in Druid that gets
> > > > > security and minor bug fixes for some period of time, to give people a
> > > > > longer window to migrate off of Hadoop 2.x? I can't speak for the rest of
> > > > > the committers, but I would personally be more open to maintaining such a
> > > > > branch if it meant that, moving forward, we could at least update all of
> > > > > our dependencies to newer versions, while providing a transition path that
> > > > > still offers some support until users migrate to Hadoop 3 or native Druid
> > > > > batch ingestion.
> > > > >
> > > > > Any other ideas?
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Jun 8, 2021 at 7:44 PM frank chen <fr...@apache.org> wrote:
> > > > >
> > > > > > Considering that Druid takes advantage of lots of external components
> > > > > > to work, I think we should upgrade Druid in a somewhat conservative way.
> > > > > > Dropping support for hadoop2 is not a good idea.
> > > > > > The upgrading of the ZooKeeper client in Druid also prevents me from
> > > > > > adopting 0.22 for a longer time.
> > > > > >
> > > > > > Although users could upgrade these dependencies first to use the latest
> > > > > > Druid releases, frankly speaking, these upgrades are not so easy in
> > > > > > production and usually take a longer time, which would prevent users from
> > > > > > experiencing new features of Druid.
> > > > > > For hadoop3, I have heard of some performance issues, which also leaves
> > > > > > me with no confidence to upgrade.
> > > > > >
> > > > > > I think what Jihoon proposes is a good idea: separating hadoop2 from
> > > > > > Druid core as an extension.
> > > > > > Since hadoop2 has not been EOL'd, to achieve a balance between
> > > > > > compatibility and long-term evolution, maybe we could provide two
> > > > > > extensions: one for hadoop2, one for hadoop3.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Jun 9, 2021 at 4:13 AM, Will Lauer <wl...@verizonmedia.com.invalid> wrote:
> > > > > >
> > > > > > > Just to follow up on this, our main problem with hadoop3 right now
> > > > > > > has been instability in HDFS, to the extent that we have put on hold
> > > > > > > any plans to deploy it to our production systems. I would claim
> > > > > > > Hadoop3 isn't mature enough yet to consider migrating Druid to it.
> > > > > > >
> > > > > > > Will


Re: [E] [DISCUSS] Hadoop 3, dropping support for Hadoop 2.x for 24.0

Posted by Paul Rogers <pa...@imply.io>.
Gian mentioned MSQ. The new MSQ work is exciting and powerful for Druid ingestion. If the data needs cleaning, we would expect users to employ something like Spark to do that task, then emit clean data to Kafka or files, which Druid MSQ can ingest. That is:

Dirty data —> Spark —> Kafka/Files —> Druid with MSQ

Spark is an industry-standard tool and has a wide set of data engineering features developed over many years. Spark is great at data conversion, data cleaning, “enrichment” (joins), etc. IMHO, there is no reason for Druid MSQ to duplicate these generic Spark features: MSQ is about loading clean data into Druid. For users familiar with Spark, Julian’s Spark connector avoids the multi-step path; Spark can do the work directly.
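
For readers unfamiliar with the MSQ side of that pipeline: a hedged sketch of
what ingesting the cleaned files could look like with SQL-based ingestion
(the table, bucket, and column names are invented for illustration):

    INSERT INTO clean_events
    SELECT
      TIME_PARSE(ts) AS __time,
      user_id,
      event_type
    FROM TABLE(
      EXTERN(
        '{"type": "s3", "uris": ["s3://example-bucket/clean/events.json"]}',
        '{"type": "json"}',
        '[{"name": "ts", "type": "string"},
          {"name": "user_id", "type": "string"},
          {"name": "event_type", "type": "string"}]'
      )
    )
    PARTITIONED BY DAY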

Looks like Spark still supports Hadoop 2. Since Spark has sorted out these issues, as Samarth suggested, perhaps Druid wouldn’t need to, if we had a Spark connector.

I recall that one discussion was whether the connector should be part of core Druid or some kind of extension. Though, we don’t have a “Druid marketplace” or similar solution to manage “third party” extensions. I’m not aware that such a feature is under discussion, so having the Spark connector in Druid itself may be the only short-term solution. Or, is there another option?

Julian,

On the PR thread, I mentioned the work that was done to allow “external tasks” such as Spark. That work is waiting for the “new IT” stuff to land so we can reasonably write integration tests. My sense is that the external task support will reduce some of the more fiddly bits of the Spark PR.

Maytas,

Thanks for offering to review Julian’s PR. We do need a committer to help push this PR over the line.

Thanks,

- Paul



> On Aug 8, 2022, at 9:13 PM, Gian Merlino <gi...@apache.org> wrote:
> 
> It's always good to deprecate things for some time prior to removing them,
> so we don't need to (nor should we) remove Hadoop 2 support right now. My
> vote is that in this upcoming release, we should deprecate it. The main
> problem in my eyes is the one Abhishek brought up: the dependency
> management situation with Hadoop 2 is really messy, and I'm not sure
> there's a good way to handle them given the limited classloader isolation.
> This situation becomes tougher to manage with each release, and we haven't
> had people volunteering to find and build comprehensive solutions. It is
> time to move on.
> 
> The concern Samarth raised, that people may end up stuck on older Druid
> versions because they aren't able to upgrade to Hadoop 3, is valid. I can
> see two good solutions to this. First: we can improve native ingest to the
> point where people feel broadly comfortable moving Hadoop 2 workloads to
> native. The work planned as part of doing ingest via multi-stage
> distributed query <https://github.com/apache/druid/issues/12262> is going
> to be useful here, by improving the speed and scalability of native ingest.
> Second: it would also be great to have something similar that runs on
> Spark, for people that have made investments in Spark. I suspect that most
> people that used Hadoop 2 have moved on to Hadoop 3 or Spark, so supporting
> both of those would ease a lot of the potential pain of dropping Hadoop 2
> support.
> 
> On Spark: I'm not familiar with the current state of the Spark work. Is it
> stuck? If so could something be done to unstick it? I agree with Abhishek
> that I wouldn't want to block moving off Hadoop 2 on this. However, it'd be
> great if we could get it done before actually removing Hadoop 2 support
> from the code base.
> 
> 
> On Wed, Aug 3, 2022 at 6:17 AM Abhishek Agarwal <ab...@imply.io> wrote:
> 
>> I was thinking that moving from Hadoop 2 to Hadoop 3 would be a
>> lower-resistance path than moving from Hadoop to Spark. Even if we get that
>> PR merged, it will take a good amount of time for the Spark integration to
>> reach the same level of maturity as Hadoop or native ingestion. BTW, I am
>> not making an argument against Spark integration; it will certainly be nice
>> to have Spark as an option. Just that Spark integration shouldn't become a
>> blocker for us to get off Hadoop.
>> 
>> BTW, are you using Hadoop 2 right now with the latest druid version? If so,
>> did you run into errors similar to the ones I posted in my last email?
>> 
>> On Wed, Jul 27, 2022 at 12:02 AM Samarth Jain <sa...@gmail.com> wrote:
>> 
>>> I am sure there are other companies out there who are still on Hadoop 2.x,
>>> with migration to Hadoop 3.x being a no-go.
>>> If Druid was to drop support for Hadoop 2.x completely, I am afraid it
>>> would prevent users from updating to newer versions of Druid, which would
>>> be a shame.
>>> 
>>> FWIW, we have found in practice, for high-volume use cases, that compaction
>>> based on Druid's Hadoop-based batch ingestion is a lot more scalable than
>>> the native compaction.
>>> 
>>> Having said that, as an alternative, if we can merge Julian's Spark-based
>>> ingestion PR <https://github.com/apache/druid/issues/9780> in Druid, that
>>> might provide an alternate way for users to get rid of the Hadoop
>>> dependency.
>>> 
>>> On Tue, Jul 26, 2022 at 3:19 AM Abhishek Agarwal <
>>> abhishek.agarwal@imply.io>
>>> wrote:
>>> 
>>>> Reviving this conversation again.
>>>> @Will - Do you still have concerns about HDFS stability? Hadoop 3 has
>>> been
>>>> around for some time now and is very stable as far as I know.
>>>> 
>>>> The dependencies coming from Hadoop 2 are also old enough that they
>> cause
>>>> dependency scans to fail. E.g. Log4j 1.x dependencies that are coming
>>> from
>>>> Hadoop 2, get flagged during these scans. We have also seen issues when
>>>> customers try to use Hadoop ingestion with the latest log4j2 library.
>>>> 
>>>> Exception in thread "main" java.lang.NoSuchMethodError:
>>>> 
>>>> 
>>> 
>> org.apache.log4j.helpers.OptionConverter.convertLevel(Ljava/lang/String;Lorg/apache/logging/log4j/Level;)Lorg/apache/logging/log4j/Level;
>>>> at
>>>> 
>>>> 
>>> 
>> org.apache.log4j.config.PropertiesConfiguration.parseLogger(PropertiesConfiguration.java:393)
>>>> at
>>>> 
>>>> 
>>> 
>> org.apache.log4j.config.PropertiesConfiguration.configureRoot(PropertiesConfiguration.java:326)
>>>> at
>>>> 
>>>> 
>>> 
>> org.apache.log4j.config.PropertiesConfiguration.doConfigure(PropertiesConfiguration.java:303)
>>>> 
>>>> 
>>>> Instead of fixing these point issues, we would be better served by
>>>> completely moving to Hadoop 3 entirely. Hadoop 3 does get more frequent
>>>> releases and dependencies are well isolated.
>>>> 
>>>> On Tue, Oct 12, 2021 at 12:05 PM Karan Kumar <karankumar1100@gmail.com
>>> 
>>>> wrote:
>>>> 
>>>>> Hello
>>>>> We can also use maven profiles. We keep hadoop2 support by default
>> and
>>>> add
>>>>> a new maven profile with hadoop3. This will allow the user to choose
>>> the
>>>>> profile which is best suited for the use case.
>>>>> Agreed, it will not help in the Hadoop dependency problems but does
>>>> enable
>>>>> our users to use druid with multiple flavors.
>>>>> Also with hadoop3, as clint mentioned, the dependencies come
>> pre-shaded
>>>> so
>>>>> we significantly reduce our effort in solving the dependency
>> problems.
>>>>> I have the PR in the last phases where I am able to run the entire
>> test
>>>>> suit unit + integration tests on both the default ie hadoop2 and the
>>> new
>>>>> hadoop3 profile.
>>>>> 
>>>>> 
>>>>> 
>>>>> On 2021/06/09 11:55:31, Will Lauer <wl...@verizonmedia.com.INVALID>
>>>>> wrote:
>>>>>> Clint,
>>>>>> 
>>>>>> I fully understand what type of headache dealing with these
>>> dependency
>>>>>> issues is. We deal with this all the time, and based on
>> conversations
>>>>> I've
>>>>>> had with our internal hadoop development team, they are quite aware
>>> of
>>>>> them
>>>>>> and just as frustrated by them as you are. I'm certainly in favor
>> of
>>>>> doing
>>>>>> something to improve this situation, as long as it doesn't abandon
>> a
>>>>> large
>>>>>> section of the user base, which I think DROPPING hadoop2 would do.
>>>>>> 
>>>>>> I think there are solutions there that can help solve the
>> conflicting
>>>>>> dependency problem. Refactoring Hadoop support into an independent
>>>>>> extension is certainly a start. But I think the dependency problem
>> is
>>>>>> bigger than that. There are always going to be conflicts between
>>>>>> dependencies in the core system and in extensions as the system
>> gets
>>>>>> bigger. We have one right now internally that prevents us from
>>> enabling
>>>>> SQL
>>>>>> in our instance of Druid due to conflicts between versions of
>>> protobuf
>>>>> used
>>>>>> by Calcite vs one of our critical extensions. Long term, I think
>> you
>>>> are
>>>>>> going to need to carefully think through a ClassLoader based
>> strategy
>>>> to
>>>>>> truly separate the impact of various dependencies.
>>>>>> 
>>>>>> While I'm not seriously suggesting it for Druid, OSGi WOULD solve this
>>>>>> problem. It's a system that allows you to explicitly declare what each
>>>>>> bundle exposes to the system and what each bundle consumes from the
>>>>>> system, allowing multiple conflicting dependencies to co-exist without
>>>>>> impacting each other. OSGi is the big-hammer approach, but I bet a
>>>>>> more appropriate solution would be a simpler custom-ClassLoader-based
>>>>>> solution that hid all dependencies in extensions, keeping them from
>>>>>> impacting the core, and that only exposed "public" pieces of the core
>>>>>> to extensions. If Druid's core could be extended without impacting the
>>>>>> various extensions, and the extensions' dependencies could be modified
>>>>>> without impacting the core, this would go a long way towards solving
>>>>>> the problem that you have described.
>>>>>> 
>>>>>> Will
>>>>>> 
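A minimal sketch of the custom-ClassLoader idea described above: a
child-first loader that resolves an extension's classes from its own jars,
except for a shared whitelist served by the core. Class and package names
here are invented for illustration; this is not Druid's actual extension
loader.

    import java.net.URL;
    import java.net.URLClassLoader;

    // Child-first classloader: an extension's own jars win over the core
    // classpath, except for "public" API packages that must stay shared.
    public class ExtensionClassLoader extends URLClassLoader {
      private static final String[] SHARED_PREFIXES = {"org.apache.druid.api."};

      public ExtensionClassLoader(URL[] extensionJars, ClassLoader coreLoader) {
        super(extensionJars, coreLoader);
      }

      @Override
      protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
          Class<?> c = findLoadedClass(name);
          if (c == null) {
            if (isShared(name)) {
              // shared API types always come from the core, so the core
              // and every extension agree on them
              c = getParent().loadClass(name);
            } else {
              try {
                // child-first: the extension's own dependency versions win
                c = findClass(name);
              } catch (ClassNotFoundException e) {
                // not bundled with the extension; fall back to the core
                c = getParent().loadClass(name);
              }
            }
          }
          if (resolve) {
            resolveClass(c);
          }
          return c;
        }
      }

      private static boolean isShared(String name) {
        for (String prefix : SHARED_PREFIXES) {
          if (name.startsWith(prefix)) {
            return true;
          }
        }
        return false;
      }
    }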
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Wed, Jun 9, 2021 at 12:47 AM Clint Wylie <cw...@apache.org> wrote:
>>>>>> 
>>>>>>> @itai, pending the outcome of this discussion, I think it makes sense
>>>>>>> to have a wider community thread to announce any decisions we make
>>>>>>> here. Thanks for bringing that up.
>>>>>>> 
>>>>>>> @rajiv, Minio support seems unrelated to this discussion. It seems
>>>>>>> like a reasonable request, but I recommend starting another thread to
>>>>>>> see if someone is interested in taking up this effort.
>>>>>>> 
>>>>>>> @jihoon, I definitely agree that Hadoop should be refactored into an
>>>>>>> extension longer term. I don't think this upgrade would necessarily
>>>>>>> make such a refactor any easier, but not harder either. Just moving
>>>>>>> Hadoop to an extension also unfortunately doesn't really do anything
>>>>>>> to help our dependency problem, which is the thing that has agitated
>>>>>>> me enough to start this thread and start looking into solutions.
>>>>>>> 
>>>>>>> @will/@frank, I feel like the stranglehold Hadoop has on our
>>>>>>> dependencies has become especially painful in the last couple of
>>>>>>> years. Most painful to me is that we are stuck using a version of
>>>>>>> Apache Calcite from 2019 (six versions behind the latest), because
>>>>>>> newer versions require a newer version of Guava. This means we cannot
>>>>>>> get any bug fixes and improvements in our SQL parsing layer without
>>>>>>> doing something like packaging a shaded version of it ourselves or
>>>>>>> solving our Hadoop dependency problem.
>>>>>>> 
>>>>>>> Many other dependencies have also proved problematic with Hadoop in
>>>>>>> the past, and since we aren't able to run the Hadoop integration
>>>>>>> tests in Travis, there is always the chance that we don't catch these
>>>>>>> when they go in. Now that we have turned on dependabot this week
>>>>>>> (https://github.com/apache/druid/pull/11079), I imagine we are going
>>>>>>> to have to proceed very carefully with it until we are able to
>>>>>>> resolve this dependency issue.
>>>>>>> 
>>>>>>> Hadoop 3.3.0 is also the first release to support running on a Java
>>>>>>> version newer than Java 8, per
>>>>>>> https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions,
>>>>>>> which is another area we have been working towards: Druid officially
>>>>>>> supporting Java 11+ environments.
>>>>>>> 
>>>>>>> I'm sort of at a loss for what else to do besides one of:
>>>>>>> - switching to these Hadoop 3 shaded jars and dropping 2.x support
>>>>>>> - figuring out how to custom-package our own Hadoop 2.x dependencies,
>>>>>>> shaded similarly to the Hadoop 3 client jars, and only supporting
>>>>>>> Hadoop with application classpath isolation
>>>>>>> (mapreduce.job.classloader = true)
>>>>>>> - just dropping support for Hadoop completely
>>>>>>> 
>>>>>>> I would much rather devote all effort to making Druid's native batch
>>>>>>> ingestion better, to encourage people to migrate to it, than continue
>>>>>>> fighting to figure out how to keep supporting Hadoop, so upgrading
>>>>>>> and switching to the shaded client jars at least seemed like a
>>>>>>> reasonable compromise compared to dropping it completely. Maybe
>>>>>>> making custom shaded Hadoop dependencies in the spirit of the Hadoop
>>>>>>> 3 shaded jars isn't as hard as I am imagining, but it does seem like
>>>>>>> the most work among the solutions I could think of to potentially
>>>>>>> resolve this problem.
>>>>>>> 
>>>>>>> Does anyone have any other ideas for how we can isolate our
>>>>>>> dependencies from Hadoop? Solutions like shading Guava
>>>>>>> (https://github.com/apache/druid/pull/10964) would let Druid itself
>>>>>>> use a newer Guava, but that doesn't help conflicts within our
>>>>>>> dependencies, which has always seemed like the larger problem to me.
>>>>>>> Moving Hadoop support to an extension doesn't help anything unless we
>>>>>>> can ensure that we can run Druid ingestion tasks on Hadoop without
>>>>>>> having to match all of the Hadoop cluster's dependencies with some
>>>>>>> sort of classloader wizardry.
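For reference, the classpath-isolation option listed above is typically
switched on per task through the jobProperties map in the Hadoop ingestion
spec's tuningConfig; a minimal fragment might look like the following
(a sketch only; all other tuningConfig fields omitted):

    "tuningConfig" : {
      "type" : "hadoop",
      "jobProperties" : {
        "mapreduce.job.classloader" : "true"
      }
    }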
>>>>>>> 
>>>>>>> Maybe we could consider keeping a 0.22.x release line in Druid that
>>>>>>> gets security and minor bug fixes for some period of time, to give
>>>>>>> people a longer window to migrate off of Hadoop 2.x? I can't speak
>>>>>>> for the rest of the committers, but I would personally be more open
>>>>>>> to maintaining such a branch if it meant that, moving forward, we
>>>>>>> could update all of our dependencies to newer versions, while
>>>>>>> providing a transition path with at least some support until users
>>>>>>> migrate to Hadoop 3 or native Druid batch ingestion.
>>>>>>> 
>>>>>>> Any other ideas?
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Jun 8, 2021 at 7:44 PM frank chen <fr...@apache.org>
>>>>> wrote:
>>>>>>> 
>>>>>>>> Considering Druid takes advantage of lots of external
>> components
>>> to
>>>>>>> work, I
>>>>>>>> think we should upgrade Druid in a little bit conservitive way.
>>>>> Dropping
>>>>>>>> support of hadoop2 is not a good idea.
>>>>>>>> The upgrading of the ZooKeeper client in Druid also prevents me
>>>> from
>>>>>>>> adopting 0.22 for a longer time.
>>>>>>>> 
>>>>>>>> Although users could upgrade these dependencies first to use
>> the
>>>>> latest
>>>>>>>> Druid releases, frankly speaking, these upgrades are not so
>> easy
>>> in
>>>>>>>> production and usually take longer time, which would prevent
>>> users
>>>>> from
>>>>>>>> experiencing new features of Druid.
>>>>>>>> For hadoop3, I have heard of some performance issues, which
>> also
>>>>> makes me
>>>>>>>> have no confidence to upgrade.
>>>>>>>> 
>>>>>>>> I think what Jihoon proposes is a good idea, separating hadoop2
>>>> from
>>>>>>> Druid
>>>>>>>> core as an extension.
>>>>>>>> Since hadoop2 has not been EOF, to achieve balance between
>>>>> compatibility
>>>>>>>> and long term evolution, maybe we could provide two extensions,
>>> one
>>>>> for
>>>>>>>> hadoop2, one for hadoop3.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Will Lauer <wl...@verizonmedia.com.invalid> wrote on Wed, Jun 9, 2021 at 4:13 AM:
>>>>>>>>
>>>>>>>>> Just to follow up on this, our main problem with hadoop3 right now
>>>>>>>>> has been instability in HDFS, to the extent that we have put on
>>>>>>>>> hold any plans to deploy it to our production systems. I would
>>>>>>>>> claim hadoop3 isn't mature enough yet to consider migrating Druid
>>>>>>>>> to it.
>>>>>>>>>
>>>>>>>>> Will
>>>>>>>> [Signature and quoted June 8, 2021 messages from Will Lauer and Clint Wylie trimmed; they appear in full at the start of this thread.]


Re: [E] [DISCUSS] Hadoop 3, dropping support for Hadoop 2.x for 24.0

Posted by Maytas Monsereenusorn <ma...@apache.org>.
Hi Julian,

Thank you so much for your contribution on Spark support. As an existing
committer, I would like to help get the Spark connector merged into OSS
(including PR reviews and any other development work that may be needed).
To keep this thread on topic (dropping support for Hadoop 2.x), we can move
the conversation regarding Spark support into a new thread or reuse the
GitHub issue that is already open.

Best Regards,
Maytas

On Sun, Aug 21, 2022 at 11:55 PM Julian Jaffe <ju...@gmail.com> wrote:

> [Quoted message and thread history trimmed; Julian's message appears in full in the next post, and the earlier history appears above.]

Re: [E] [DISCUSS] Hadoop 3, dropping support for Hadoop 2.x for 24.0

Posted by Julian Jaffe <ju...@gmail.com>.
For Spark support, the connector I wrote remains functional, but I haven't updated the PR for six months or so, since there didn't seem to be an appetite for review. If that's changing, I could migrate some more recent changes back to the OSS PR. Even with an up-to-date patch, though, I see two problems:

First, I remain worried that there isn't sufficient support among committers for the Spark connector. I don't want Druid to end up in the same place it is in now with Hadoop 2 support, where no one really maintains the code and we wind up with another awkward corner of the code base that holds back other development.

Secondly, the PR I have up is for Spark 2.4, which is now two years further out of date than it was back in 2020. Similarly to Hadoop, there is a bifurcation in the community: Spark 2.4 is still in heavy use, but we might be trading one problem for another if we deprecate Hadoop 2 in favor of Spark 2.4. I have written a Spark 3.2 connector as well, but it has been deployed to significantly smaller use cases than the 2.4 line.

Even with these two caveats, if there's a desire among the Druid development community to add Spark functionality and support it, I'd love to push this across the finish line.

> On Aug 9, 2022, at 1:04 AM, Abhishek Agarwal <ab...@imply.io> wrote:
> 
> Yes. We should deprecate it first, which is similar to dropping support
> (no more active development), but we will still ship it for a release or
> two. In a way, we are already in that mode to a certain extent. Many
> features are being built with native ingestion as a first-class citizen.
> E.g. range partitioning is still not supported on Hadoop ingestion. It's
> hard for developers to build and test their business logic for all the
> ingestion modes.
> 
> It will be good to hear what gaps the community sees between native
> ingestion and Hadoop-based batch ingestion, and then work toward fixing
> those gaps before dropping Hadoop ingestion entirely. For example, if
> users want the resource elasticity that a Hadoop cluster gives, we could
> push forward PRs such as https://github.com/apache/druid/pull/10910. It's
> not the same as a Hadoop cluster, but it will nonetheless let users reuse
> their existing infrastructure to run Druid jobs.
> 
>> On Tue, Aug 9, 2022 at 9:43 AM Gian Merlino <gi...@apache.org> wrote:
>> 
>> It's always good to deprecate things for some time prior to removing
>> them, so we don't need to (nor should we) remove Hadoop 2 support right
>> now. My vote is that in this upcoming release, we should deprecate it.
>> The main problem in my eyes is the one Abhishek brought up: the
>> dependency management situation with Hadoop 2 is really messy, and I'm
>> not sure there's a good way to handle it given the limited classloader
>> isolation. This situation becomes tougher to manage with each release,
>> and we haven't had people volunteering to find and build comprehensive
>> solutions. It is time to move on.
>> 
>> The concern Samarth raised, that people may end up stuck on older Druid
>> versions because they aren't able to upgrade to Hadoop 3, is valid. I
>> can see two good solutions to this. First: we can improve native ingest
>> to the point where people feel broadly comfortable moving Hadoop 2
>> workloads to native. The work planned as part of doing ingest via
>> multi-stage distributed query
>> <https://github.com/apache/druid/issues/12262> is going to be useful
>> here, by improving the speed and scalability of native ingest. Second:
>> it would also be great to have something similar that runs on Spark,
>> for people that have made investments in Spark. I suspect that most
>> people that used Hadoop 2 have moved on to Hadoop 3 or Spark, so
>> supporting both of those would ease a lot of the potential pain of
>> dropping Hadoop 2 support.
>> 
>> On Spark: I'm not familiar with the current state of the Spark work. Is
>> it stuck? If so, could something be done to unstick it? I agree with
>> Abhishek that I wouldn't want to block moving off Hadoop 2 on this.
>> However, it'd be great if we could get it done before actually removing
>> Hadoop 2 support from the code base.
>> 
>> 
>> On Wed, Aug 3, 2022 at 6:17 AM Abhishek Agarwal <abhishek.agarwal@imply.io> wrote:
>> 
>>> I was thinking that moving from Hadoop 2 to Hadoop 3 will be a
>>> lower-resistance path than moving from Hadoop to Spark. Even if we get
>>> that PR merged, it will take a good amount of time for Spark
>>> integration to reach the same level of maturity as Hadoop or native
>>> ingestion. BTW, I am not making an argument against Spark integration;
>>> it will certainly be nice to have Spark as an option. Just that Spark
>>> integration shouldn't become a blocker for us to get off Hadoop.
>>> 
>>> BTW, are you using Hadoop 2 right now with the latest Druid version?
>>> If so, did you run into errors similar to the ones I posted in my last
>>> email?
>>> 
>>> On Wed, Jul 27, 2022 at 12:02 AM Samarth Jain <sa...@gmail.com> wrote:
>>> 
>>>> I am sure there are other companies out there who are still on Hadoop
>>>> 2.x, with migration to Hadoop 3.x being a no-go.
>>>> If Druid were to drop support for Hadoop 2.x completely, I am afraid
>>>> it would prevent users from updating to newer versions of Druid,
>>>> which would be a shame.
>>>> 
>>>> FWIW, we have found in practice, for high-volume use cases, that
>>>> compaction based on Druid's Hadoop-based batch ingestion is a lot
>>>> more scalable than native compaction.
>>>> 
>>>> Having said that, as an alternative, if we can merge Julian's
>>>> Spark-based ingestion PRs
>>>> <https://github.com/apache/druid/issues/9780> in Druid, that might
>>>> provide an alternate way for users to get rid of the Hadoop
>>>> dependency.
>>>> 
>>>> On Tue, Jul 26, 2022 at 3:19 AM Abhishek Agarwal <abhishek.agarwal@imply.io> wrote:
>>>> 
>>>>> Reviving this conversation again.
>>>>> @Will - Do you still have concerns about HDFS stability? Hadoop 3 has
>>>>> been around for some time now and is very stable as far as I know.
>>>>> 
>>>>> The dependencies coming from Hadoop 2 are also old enough that they
>>>>> cause dependency scans to fail. E.g. Log4j 1.x dependencies coming
>>>>> from Hadoop 2 get flagged during these scans. We have also seen
>>>>> issues when customers try to use Hadoop ingestion with the latest
>>>>> log4j2 library:
>>>>> 
>>>>> Exception in thread "main" java.lang.NoSuchMethodError:
>>>>> org.apache.log4j.helpers.OptionConverter.convertLevel(Ljava/lang/String;Lorg/apache/logging/log4j/Level;)Lorg/apache/logging/log4j/Level;
>>>>>     at org.apache.log4j.config.PropertiesConfiguration.parseLogger(PropertiesConfiguration.java:393)
>>>>>     at org.apache.log4j.config.PropertiesConfiguration.configureRoot(PropertiesConfiguration.java:326)
>>>>>     at org.apache.log4j.config.PropertiesConfiguration.doConfigure(PropertiesConfiguration.java:303)
>>>>> 
>>>>> Instead of fixing these point issues, we would be better served by
>>>>> moving to Hadoop 3 entirely. Hadoop 3 gets more frequent releases,
>>>>> and its dependencies are well isolated.

Re: [E] [DISCUSS] Hadoop 3, dropping support for Hadoop 2.x for 24.0

Posted by Abhishek Agarwal <ab...@imply.io>.
Yes. We should deprecate it first, which is similar to dropping the support
(no more active development), but we will still ship it for a release or
two. In a way, we are already in that mode to a certain extent. Many
features are being built with native ingestion as a first-class citizen;
e.g., range partitioning is still not supported on Hadoop ingestion. It's
hard for developers to build and test their business logic across all the
ingestion modes.

It will be good to hear what gaps the community sees between native
ingestion and Hadoop-based batch ingestion, and then work toward fixing
those gaps before dropping Hadoop ingestion entirely. For example, if
users want the resource elasticity that a Hadoop cluster gives, we could
push forward PRs such as https://github.com/apache/druid/pull/10910. It's
not the same as a Hadoop cluster, but it will nonetheless let users reuse
their existing infrastructure to run Druid jobs.

> On Wed, Aug 3, 2022 at 6:17 AM Abhishek Agarwal <abhishek.agarwal@imply.io> wrote:
>
> > I was thinking that moving from Hadoop 2 to Hadoop 3 will be a
> > lower-resistance path than moving from Hadoop to Spark. Even if we get
> > that PR merged, it will take a good amount of time for the Spark
> > integration to reach the same level of maturity as Hadoop or native
> > ingestion. BTW, I am not making an argument against Spark integration;
> > it will certainly be nice to have Spark as an option. It's just that
> > the Spark integration shouldn't become a blocker for us to get off
> > Hadoop.
> >
> > BTW, are you using Hadoop 2 right now with the latest Druid version? If
> > so, did you run into errors similar to the ones I posted in my last
> > email?
> >
> > On Wed, Jul 27, 2022 at 12:02 AM Samarth Jain <sa...@gmail.com> wrote:
> >
> > > I am sure there are other companies out there who are still on Hadoop
> > > 2.x, with migration to Hadoop 3.x being a no-go. If Druid were to drop
> > > support for Hadoop 2.x completely, I am afraid it would prevent those
> > > users from updating to newer versions of Druid, which would be a shame.
> > >
> > > FWIW, we have found in practice for high-volume use cases that
> > > compaction based on Druid's Hadoop-based batch ingestion is a lot more
> > > scalable than the native compaction.
> > >
> > > Having said that, as an alternative, if we can merge Julian's
> > > Spark-based ingestion PR <https://github.com/apache/druid/issues/9780>
> > > in Druid, that might provide an alternate way for users to get rid of
> > > the Hadoop dependency.
> > >
> > > On Tue, Jul 26, 2022 at 3:19 AM Abhishek Agarwal <abhishek.agarwal@imply.io> wrote:
> > >
> > > > Reviving this conversation again.
> > > > @Will - Do you still have concerns about HDFS stability? Hadoop 3 has
> > > > been around for some time now and is very stable as far as I know.
> > > >
> > > > The dependencies coming from Hadoop 2 are also old enough that they
> > > > cause dependency scans to fail. E.g., Log4j 1.x dependencies coming
> > > > from Hadoop 2 get flagged during these scans. We have also seen issues
> > > > when customers try to use Hadoop ingestion with the latest log4j2
> > > > library.
> > > > Exception in thread "main" java.lang.NoSuchMethodError: org.apache.log4j.helpers.OptionConverter.convertLevel(Ljava/lang/String;Lorg/apache/logging/log4j/Level;)Lorg/apache/logging/log4j/Level;
> > > >     at org.apache.log4j.config.PropertiesConfiguration.parseLogger(PropertiesConfiguration.java:393)
> > > >     at org.apache.log4j.config.PropertiesConfiguration.configureRoot(PropertiesConfiguration.java:326)
> > > >     at org.apache.log4j.config.PropertiesConfiguration.doConfigure(PropertiesConfiguration.java:303)
> > > >
> > > >
> > > > Instead of fixing these point issues, we would be better served by
> > > > moving to Hadoop 3 entirely. Hadoop 3 gets more frequent releases,
> > > > and its dependencies are well isolated.
> > > > releases and dependencies are well isolated.
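The NoSuchMethodError above is a classic symptom of Hadoop 2's bundled
log4j 1.x jars shadowing the log4j-1.2-api bridge that log4j2 provides.
As an illustrative sketch only (this workaround is not something proposed
in the thread, and the coordinates shown are the usual log4j 1.x ones
rather than anything Druid-specific), the old jars can be excluded from
the Hadoop client dependency in a Maven build:

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
      <exclusions>
        <!-- Keep Hadoop's bundled log4j 1.x off the classpath so that the
             log4j-1.2-api bridge supplies the org.apache.log4j classes. -->
        <exclusion>
          <groupId>log4j</groupId>
          <artifactId>log4j</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.slf4j</groupId>
          <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

Whether this is safe depends on nothing else at runtime expecting the
real log4j 1.x implementation classes.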
> > > >
> > > > On Tue, Oct 12, 2021 at 12:05 PM Karan Kumar <karankumar1100@gmail.com> wrote:
> > > >
> > > > > Hello
> > > > > We can also use maven profiles: keep hadoop2 support by default and
> > > > > add a new maven profile for hadoop3. This will allow the user to
> > > > > choose the profile which is best suited for their use case.
> > > > > Agreed, it will not help with the Hadoop dependency problems, but it
> > > > > does enable our users to use Druid with multiple flavors.
> > > > > Also, with hadoop3, as Clint mentioned, the dependencies come
> > > > > pre-shaded, so we significantly reduce our effort in solving the
> > > > > dependency problems.
> > > > > I have the PR in its last phases, where I am able to run the entire
> > > > > test suite (unit + integration tests) on both the default (i.e.
> > > > > hadoop2) and the new hadoop3 profile.
> > > > >
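Karan's profile idea, sketched as a minimal illustrative pom.xml fragment
(the profile ids and version numbers here are placeholders, not taken
from the actual PR):

    <profiles>
      <!-- Default build stays on the Hadoop 2 line. -->
      <profile>
        <id>hadoop2</id>
        <activation>
          <activeByDefault>true</activeByDefault>
        </activation>
        <properties>
          <hadoop.version>2.8.5</hadoop.version>
        </properties>
      </profile>
      <!-- Opt-in build against the Hadoop 3 line, where the shaded client
           jars (hadoop-client-api / hadoop-client-runtime) are available. -->
      <profile>
        <id>hadoop3</id>
        <properties>
          <hadoop.version>3.3.1</hadoop.version>
        </properties>
      </profile>
    </profiles>

A user would then pick a flavor at build time, e.g. mvn clean install -P hadoop3.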
> > > > >
> > > > >
> > > > > On 2021/06/09 11:55:31, Will Lauer <wlauer@verizonmedia.com.INVALID> wrote:
> > > > > > Clint,
> > > > > >
> > > > > > I fully understand what type of headache dealing with these
> > > > > > dependency issues is. We deal with this all the time, and based on
> > > > > > conversations I've had with our internal hadoop development team,
> > > > > > they are quite aware of them and just as frustrated by them as you
> > > > > > are. I'm certainly in favor of doing something to improve this
> > > > > > situation, as long as it doesn't abandon a large section of the
> > > > > > user base, which I think DROPPING hadoop2 would do.
> > > > > >
> > > > > > I think there are solutions that can help solve the conflicting
> > > > > > dependency problem. Refactoring Hadoop support into an independent
> > > > > > extension is certainly a start. But I think the dependency problem
> > > > > > is bigger than that. There are always going to be conflicts between
> > > > > > dependencies in the core system and in extensions as the system
> > > > > > gets bigger. We have one right now internally that prevents us from
> > > > > > enabling SQL in our instance of Druid, due to conflicts between the
> > > > > > versions of protobuf used by Calcite vs. one of our critical
> > > > > > extensions. Long term, I think you are going to need to carefully
> > > > > > think through a ClassLoader-based strategy to truly separate the
> > > > > > impact of various dependencies.
> > > > > >
> > > > > > While I'm not seriously suggesting it for Druid, OSGi WOULD solve
> > > > > > this problem. It's a system that allows you to explicitly declare
> > > > > > what each bundle exposes to the system, and what each bundle
> > > > > > consumes from the system, allowing multiple conflicting
> > > > > > dependencies to co-exist without impacting each other. OSGi is the
> > > > > > big-hammer approach, but I bet a more appropriate solution would be
> > > > > > a simpler custom-ClassLoader-based solution that hid all
> > > > > > dependencies in extensions, keeping them from impacting the core,
> > > > > > and that only exposed "public" pieces of the core to extensions. If
> > > > > > Druid's core could be extended without impacting the various
> > > > > > extensions, and the extensions' dependencies could be modified
> > > > > > without impacting the core, this would go a long way towards
> > > > > > solving the problem that you have described.
> > > > > >
> > > > > > Will
> > > > > >
> > > > > >
> > > > > > On Wed, Jun 9, 2021 at 12:47 AM Clint Wylie <cw...@apache.org> wrote:
> > > > > >
> > > > > > > @itai, I think pending the outcome of this discussion that it
> > > > > > > makes sense to have a wider community thread to announce any
> > > > > > > decisions we make here, thanks for bringing that up.
> > > > > > >
> > > > > > > @rajiv, Minio support seems unrelated to this discussion. It
> > > > > > > seems like a reasonable request, but I recommend starting another
> > > > > > > thread to see if someone is interested in taking up this effort.
> > > > > > >
> > > > > > > @jihoon I definitely agree that Hadoop should be refactored to be
> > > > > > > an extension longer term. I don't think this upgrade would
> > > > > > > necessarily make doing such a refactor any easier, but not harder
> > > > > > > either. Just moving Hadoop to an extension also unfortunately
> > > > > > > doesn't really do anything to help our dependency problem though,
> > > > > > > which is the thing that has agitated me enough to start this
> > > > > > > thread and start looking into solutions.
> > > > > > >
> > > > > > > @will/@frank I feel like the stranglehold Hadoop has on our
> > > > > > > dependencies has become especially painful in the last couple of
> > > > > > > years. Most painful to me is that we are stuck using a version of
> > > > > > > Apache Calcite from 2019 (six versions behind the latest),
> > > > > > > because newer versions require a newer version of Guava. This
> > > > > > > means we cannot get any bug fixes and improvements in our SQL
> > > > > > > parsing layer without doing something like packaging a shaded
> > > > > > > version of it ourselves or solving our Hadoop dependency problem.
> > > > > > >
> > > > > > > Many other dependencies have also proved problematic with Hadoop
> > > > > > > in the past, and since we aren't able to run the Hadoop
> > > > > > > integration tests in Travis, there is always the chance that we
> > > > > > > don't catch these when they go in. Now that we have turned on
> > > > > > > dependabot this week
> > > > > > > (https://github.com/apache/druid/pull/11079), I imagine we are
> > > > > > > going to have to proceed very carefully with it until we are able
> > > > > > > to resolve this dependency issue.
> > > > > > >
> > > > > > > Hadoop 3.3.0 is also the first to support running on a Java
> > > > > > > version newer than Java 8, per
> > > > > > > https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions,
> > > > > > > which is another area we have been working towards - Druid
> > > > > > > officially supporting Java 11+ environments.
> > > > > > >
> > > > > > > I'm sort of at a loss of what else to do besides one of:
> > > > > > > - switching to these Hadoop 3 shaded jars and dropping 2.x support
> > > > > > > - figuring out how to custom package our own Hadoop 2.x
> > > > > > > dependencies that are shaded similarly to the Hadoop 3 client
> > > > > > > jars, and only supporting Hadoop with application classpath
> > > > > > > isolation (mapreduce.job.classloader = true)
> > > > > > > - just dropping support for Hadoop completely
> > > > > > >
> > > > > > > I would much rather devote all effort into making Druid's native
> > > > > > > batch ingestion better, to encourage people to migrate to that,
> > > > > > > than continue fighting with figuring out how to keep supporting
> > > > > > > Hadoop, so upgrading and switching to the shaded client jars at
> > > > > > > least seemed like a reasonable compromise compared to dropping it
> > > > > > > completely. Maybe making custom shaded Hadoop dependencies in the
> > > > > > > spirit of the Hadoop 3 shaded jars isn't as hard as I am
> > > > > > > imagining, but it does seem like the most work among the
> > > > > > > solutions I could think of to potentially resolve this problem.
> > > > > > >
> > > > > > > Does anyone have any other ideas of how we can isolate our
> > > > > > > dependencies from Hadoop? Solutions like shading Guava
> > > > > > > (https://github.com/apache/druid/pull/10964) would let Druid
> > > > > > > itself use newer Guava, but that doesn't help conflicts within
> > > > > > > our dependencies, which has always seemed to be the larger
> > > > > > > problem to me. Moving Hadoop support to an extension doesn't help
> > > > > > > anything unless we can ensure that we can run Druid ingestion
> > > > > > > tasks on Hadoop without having to match all of the Hadoop
> > > > > > > cluster's dependencies with some sort of classloader wizardry.
> > > > > > >
> > > > > > > Maybe we could consider keeping a 0.22.x release line in Druid
> > > > > > > that gets security and minor bug fixes for some period of time,
> > > > > > > to give people a longer period to migrate off of Hadoop 2.x? I
> > > > > > > can't speak for the rest of the committers, but I would
> > > > > > > personally be more open to maintaining such a branch if it meant
> > > > > > > that moving forward we could at least update all of our
> > > > > > > dependencies to newer versions, while providing a transition path
> > > > > > > with at least some support until users migrate to Hadoop 3 or
> > > > > > > native Druid batch ingestion.
> > > > > > >
> > > > > > > Any other ideas?
> > > > > > >
> > > > > > >
> > > > > > >
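The classpath-isolation switch Clint mentions is a standard MapReduce
setting rather than anything Druid-specific; in Druid's Hadoop ingestion
it is typically passed through the task's jobProperties. As a purely
illustrative Hadoop configuration fragment:

    <property>
      <!-- Run the job's classes in an isolated classloader so its
           dependencies do not have to match the cluster's. -->
      <name>mapreduce.job.classloader</name>
      <value>true</value>
    </property>
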
> > > > > > On Tue, Jun 8, 2021 at 7:44 PM frank chen <frankchen@apache.org> wrote:
> > > > > >
> > > > > > > Considering Druid takes advantage of lots of external components
> > > > > > > to work, I think we should upgrade Druid in a somewhat
> > > > > > > conservative way. Dropping support of hadoop2 is not a good idea.
> > > > > > > The upgrade of the ZooKeeper client in Druid also prevents me
> > > > > > > from adopting 0.22 for a while.
> > > > > > >
> > > > > > > Although users could upgrade these dependencies first to use the
> > > > > > > latest Druid releases, frankly speaking, these upgrades are not
> > > > > > > so easy in production and usually take a longer time, which would
> > > > > > > prevent users from experiencing new features of Druid.
> > > > > > > For hadoop3, I have heard of some performance issues, which also
> > > > > > > leaves me with no confidence to upgrade.
> > > > > > >
> > > > > > > I think what Jihoon proposes is a good idea: separating hadoop2
> > > > > > > from Druid core as an extension.
> > > > > > > Since hadoop2 has not reached EOL, to achieve a balance between
> > > > > > > compatibility and long-term evolution, maybe we could provide two
> > > > > > > extensions: one for hadoop2, one for hadoop3.
> > > > > > > >
> > > > > > > >
> > > > > > > > Will Lauer <wl...@verizonmedia.com.invalid> wrote on Wed, Jun 9,
> > > > > > > > 2021 at 4:13 AM:
> > > > > > > >
> > > > > > > > > Just to follow up on this, our main problem with hadoop3
> > > > > > > > > right now has been instability in HDFS, to the extent that we
> > > > > > > > > put on hold any plans to deploy it to our production systems.
> > > > > > > > > I would claim Hadoop3 isn't mature enough yet to consider
> > > > > > > > > migrating Druid to it.
> > > > > > > > >
> > > > > > > > > Will

Re: [E] [DISCUSS] Hadoop 3, dropping support for Hadoop 2.x for 24.0

Posted by Gian Merlino <gi...@apache.org>.
It's always good to deprecate things for some time prior to removing them,
so we don't need to (nor should we) remove Hadoop 2 support right now. My
vote is that in this upcoming release, we should deprecate it. The main
problem in my eyes is the one Abhishek brought up: the dependency
management situation with Hadoop 2 is really messy, and I'm not sure
there's a good way to handle it given the limited classloader isolation.
This situation becomes tougher to manage with each release, and we haven't
had people volunteering to find and build comprehensive solutions. It is
time to move on.

The concern Samarth raised, that people may end up stuck on older Druid
versions because they aren't able to upgrade to Hadoop 3, is valid. I can
see two good solutions to this. First: we can improve native ingest to the
point where people feel broadly comfortable moving Hadoop 2 workloads to
native. The work planned as part of doing ingest via multi-stage
distributed query <https://github.com/apache/druid/issues/12262> is going
to be useful here, by improving the speed and scalability of native ingest.
Second: it would also be great to have something similar that runs on
Spark, for people that have made investments in Spark. I suspect that most
people that used Hadoop 2 have moved on to Hadoop 3 or Spark, so supporting
both of those would ease a lot of the potential pain of dropping Hadoop 2
support.

On Spark: I'm not familiar with the current state of the Spark work. Is it
stuck? If so could something be done to unstick it? I agree with Abhishek
that I wouldn't want to block moving off Hadoop 2 on this. However, it'd be
great if we could get it done before actually removing Hadoop 2 support
from the code base.


> > > > > > On Tue, Jun 8, 2021 at 7:44 PM frank chen <fr...@apache.org>
> > > > wrote:
> > > > > >
> > > > > > > Considering Druid takes advantage of lots of external
> components
> > to
> > > > > > work, I
> > > > > > > think we should upgrade Druid in a little bit conservitive way.
> > > > Dropping
> > > > > > > support of hadoop2 is not a good idea.
> > > > > > > The upgrading of the ZooKeeper client in Druid also prevents me
> > > from
> > > > > > > adopting 0.22 for a longer time.
> > > > > > >
> > > > > > > Although users could upgrade these dependencies first to use
> the
> > > > latest
> > > > > > > Druid releases, frankly speaking, these upgrades are not so
> easy
> > in
> > > > > > > production and usually take longer time, which would prevent
> > users
> > > > from
> > > > > > > experiencing new features of Druid.
> > > > > > > For hadoop3, I have heard of some performance issues, which
> also
> > > > makes me
> > > > > > > have no confidence to upgrade.
> > > > > > >
> > > > > > > I think what Jihoon proposes is a good idea, separating hadoop2
> > > from
> > > > > > Druid
> > > > > > > core as an extension.
> > > > > > > Since hadoop2 has not been EOF, to achieve balance between
> > > > compatibility
> > > > > > > and long term evolution, maybe we could provide two extensions,
> > one
> > > > for
> > > > > > > hadoop2, one for hadoop3.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Will Lauer <wl...@verizonmedia.com.invalid> 于2021年6月9日周三
> > > 上午4:13写道:
> > > > > > >
> > > > > > > > Just to follow up on this, our main problem with hadoop3
> right
> > > now
> > > > has
> > > > > > > been
> > > > > > > > instability in HDFS, to the extent that we put on hold any
> > plans
> > > to
> > > > > > > deploy
> > > > > > > > it to our production systems. I would claim Hadoop3 isn't
> > mature
> > > > enough
> > > > > > > yet
> > > > > > > > to consider migrating Druid to it.
> > > > > > > >
> > > > > > > > WIll
> > > > > > > >
> > > > > > > > <http://www.verizonmedia.com>
> > > > > > > >
> > > > > > > > Will Lauer
> > > > > > > >
> > > > > > > > Senior Principal Architect, Audience & Advertising Reporting
> > > > > > > > Data Platforms & Systems Engineering
> > > > > > > >
> > > > > > > > M 508 561 6427
> > > > > > > > 1908 S. First St
> > > > > > > > Champaign, IL 61822
> > > > > > > >
> > > > > > > > <
> > > > > >
> > > >
> > >
> >
> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.facebook.com_verizonmedia&d=DwIFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=ULseRJUsY5gTBgFA9-BUxg&m=6ZP1rygSgHS9fZ6sNwI10fe7Zr9_IIAxDoe_TVLHPjc&s=FZ4dYSh4h5dDUO8gMu1WnMJYULsDN4hZPNJUqDythiU&e=
> > > > > > >   <
> > > > > > >
> > > > > >
> > > >
> > >
> >
> https://urldefense.proofpoint.com/v2/url?u=http-3A__twitter.com_verizonmedia&d=DwIFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=ULseRJUsY5gTBgFA9-BUxg&m=6ZP1rygSgHS9fZ6sNwI10fe7Zr9_IIAxDoe_TVLHPjc&s=W_tqzh_jnVhXD_NXIsB8s-f7F_ZO1QCYPv3U1OyNJfs&e=
> > > > > > >
> > > > > > > > <
> > > > > >
> > > >
> > >
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.linkedin.com_company_verizon-2Dmedia_&d=DwIFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=ULseRJUsY5gTBgFA9-BUxg&m=6ZP1rygSgHS9fZ6sNwI10fe7Zr9_IIAxDoe_TVLHPjc&s=U6DtsEa4Fr2uBu39uaxBIK_th685qDrjPaO3kXZZ0d8&e=
> > > > > > >
> > > > > > > > <
> > > > > >
> > > >
> > >
> >
> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.instagram.com_verizonmedia&d=DwIFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=ULseRJUsY5gTBgFA9-BUxg&m=6ZP1rygSgHS9fZ6sNwI10fe7Zr9_IIAxDoe_TVLHPjc&s=gneN2k-ykLUBzoWtYZNsSZ9Bxki7XEvx2tliibfAXys&e=
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Tue, Jun 8, 2021 at 2:59 PM Will Lauer <
> > > wlauer@verizonmedia.com
> > > > >
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Unfortunately, the migration off of hadoop3 is a hard one
> > > (maybe
> > > > not
> > > > > > > for
> > > > > > > > > Druid, but certainly for big organizations running large
> > > hadoop2
> > > > > > > > > workloads). If druid migrated to hadoop3 after 0.22, that
> > would
> > > > > > > probably
> > > > > > > > > prevent me from taking any new versions of Druid for at
> least
> > > the
> > > > > > > > remainder
> > > > > > > > > of the year and possibly longer.
> > > > > > > > >
> > > > > > > > > Will
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > <http://www.verizonmedia.com>
> > > > > > > > >
> > > > > > > > > Will Lauer
> > > > > > > > >
> > > > > > > > > Senior Principal Architect, Audience & Advertising
> Reporting
> > > > > > > > > Data Platforms & Systems Engineering
> > > > > > > > >
> > > > > > > > > M 508 561 6427
> > > > > > > > > 1908 S. First St
> > > > > > > > > Champaign, IL 61822
> > > > > > > > >
> > > > > > > > > <
> > > > > >
> > > >
> > >
> >
> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.facebook.com_verizonmedia&d=DwIFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=ULseRJUsY5gTBgFA9-BUxg&m=6ZP1rygSgHS9fZ6sNwI10fe7Zr9_IIAxDoe_TVLHPjc&s=FZ4dYSh4h5dDUO8gMu1WnMJYULsDN4hZPNJUqDythiU&e=
> > > > > > >   <
> > > > > > > >
> > > > > >
> > > >
> > >
> >
> https://urldefense.proofpoint.com/v2/url?u=http-3A__twitter.com_verizonmedia&d=DwIFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=ULseRJUsY5gTBgFA9-BUxg&m=6ZP1rygSgHS9fZ6sNwI10fe7Zr9_IIAxDoe_TVLHPjc&s=W_tqzh_jnVhXD_NXIsB8s-f7F_ZO1QCYPv3U1OyNJfs&e=
> > > > > > >
> > > > > > > > >    <
> > > > > >
> > > >
> > >
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.linkedin.com_company_verizon-2Dmedia_&d=DwIFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=ULseRJUsY5gTBgFA9-BUxg&m=6ZP1rygSgHS9fZ6sNwI10fe7Zr9_IIAxDoe_TVLHPjc&s=U6DtsEa4Fr2uBu39uaxBIK_th685qDrjPaO3kXZZ0d8&e=
> > > > > > >
> > > > > > > > > <
> > > > > >
> > > >
> > >
> >
> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.instagram.com_verizonmedia&d=DwIFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=ULseRJUsY5gTBgFA9-BUxg&m=6ZP1rygSgHS9fZ6sNwI10fe7Zr9_IIAxDoe_TVLHPjc&s=gneN2k-ykLUBzoWtYZNsSZ9Bxki7XEvx2tliibfAXys&e=
> > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, Jun 8, 2021 at 3:08 AM Clint Wylie <
> > cwylie@apache.org>
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > >> Hi all,
> > > > > > > > >>
> > > > > > > > >> I've been assisting with some experiments to see how we
> > might
> > > > want
> > > > > > to
> > > > > > > > >> migrate Druid to support Hadoop 3.x, and more importantly,
> > see
> > > > if
> > > > > > > maybe
> > > > > > > > we
> > > > > > > > >> can finally be free of some of the dependency issues it
> has
> > > been
> > > > > > > causing
> > > > > > > > >> for as long as I can remember working with Druid.
> > > > > > > > >>
> > > > > > > > >> Hadoop 3 introduced shaded client jars,
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_HADOOP-2D11804&d=DwIBaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=ULseRJUsY5gTBgFA9-BUxg&m=FRw8adGvb_qAPLtFgQWNJywJiOgU8zgfkkXf_nokPKQ&s=rBnEOMf2IKDMeWUo4TZyqf5CzrnbiYTfZUkjHr8GOHo&e=
> > > > > > > > >> , with the purpose to
> > > > > > > > >> allow applications to talk to the Hadoop cluster without
> > > > drowning in
> > > > > > > its
> > > > > > > > >> transitive dependencies. The experimental branch that I
> have
> > > > been
> > > > > > > > helping
> > > > > > > > >> with, which is using these new shaded client jars, can be
> > seen
> > > > in
> > > > > > this
> > > > > > > > PR
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_druid_pull_11314&d=DwIBaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=ULseRJUsY5gTBgFA9-BUxg&m=FRw8adGvb_qAPLtFgQWNJywJiOgU8zgfkkXf_nokPKQ&s=424doHggbejAz5XswosgVkJK98VUBcUj0pD5bAcBjT0&e=
> > > > > > > > >> , and is currently working with
> > > > > > > > >> the HDFS integration tests as well as the Hadoop tutorial
> > flow
> > > > in
> > > > > > the
> > > > > > > > >> Druid
> > > > > > > > >> docs (which is pretty much equivalent to the HDFS
> > integration
> > > > test).
> > > > > > > > >>
> > > > > > > > >> The cloud deep storages still need some further testing
> and
> > > some
> > > > > > minor
> > > > > > > > >> cleanup still needs done for the docs and such.
> Additionally
> > > we
> > > > > > still
> > > > > > > > need
> > > > > > > > >> to figure out how to handle the Kerberos extension,
> because
> > it
> > > > > > extends
> > > > > > > > >> some
> > > > > > > > >> Hadoop classes so isn't able to use the shaded client jars
> > in
> > > a
> > > > > > > > >> straight-forward manner, and so still has heavy
> dependencies
> > > and
> > > > > > > hasn't
> > > > > > > > >> been tested. However, the experiment has started to pan
> out
> > > > enough
> > > > > > to
> > > > > > > > >> where
> > > > > > > > >> I think it is worth starting this discussion, because it
> > does
> > > > have
> > > > > > > some
> > > > > > > > >> implications.
> > > > > > > > >>
> > > > > > > > >> Making this change I think will allow us to update our
> > > > dependencies
> > > > > > > > with a
> > > > > > > > >> lot more freedom (I'm looking at you, Guava), but the
> catch
> > is
> > > > that
> > > > > > > once
> > > > > > > > >> we
> > > > > > > > >> make this change and start updating these dependencies, it
> > > will
> > > > > > become
> > > > > > > > >> hard, nearing impossible to support Hadoop 2.x, since as
> far
> > > as
> > > > I
> > > > > > know
> > > > > > > > >> there isn't an equivalent set of shaded client jars. I am
> > also
> > > > not
> > > > > > > > certain
> > > > > > > > >> how far back the Hadoop job classpath isolation stuff goes
> > > > > > > > >> (mapreduce.job.classloader = true) which I think is
> required
> > > to
> > > > be
> > > > > > set
> > > > > > > > on
> > > > > > > > >> Druid tasks for this shaded stuff to work alongside
> updated
> > > > Druid
> > > > > > > > >> dependencies.
> > > > > > > > >>
> > > > > > > > >> Is anyone opposed to or worried about dropping Hadoop 2.x
> > > > support
> > > > > > > after
> > > > > > > > >> the
> > > > > > > > >> Druid 0.22 release?
> > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: dev-unsubscribe@druid.apache.org
> > > > For additional commands, e-mail: dev-help@druid.apache.org
> > > >
> > > >
> > >
> >
>

Re: [E] [DISCUSS] Hadoop 3, dropping support for Hadoop 2.x for 24.0

Posted by Abhishek Agarwal <ab...@imply.io>.
I was thinking that moving from Hadoop 2 to Hadoop 3 would be a
lower-resistance path than moving from Hadoop to Spark. Even if we get that
PR merged, it will take a good amount of time for the Spark integration to
reach the same level of maturity as Hadoop or native ingestion. To be
clear, I am not making an argument against the Spark integration; it would
certainly be nice to have Spark as an option. I just don't want the Spark
integration to become a blocker for us getting off Hadoop.

By the way, are you using Hadoop 2 right now with the latest Druid version?
If so, did you run into errors similar to the ones I posted in my last
email?

Re: [E] [DISCUSS] Hadoop 3, dropping support for Hadoop 2.x for 24.0

Posted by Samarth Jain <sa...@gmail.com>.
I am sure there are other companies out there who are still on Hadoop 2.x
and for whom migration to Hadoop 3.x is a no-go.
If Druid were to drop support for Hadoop 2.x completely, I am afraid it
would prevent those users from updating to newer versions of Druid, which
would be a shame.

FWIW, we have found in practice that for high-volume use cases, compaction
based on Druid's Hadoop-based batch ingestion is a lot more scalable than
native compaction.
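
As a rough sketch of what that looks like: a Hadoop re-indexing task reads
existing segments back in through the "dataSource" inputSpec and writes
compacted ones. This is abbreviated and illustrative only - the datasource
name, interval, and sizing below are invented, a real spec would also carry
the original dimension and metric definitions, and exact field names can
vary across Druid versions:

{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "events",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2022-01-01/2022-02-01"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "dataSource",
        "ingestionSpec": {
          "dataSource": "events",
          "intervals": ["2022-01-01/2022-02-01"]
        }
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {"type": "hashed", "targetPartitionSize": 5000000},
      "jobProperties": {"mapreduce.job.classloader": "true"}
    }
  }
}

The jobProperties entry is the MapReduce classpath-isolation flag that has
come up elsewhere in this thread.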

Having said that, as an alternative, if we can merge Julian's Spark-based
ingestion PRs <https://github.com/apache/druid/issues/9780> in Druid, that
might provide an alternate way for users to get rid of the Hadoop
dependency.

Re: [E] [DISCUSS] Hadoop 3, dropping support for Hadoop 2.x for 24.0

Posted by Abhishek Agarwal <ab...@imply.io>.
Reviving this conversation again.
@Will - Do you still have concerns about HDFS stability? Hadoop 3 has been
around for some time now and is very stable as far as I know.

The dependencies coming from Hadoop 2 are also old enough that they cause
dependency scans to fail. For example, the Log4j 1.x artifacts pulled in by
Hadoop 2 get flagged during these scans. We have also seen issues when
customers try to use Hadoop ingestion with the latest Log4j 2 library:

Exception in thread "main" java.lang.NoSuchMethodError:
org.apache.log4j.helpers.OptionConverter.convertLevel(Ljava/lang/String;Lorg/apache/logging/log4j/Level;)Lorg/apache/logging/log4j/Level;
    at org.apache.log4j.config.PropertiesConfiguration.parseLogger(PropertiesConfiguration.java:393)
    at org.apache.log4j.config.PropertiesConfiguration.configureRoot(PropertiesConfiguration.java:326)
    at org.apache.log4j.config.PropertiesConfiguration.doConfigure(PropertiesConfiguration.java:303)


Instead of fixing these point issues one by one, we would be better served
by moving to Hadoop 3 entirely. Hadoop 3 gets more frequent releases, and
its dependencies are well isolated.
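
To make "point fix" concrete: today each affected build ends up carrying
exclusions along these lines (a sketch against typical Hadoop 2 client poms,
not a recommendation; exact coordinates vary by Hadoop distribution):

    <!-- Sketch: keep Hadoop 2's Log4j 1.x bindings off the classpath so they
         cannot collide with the Log4j 2 bridge (log4j-1.2-api). Coordinates
         are typical for Hadoop 2 client poms; verify against your own tree. -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.8.5</version>
      <exclusions>
        <exclusion>
          <groupId>log4j</groupId>
          <artifactId>log4j</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.slf4j</groupId>
          <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
      </exclusions>
    </dependency>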

Re: [E] [DISCUSS] Hadoop 3, dropping support for Hadoop 2.x

Posted by Karan Kumar <ka...@gmail.com>.
Hello
We can also use Maven profiles: keep hadoop2 support by default and add a
new Maven profile for hadoop3. This will allow users to choose the profile
best suited to their use case.
Agreed, it will not help with the Hadoop dependency problems, but it does
let our users run Druid with multiple Hadoop flavors.
Also, with hadoop3, as Clint mentioned, the dependencies come pre-shaded,
so we significantly reduce our effort in solving the dependency problems.
My PR is in its last phases; I am able to run the entire test suite (unit +
integration) on both the default (hadoop2) profile and the new hadoop3
profile.
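
Roughly, the profile wiring looks something like this (a sketch of the idea
only, not the exact PR; the hadoop.compile.version property name is
illustrative):

    <!-- Sketch: hadoop2 stays the default; a hadoop3 profile overrides the
         Hadoop version (the real PR would also swap in the shaded client
         artifacts). The property name is illustrative. -->
    <properties>
      <hadoop.compile.version>2.8.5</hadoop.compile.version>
    </properties>

    <profiles>
      <profile>
        <id>hadoop3</id>
        <properties>
          <hadoop.compile.version>3.3.1</hadoop.compile.version>
        </properties>
      </profile>
    </profiles>

A build then picks a flavor with e.g. mvn clean install -P hadoop3.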



Re: [E] [DISCUSS] Hadoop 3, dropping support for Hadoop 2.x

Posted by Will Lauer <wl...@verizonmedia.com.INVALID>.
Clint,

I fully understand what kind of headache dealing with these dependency
issues can be. We deal with this all the time, and based on conversations
I've had with our internal hadoop development team, they are quite aware of
these issues and just as frustrated by them as you are. I'm certainly in
favor of doing something to improve this situation, as long as it doesn't
abandon a large section of the user base, which I think DROPPING hadoop2
would do.

I think there are solutions that can help with the conflicting dependency
problem. Refactoring Hadoop support into an independent extension is
certainly a start, but I think the dependency problem is bigger than that.
There are always going to be conflicts between dependencies in the core
system and in extensions as the system gets bigger. We have one right now
internally that prevents us from enabling SQL in our instance of Druid, due
to a conflict between the protobuf version used by Calcite and the one used
by one of our critical extensions. Long term, I think you are going to need
to carefully think through a ClassLoader-based strategy to truly separate
the impact of the various dependencies.

While I'm not seriously suggesting it for Druid, OSGi WOULD solve this
problem. It's a system that allows you to explicitly declare what each
bundle exposes to the system and what each bundle consumes from it,
allowing multiple conflicting dependencies to co-exist without impacting
each other. OSGi is the big-hammer approach, but I bet a more appropriate
fit would be a simpler custom-ClassLoader-based solution that hid all of an
extension's dependencies, keeping them from impacting the core, and that
exposed only the "public" pieces of the core to extensions. If Druid's core
could be evolved without impacting the various extensions, and the
extensions' dependencies could be modified without impacting the core, this
would go a long way towards solving the problem that you have described.
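
As a strawman, the core of such a loader is small (sketch only; real code
would also need a parent-first whitelist for java.*, the extension API
itself, logging, and similar):

    // Strawman "extension-first" ClassLoader: classes resolve from the
    // extension's own jars before falling back to Druid core, so the two
    // sides can disagree about, say, Guava or protobuf versions.
    import java.net.URL;
    import java.net.URLClassLoader;

    public class ExtensionFirstClassLoader extends URLClassLoader {
        public ExtensionFirstClassLoader(URL[] extensionJars, ClassLoader druidCore) {
            super(extensionJars, druidCore);
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            synchronized (getClassLoadingLock(name)) {
                Class<?> c = findLoadedClass(name);
                if (c == null) {
                    try {
                        // Look in the extension's jars first, the opposite
                        // of the default parent-first delegation.
                        c = findClass(name);
                    } catch (ClassNotFoundException e) {
                        // Fall back to the core for anything the extension
                        // does not bundle itself.
                        c = getParent().loadClass(name);
                    }
                }
                if (resolve) {
                    resolveClass(c);
                }
                return c;
            }
        }
    }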

Will

Re: [E] [DISCUSS] Hadoop 3, dropping support for Hadoop 2.x

Posted by Clint Wylie <cw...@apache.org>.
@itai, I think that, pending the outcome of this discussion, it makes
sense to have a wider community thread to announce any decisions we make
here. Thanks for bringing that up.

@rajiv, Minio support seems unrelated to this discussion. It seems like a
reasonable request, but I recommend starting another thread to see if
someone is interested in taking up this effort.

@jihoon I definitely agree that Hadoop should be refactored into an
extension longer term. I don't think this upgrade would necessarily make
such a refactor any easier, but it wouldn't make it harder either.
Unfortunately, just moving Hadoop to an extension doesn't really do
anything to help our dependency problem, which is the thing that has
agitated me enough to start this thread and start looking into solutions.

@will/@frank I feel like the stranglehold Hadoop has on our dependencies
has become especially painful in the last couple of years. Most painful to
me is that we are stuck on a version of Apache Calcite from 2019 (six
versions behind the latest), because newer versions require a newer version
of Guava. This means we cannot get any bug fixes or improvements in our SQL
parsing layer without doing something like packaging a shaded version of it
ourselves, or solving our Hadoop dependency problem.

Many other dependencies have also proved problematic with Hadoop in the
past, and since we aren't able to run the Hadoop integration tests in
Travis, there is always the chance that we don't catch these conflicts when
they go in. Now that we have turned on dependabot this week,
https://github.com/apache/druid/pull/11079, I imagine we are going to have
to proceed very carefully with it until we are able to resolve this
dependency issue.

Hadoop 3.3.0 is also the first release to support running on a Java
version newer than Java 8, per
https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions,
which ties into another area we have been working towards: official Druid
support for Java 11+ environments.

I'm sort of at a loss for what else to do besides one of:
- switching to these Hadoop 3 shaded jars and dropping 2.x support
- figuring out how to custom package our own Hadoop 2.x dependencies that
are shaded similarly to the Hadoop 3 client jars, and only supporting
Hadoop with application classpath isolation (mapreduce.job.classloader =
true; see the spec sketch after this list)
- just dropping support for Hadoop completely
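
(For reference on the second option: the isolation piece is the job
property we already point people at for dependency clashes. In a Hadoop
ingestion spec it is roughly just the following, with the rest of the spec
elided; the companion mapreduce.job.classloader.system.classes property
tunes what stays parent-first.)

    "tuningConfig" : {
      "type" : "hadoop",
      "jobProperties" : {
        "mapreduce.job.classloader" : "true"
      }
    }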

I would much rather devote all effort to making Druid's native batch
ingestion better, to encourage people to migrate to that, than continue
fighting with how to keep supporting Hadoop, so upgrading and switching to
the shaded client jars at least seemed like a reasonable compromise short
of dropping Hadoop completely. Maybe making custom shaded Hadoop
dependencies in the spirit of the Hadoop 3 shaded jars isn't as hard as I
am imagining, but it does seem like the most work of the solutions I could
think of to potentially resolve this problem.

Does anyone have any other ideas for how we can isolate our dependencies
from Hadoop? Solutions like shading Guava,
https://github.com/apache/druid/pull/10964, would let Druid itself use a
newer Guava, but that doesn't help with conflicts among our other
dependencies, which have always seemed to be the larger problem to me.
Moving Hadoop support to an extension doesn't help anything unless we can
ensure that we can run Druid ingestion tasks on Hadoop without having to
match all of the Hadoop cluster's dependencies, short of some sort of
classloader wizardry.
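
(For anyone who hasn't looked at that PR, the shading approach boils down
to a relocation along these lines; this is a sketch of the
maven-shade-plugin idea, not the PR's exact configuration.)

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <executions>
        <execution>
          <phase>package</phase>
          <goals><goal>shade</goal></goals>
          <configuration>
            <relocations>
              <!-- Move our Guava into a private namespace so it cannot clash
                   with whatever Guava the Hadoop cluster provides. -->
              <relocation>
                <pattern>com.google.common</pattern>
                <shadedPattern>org.apache.druid.shaded.com.google.common</shadedPattern>
              </relocation>
            </relocations>
          </configuration>
        </execution>
      </executions>
    </plugin>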

Maybe we could consider keeping a 0.22.x release line in Druid that gets
security and minor bug fixes for some period of time to give people a
longer period to migrate off of Hadoop 2.x? I can't speak for the rest of
the committers, but I would personally be more open to maintaining such a
branch if it meant that moving forward at least we could update all of our
dependencies to newer versions, while providing a transition path to still
have at least some support until migrating to Hadoop 3 or native Druid
batch ingestion.

Any other ideas?



On Tue, Jun 8, 2021 at 7:44 PM frank chen <fr...@apache.org> wrote:

> Considering Druid takes advantage of lots of external components to work, I
> think we should upgrade Druid in a somewhat conservative way. Dropping
> support for hadoop2 is not a good idea.
> The upgrade of the ZooKeeper client in Druid also prevents me from
> adopting 0.22 for a longer time.
>
> Although users could upgrade these dependencies first to use the latest
> Druid releases, frankly speaking, these upgrades are not so easy in
> production and usually take a long time, which would prevent users from
> experiencing new features of Druid.
> As for hadoop3, I have heard of some performance issues, which also leaves
> me with no confidence to upgrade.
>
> I think what Jihoon proposes is a good idea: separating hadoop2 from Druid
> core as an extension.
> Since hadoop2 has not reached EOL, to achieve a balance between compatibility
> and long-term evolution, maybe we could provide two extensions, one for
> hadoop2 and one for hadoop3.
>
>
>
> > Will Lauer <wl...@verizonmedia.com.invalid> wrote on Wed, Jun 9, 2021 at 4:13 AM:
>
> > Just to follow up on this, our main problem with hadoop3 right now has
> been
> > instability in HDFS, to the extent that we put on hold any plans to
> deploy
> > it to our production systems. I would claim Hadoop3 isn't mature enough
> yet
> > to consider migrating Druid to it.
> >
> > > Will
> >
> > <http://www.verizonmedia.com>
> >
> > Will Lauer
> >
> > Senior Principal Architect, Audience & Advertising Reporting
> > Data Platforms & Systems Engineering
> >
> > M 508 561 6427
> > 1908 S. First St
> > Champaign, IL 61822
> >
> > <http://www.facebook.com/verizonmedia>   <
> http://twitter.com/verizonmedia>
> > <https://www.linkedin.com/company/verizon-media/>
> > <http://www.instagram.com/verizonmedia>
> >
> >
> >
> > On Tue, Jun 8, 2021 at 2:59 PM Will Lauer <wl...@verizonmedia.com>
> wrote:
> >
> > > Unfortunately, the migration off of hadoop3 is a hard one (maybe not
> for
> > > Druid, but certainly for big organizations running large hadoop2
> > > workloads). If druid migrated to hadoop3 after 0.22, that would
> probably
> > > prevent me from taking any new versions of Druid for at least the
> > remainder
> > > of the year and possibly longer.
> > >
> > > Will
> > >
> > >
> > > <http://www.verizonmedia.com>
> > >
> > > Will Lauer
> > >
> > > Senior Principal Architect, Audience & Advertising Reporting
> > > Data Platforms & Systems Engineering
> > >
> > > M 508 561 6427
> > > 1908 S. First St
> > > Champaign, IL 61822
> > >
> > > <http://www.facebook.com/verizonmedia>   <
> > http://twitter.com/verizonmedia>
> > >    <https://www.linkedin.com/company/verizon-media/>
> > > <http://www.instagram.com/verizonmedia>
> > >
> > >
> > >
> > > On Tue, Jun 8, 2021 at 3:08 AM Clint Wylie <cw...@apache.org> wrote:
> > >
> > >> Hi all,
> > >>
> > >> I've been assisting with some experiments to see how we might want to
> > >> migrate Druid to support Hadoop 3.x, and more importantly, see if
> maybe
> > we
> > >> can finally be free of some of the dependency issues it has been
> causing
> > >> for as long as I can remember working with Druid.
> > >>
> > >> Hadoop 3 introduced shaded client jars,
> > >>
> > >>
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_HADOOP-2D11804&d=DwIBaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=ULseRJUsY5gTBgFA9-BUxg&m=FRw8adGvb_qAPLtFgQWNJywJiOgU8zgfkkXf_nokPKQ&s=rBnEOMf2IKDMeWUo4TZyqf5CzrnbiYTfZUkjHr8GOHo&e=
> > >> , with the purpose to
> > >> allow applications to talk to the Hadoop cluster without drowning in
> its
> > >> transitive dependencies. The experimental branch that I have been
> > helping
> > >> with, which is using these new shaded client jars, can be seen in this
> > PR
> > >>
> > >>
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_druid_pull_11314&d=DwIBaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=ULseRJUsY5gTBgFA9-BUxg&m=FRw8adGvb_qAPLtFgQWNJywJiOgU8zgfkkXf_nokPKQ&s=424doHggbejAz5XswosgVkJK98VUBcUj0pD5bAcBjT0&e=
> > >> , and is currently working with
> > >> the HDFS integration tests as well as the Hadoop tutorial flow in the
> > >> Druid
> > >> docs (which is pretty much equivalent to the HDFS integration test).
> > >>
> > >> The cloud deep storages still need some further testing and some minor
> > >> cleanup still needs done for the docs and such. Additionally we still
> > need
> > >> to figure out how to handle the Kerberos extension, because it extends
> > >> some
> > >> Hadoop classes so isn't able to use the shaded client jars in a
> > >> straight-forward manner, and so still has heavy dependencies and
> hasn't
> > >> been tested. However, the experiment has started to pan out enough to
> > >> where
> > >> I think it is worth starting this discussion, because it does have
> some
> > >> implications.
> > >>
> > >> Making this change I think will allow us to update our dependencies
> > with a
> > >> lot more freedom (I'm looking at you, Guava), but the catch is that
> once
> > >> we
> > >> make this change and start updating these dependencies, it will become
> > >> hard, nearing impossible to support Hadoop 2.x, since as far as I know
> > >> there isn't an equivalent set of shaded client jars. I am also not
> > certain
> > >> how far back the Hadoop job classpath isolation stuff goes
> > >> (mapreduce.job.classloader = true) which I think is required to be set
> > on
> > >> Druid tasks for this shaded stuff to work alongside updated Druid
> > >> dependencies.
> > >>
> > >> Is anyone opposed to or worried about dropping Hadoop 2.x support
> after
> > >> the
> > >> Druid 0.22 release?
> > >>
> > >
> >
>

Re: [E] [DISCUSS] Hadoop 3, dropping support for Hadoop 2.x

Posted by frank chen <fr...@apache.org>.
Considering Druid depends on lots of external components to work, I think
we should upgrade Druid in a somewhat conservative way. Dropping support
for Hadoop 2 is not a good idea.
The upgrade of the ZooKeeper client in Druid also prevents me from adopting
0.22 for quite a while.

Although users could upgrade these dependencies first in order to use the
latest Druid releases, frankly speaking, these upgrades are not so easy in
production and usually take a long time, which would keep users from
experiencing new features of Druid.
As for Hadoop 3, I have heard of some performance issues, which also leaves
me with little confidence to upgrade.

I think what Jihoon proposes is a good idea: separating Hadoop 2 support
from the Druid core as an extension.
Since Hadoop 2 has not reached end of life, to strike a balance between
compatibility and long-term evolution, maybe we could provide two
extensions, one for Hadoop 2 and one for Hadoop 3.
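If we went that route, operators would select the appropriate client at
deployment time the same way other extensions are chosen today, via the
extension load list in common.runtime.properties. A minimal sketch,
assuming hypothetical extension names (druid-hadoop2-client and
druid-hadoop3-client are placeholders, not existing extensions):

    # load the Hadoop 3 client extension instead of the Hadoop 2 one
    druid.extensions.loadList=["druid-hdfs-storage", "druid-hadoop3-client"]

Swapping between the two would then be a one-line configuration change
rather than a rebuild of Druid.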




Re: [E] [DISCUSS] Hadoop 3, dropping support for Hadoop 2.x

Posted by Will Lauer <wl...@verizonmedia.com.INVALID>.
Just to follow up on this, our main problem with Hadoop 3 right now has been
instability in HDFS, to the extent that we have put on hold any plans to
deploy it to our production systems. I would claim Hadoop 3 isn't mature
enough yet to consider migrating Druid to it.

Will



