Posted to dev@spark.apache.org by Konstantin Boudnik <co...@apache.org> on 2013/08/03 08:57:50 UTC

Re: Looking into Maven build for spark

[Bcc: spark-developers@googlegroups.com]

Guys, just wanted to close the loop on this.

I have committed the packaging support for Spark (BIGTOP-715) into Bigtop
master. The packaging is built on top of the Maven assembly and provides
standard Linux services to control master and worker daemons.

Shark is next in the pipeline ;)
  Cos

On Tue, Jul 09, 2013 at 08:01PM, Konstantin Boudnik wrote:
> Any chance someone can review the 
>     https://github.com/mesos/spark/pull/675
> 
> Appreciate how busy you are, though
>   Cos
> 
> On Thu, Jul 04, 2013 at 12:23PM, Konstantin Boudnik wrote:
> > Guys,
> > 
> > I have a working version of the assembly and will publish the pull request
> > shortly.
> > 
> > Non-standard classifiers are a lot of pain, as usual. Here's a question:
> > the 'hadoop2' profile isn't really a hadoop2 profile - in reality it is a
> > cdh4 profile. In my opinion it should be named as such, to avoid confusion.
> > Any objections to this?
> > 
> > Also, is there any particular reason to use 1.7.1.cloudera.2 instead of the
> > standard
> >   http://mvnrepository.com/artifact/org.apache.avro/avro/1.7.1
> > 
> > It is kinda 'frowned upon' to mix ASF and non-ASF-provided artifacts in a
> > release, and I think we might get called on that once the incubation process
> > is well underway (please correct me if I am wrong about this, but there were
> > discussions of the sort in the Hadoop project).
> > 
> > Looking forward to your input,
> >   Cos
> > 
> > I can fix it with a separate pull request as well, while I am at it.
> > On Mon, Jun 10, 2013 at 10:08PM, Konstantin Boudnik wrote:
> > > [moving to spark-developers, bcc'ing spark-users - my troubles are over:
> > > there's an irrelevant spark-dev@ group ;( ]
> > > 
> > > Matei,
> > > 
> > > That would be an ideal way to deal with the situation. If not, we can hack
> > > the assembly in Bigtop during the build (pretty suboptimal though); so I
> > > like the idea of having two different assemblies.
> > > 
> > > Cos
> > > 
> > > On Sun, Jun 09, 2013 at 04:03PM, Matei Zaharia wrote:
> > > > No worries, Cos. To comment on your proposal: we can also add separate
> > > > assembly targets for "normal" Spark users and BigTop. I believe the SBT
> > > > assembly tool allows that. For the one for "normal" users, we'll probably
> > > > include both Scala and a default version of hadoop-client, and they'd bring
> > > > in their own version of Hadoop only if they link to it specifically in their
> > > > own project.
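> > > > 
> > > > To make this concrete, here is a rough sketch of how the two targets could
> > > > look with the sbt-assembly plugin. This is illustrative only - the flag
> > > > name and the Hadoop version are made up, and this is not the actual build
> > > > file; the relevant behavior is that sbt-assembly leaves "provided"
> > > > dependencies out of the assembled jar:
> > > > 
> > > >   // Hypothetical SparkBuild.scala fragment (sketch, not Spark's real build).
> > > >   import sbt._
> > > >   import Keys._
> > > > 
> > > >   object SparkBuild extends Build {
> > > >     // Illustrative flag: -Dspark.bigtop=true builds the "thin" assembly
> > > >     // that expects Hadoop to be on the classpath at deployment time.
> > > >     val bigtopBuild = sys.props.getOrElse("spark.bigtop", "false").toBoolean
> > > >     val hadoopScope = if (bigtopBuild) "provided" else "compile"
> > > > 
> > > >     lazy val core = Project("core", file("core")).settings(
> > > >       // "provided" keeps hadoop-client out of the fat jar for Bigtop;
> > > >       // "compile" bundles it for "normal" users.
> > > >       libraryDependencies +=
> > > >         "org.apache.hadoop" % "hadoop-client" % "2.0.5-alpha" % hadoopScope
> > > >     )
> > > >   }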
> > > > 
> > > > Matei
> > > > 
> > > > >> Essentially, the goal of the proposed exercise is to achieve the following:
> > > > >>  - eliminate the need to package hadoop libraries and their transitive
> > > > >>    dependencies. This alone would cut the size of the dist package roughly
> > > > >>    in half. The startup protocol for Spark would need to change a bit to
> > > > >>    add the hadoop jars and their transitive dependencies to the classpath.
> > > > >> 
> > > > >>    I understand that the Bigtop deployment isn't the only scenario Spark
> > > > >>    is interested in, so once the assembly is done it might have to be
> > > > >>    massaged a little during packaging by Bigtop.
> > > > >> 
> > > > >>  - Scala redistribution. Currently, all the Scala bits are shaded into
> > > > >>    the same fat-jar. I think for a real system deployment it makes sense
> > > > >>    to simply make the Spark package depend on a Scala package. However,
> > > > >>    considering the somewhat lesser popularity of Scala among Linux
> > > > >>    distros, it might make sense - for Bigtop itself - to package and
> > > > >>    supply the needed version of Scala along with the distribution (see
> > > > >>    the sketch after this list). But this is different from this
> > > > >>    conversation and would be solved elsewhere.
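> > > > >> 
> > > > >> As a minimal sketch of the "depend on a Scala package" idea (illustrative
> > > > >> only; the version is just a plausible example, and "provided" assumes the
> > > > >> distro supplies scala-library on the classpath at runtime):
> > > > >> 
> > > > >>   // Hypothetical build fragment: don't shade scala-library into the
> > > > >>   // assembly; expect the distro's packaged Scala instead.
> > > > >>   libraryDependencies +=
> > > > >>     "org.scala-lang" % "scala-library" % "2.9.3" % "provided"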
> > > > >> 
> > > > >> It is damn hot today and my brain is melting, so I will try to put
> > > > >> something together in the next few days and will publish the pull request
> > > > >> for further consideration and discussion.
> > > > >> 
> > > > >> Regards,
> > > > >>  Cos
> > > > >> 
> > > > >> On Tuesday, June 4, 2013 11:07:14 PM UTC-7, Matt Massie wrote:
> > > > >>> Cos,
> > > > >>> 
> > > > >>> Thanks for the email. Good to hear from you.
> > > > >>> 
> > > > >>> Our plan is to clean up and simplify the Spark build for the 0.8 release.
> > > > >>> We've talked with leaders of other projects that integrate with Hadoop
> > > > >>> (e.g. Hive, Parquet) and the consensus was to use the "hadoop-client"
> > > > >>> artifact with a simple shim (e.g. HadoopShims, ContextUtil) that uses
> > > > >>> reflection at runtime. This approach will allow us to release a single
> > > > >>> artifact for Spark that is binary compatible with all versions of Hadoop.
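> > > > >>> 
> > > > >>> To illustrate the shim idea (a sketch only - not Spark's actual code, and
> > > > >>> the helper name is made up): Hadoop 1 ships TaskAttemptContext as a class,
> > > > >>> while Hadoop 2 turned it into an interface backed by TaskAttemptContextImpl.
> > > > >>> A reflection shim picks the right one at runtime, so a single binary works
> > > > >>> against both:
> > > > >>> 
> > > > >>>   import org.apache.hadoop.conf.Configuration
> > > > >>>   import org.apache.hadoop.mapreduce.{TaskAttemptContext, TaskAttemptID}
> > > > >>> 
> > > > >>>   object HadoopShim { // hypothetical name
> > > > >>>     private val contextClass: Class[_] =
> > > > >>>       try { // Hadoop 2.x: the concrete impl lives in a new package
> > > > >>>         Class.forName("org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl")
> > > > >>>       } catch { case _: ClassNotFoundException =>
> > > > >>>         // Hadoop 1.x: TaskAttemptContext is itself a concrete class
> > > > >>>         Class.forName("org.apache.hadoop.mapreduce.TaskAttemptContext")
> > > > >>>       }
> > > > >>> 
> > > > >>>     def newTaskAttemptContext(conf: Configuration,
> > > > >>>                               id: TaskAttemptID): TaskAttemptContext = {
> > > > >>>       val ctor = contextClass.getConstructor(classOf[Configuration],
> > > > >>>                                              classOf[TaskAttemptID])
> > > > >>>       ctor.newInstance(conf, id).asInstanceOf[TaskAttemptContext]
> > > > >>>     }
> > > > >>>   }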
> > > > >>> 
> > > > >>> I think, in general, the community will support this change if it simplifies
> > > > >>> deployment and works seamlessly. I believe it will.
> > > > >>> 
> > > > >>> If you're interested in helping with this effort, we'd love your help.
> > > > >>> Is the high-level approach of using hadoop-client with a shim in line
> > > > >>> with your thinking on how to avoid jar hell?
> > > > >>> 
> > > > >>> 
> > > > >>>> On Monday, June 3, 2013 11:07:14 PM UTC-7, Reynold Xin wrote:
> > > > >>>> 
> > > > >>>>   Moving the discussion to spark-dev, and copying Matt/Jey, as they
> > > > >>>> have looked into the binary packaging for Spark precisely around this
> > > > >>>> Hadoop dependency issue.
> > > > >>>> 
> > > > >>>>   FYI Cos, at Berkeley this morning we discussed some methods to allow a
> > > > >>>> single binary jar for Spark that would work with both Hadoop1 and Hadoop2.
> > > > >>>> Matt, can you comment on this?
> > > > >>>> 
> > > > >>>> 
> > > > >>>>>   On Mon, Jun 3, 2013 at 10:33 PM, Konstantin Boudnik <c....@apache.org> wrote:
> > > > >>>>> 
> > > > >>>>>       Guys,
> > > > >>>>> 
> > > > >>>>>       I am working on BIGTOP-715 to include the latest Spark into Bigtop's
> > > > >>>>>       Hadoop stack with Hadoop 2.0.5-alpha.
> > > > >>>>> 
> > > > >>>>>       As a temporary hack I am reusing the fat-jar created by the shade
> > > > >>>>>       plugin. And, as always with shading, there's something that looks
> > > > >>>>>       like a potential problem: by default, the repl-bin project packs all
> > > > >>>>>       hadoop dependencies into the same fat-jar. This essentially allows
> > > > >>>>>       deploying Spark independent of the presence of Hadoop's binaries,
> > > > >>>>>       making the Spark deb package pretty much standalone.
> > > > >>>>> 
> > > > >>>>>       However, it might create potential jar-hell issues: say Spark got
> > > > >>>>>       compiled against Hadoop 2.0.3-alpha and then I want to use it against
> > > > >>>>>       Hadoop 2.0.5-alpha. Both of these versions are binary compatible with
> > > > >>>>>       each other, so I should be able to re-use the Hadoop binaries readily
> > > > >>>>>       available on my Hadoop cluster, instead of installing a fresh yet
> > > > >>>>>       slightly different set of dependencies.
> > > > >>>>> 
> > > > >>>>>       Now, my understanding is that Spark doesn't really depend on low-level
> > > > >>>>>       HDFS or YARN APIs and only uses what's publicly available to a normal
> > > > >>>>>       client application. That makes it potentially possible to run Spark
> > > > >>>>>       against any Hadoop2 cluster using dynamic classpath configuration,
> > > > >>>>>       unless the concrete binaries are included in the package.
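> > > > >>>>> 
> > > > >>>>>       As a sketch of what that dynamic classpath configuration could look
> > > > >>>>>       like (illustrative only - the helper is hypothetical, and it assumes
> > > > >>>>>       the 'hadoop' launcher script is on the node's PATH), the startup code
> > > > >>>>>       could ask the installed Hadoop for its classpath instead of bundling
> > > > >>>>>       the jars:
> > > > >>>>> 
> > > > >>>>>         // Hypothetical helper, not part of Spark today.
> > > > >>>>>         import scala.sys.process._
> > > > >>>>> 
> > > > >>>>>         object DynamicClasspath {
> > > > >>>>>           // "hadoop classpath" prints the cluster's client classpath,
> > > > >>>>>           // e.g. "/usr/lib/hadoop/*:/usr/lib/hadoop/lib/*:..."
> > > > >>>>>           def hadoopClasspath(): String =
> > > > >>>>>             Seq("hadoop", "classpath").!!.trim
> > > > >>>>> 
> > > > >>>>>           // Append it to Spark's own jars when launching the daemons.
> > > > >>>>>           def fullClasspath(sparkJars: Seq[String]): String =
> > > > >>>>>             (sparkJars :+ hadoopClasspath()).mkString(":")
> > > > >>>>>         }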
> > > > >>>>> 
> > > > >>>>>       Would the dev community be willing to accept an improvement to the
> > > > >>>>>       binary packaging in the form of a proper assembly, instead of or in
> > > > >>>>>       parallel with the shaded fat-jar?
> > > > >>>>> 
> > > > >>>>>       --
> > > > >>>>>       Take care,
> > > > >>>>>               Cos