Posted to dev@hive.apache.org by Edward Capriolo <ed...@gmail.com> on 2017/03/02 14:55:25 UTC

[DISCUSS] Spark's fork of hive

All,

I have compiled a short (non-exhaustive) list of items related to Spark's
forking of Apache Hive code and usage of Apache Hive trademarks.

1)
----------------------------
The original Spark proposal repeatedly claims that Spark "inter-operates"
with Hive.

https://wiki.apache.org/incubator/SparkProposal

"Finally, Shark (a higher layer framework built on Spark) inter-operates
with Apache Hive."

(EC note: Originally Spark may only have linked against Hive, but the
situation is now much different.)
-------------------------

2)
------------------
Spark distributes jar files to Maven repositories carrying the Hive name.

https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec

(EC note: These are not simple "ports"; features are added, missing, or
broken in artifacts named "hive".)
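
To make the naming hazard concrete, here is an illustrative sbt snippet
(the forked version string is an assumption, based on artifacts published
under that group id):

    // Two nearly identical coordinates, very different contents:
    libraryDependencies += "org.apache.hive" % "hive-exec" % "1.2.1"
    libraryDependencies += "org.spark-project.hive" % "hive-exec" % "1.2.1.spark2"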
-----------------------

3)
---------------------------------
Spark carries forked and modified copies of Hive source code.

https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionHookContextImpl.java
--------------------------------------------

4)
-------------------------------
Spark has "imported" and modified components of Hive.


https://issues.apache.org/jira/browse/SPARK-12572

(EC note: Further discussions of the code, including promotional material,
make little or no reference to its origins.)
---------------------------------------------

5)
--------------------------------
Databricks, a company heavily involved in Spark development, uses the Hive
trademark to make claims.

https://databricks.com/blog/2017/01/30/integrating-central-hive-metastore-apache-spark-databricks.html

"The Databricks platform provides a fully managed Hive Metastore that
allows users to share a data catalog across multiple Spark clusters."
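
For context, pointing a Spark cluster at a shared external metastore is a
small configuration step. A minimal Scala sketch, with a hypothetical
metastore host ("hive.metastore.uris" is standard Hive configuration):

    import org.apache.spark.sql.SparkSession

    // Every cluster configured with the same metastore URI sees the same
    // catalog; "shared-metastore" is a hypothetical host name.
    val spark = SparkSession.builder()
      .appName("catalog-demo")
      .config("hive.metastore.uris", "thrift://shared-metastore:9083")
      .enableHiveSupport()
      .getOrCreate()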


This draft wiki page defining Hadoop is clear on this:
https://wiki.apache.org/hadoop/Defining%20Hadoop

"Products that are derivative works of Apache Hadoop are not Apache Hadoop,
and may not call themselves versions of Apache Hadoop, nor Distributions of
Apache Hadoop."

--------------------

6)
----------------------
https://databricks.com/blog/2017/01/30/integrating-central-hive-metastore-apache-spark-databricks.html

"Apache Spark supports multiple versions of Hive, from 0.12 up to 1.2.1. "

Apache Spark can NOT support multiple versions of Hive, because it is
working with a fork, and there is no standards body for "supporting Hive".
Again, the draft "Defining Hadoop" page is on point:

Some products have been released that have been described as "compatible"
with Hadoop, even though parts of the Hadoop codebase have either been
changed or replaced. The Apache™ Hadoop® developer team are not a standards
body: they do not qualify such (derivative) works as compatible. Nor do
they feel constrained by the requirements of external entities when
changing the behavior of Apache Hadoop software or related Apache software.
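
As far as I can tell, what Spark documents as multi-version support is a
metastore-client setting. A minimal Scala sketch (the configuration keys
are from Spark's own documentation; the version value is illustrative):

    import org.apache.spark.sql.SparkSession

    // Selects which Hive metastore client Spark talks to; the execution
    // classes remain Spark's bundled fork regardless of this setting.
    val spark = SparkSession.builder()
      .appName("metastore-version-demo")
      .config("spark.sql.hive.metastore.version", "0.13.1")
      .config("spark.sql.hive.metastore.jars", "maven") // fetch matching client jars
      .enableHiveSupport()
      .getOrCreate()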
-----------------------

7)
---------------------------------
The Spark committers openly use the word "take" to describe the process of
"importing" Hive code.

https://github.com/apache/spark/pull/10583/files
"are there unit tests from Hive that we can take?"

The Apache foundation will not accept a hostile fork as a proposal. Had the
original Spark proposal implied they wished to fork portions of the Hive
code base, I would have considered it a hostile fork (this is open to
interpretation).

(EC Note: Is this the Apache Way? How can we build communities? How would
small projects feel if, for example, Hive "imported" their code by copying
it while they sat in incubation?)
------------------------------

8)
----------------------------
Databricks (after borrowing slabs of Hive code, using our trademarks, etc.)
makes disparaging comments about the performance of Hive.

https://databricks.com/blog/2017/02/28/voice-facebook-using-apache-spark-large-scale-language-model-training.html

"Spark-based pipelines can scale comfortably to process many times more
input data than what Hive could handle at peak. "

(EC Note: How is this statement verifiable?)
-----------------------------------------------

9)
--------------------------
https://issues.apache.org/jira/browse/SPARK-10793

"It's easily enough added, to the code, there's just the risk of the fork
diverging more from ASF hive."

(EC Note: Even those responsible for this admit the code is diverging, and
that their actions will make it diverge further.)
------------------------

10)
----------------------

My opinion of all of this:
The above points are hurtful to Hive. First, we are robbed of community.
People could be improving Hive by making it more modular, but instead they
are improving Spark's fork of Hive. Next, our code base is subject to
continued "poaching": Apache Spark "imports", copies, alters, and claims
compatibility with Hive (I pointed out above why the compatibility claims
should not be made). Finally, we are subject to unfair performance
comparisons ("x is faster than Hive") by software (Spark) that is
essentially

*POWERED BY Hive (via the forking and code copying).*

As the best SQL-on-Hadoop engine, Hive has always had a bullseye on its back:
https://vision.cloudera.com/impala-v-hive/

In my hood we have a saying: "Haters gonna hate."

For every Impala and every Spark claiming to be better than Hive, there are
ten HadoopDBs that collapsed under their own weight. We have outlasted
fleets of them.

That being said, software like the Hive Metastore is our baby. It is our
trademark. It is our creation. It is what makes us special. People have the
right to fork it under the license; we cannot stop that. But it cannot be
both ways: either downstream brings in our published artifacts, or they
fork and give what they are doing another name.

None of this activity represents what I believe is the "Apache Way". I
believe the Apache Way would be to communicate with us, the Hive community,
about ways to make the components more modular and easier to use in other
projects. Users suffer when the same code "moves" between two projects:
there is fragmentation, and it typically leads to negative effects for both
projects.


--------------------------------------

Thanks,
Edward

Re: [DISCUSS] Spark's fork of hive

Posted by Edward Capriolo <ed...@gmail.com>.
On Thu, Mar 2, 2017 at 2:08 PM, Alan Gates <al...@gmail.com> wrote:

> [Alan's reply, including his full quote of the original message, trimmed;
> see his full message below. His point 2, to which I respond here:]
>
> 2) I agree that they should not call Hive what they incorporate into
> Spark.  In particular shipping maven jars with org.apache.hive that do not
> contain the same functionality as ours seems problematic.  IIRC the Hive
> community raised concerns about this before with the Spark community.  I
> don't recall the outcome.  But it would make sense to me to approach the
> Spark community and ask that they not do this.

I believe we should start with a conversation, but resolving merely to ask
is not enough. As you pointed out, concerns were raised before and there
was no outcome.

Carl also had a very simple ask in this ticket.

https://issues.apache.org/jira/browse/SPARK-5916

"The naming conflict is unfortunate. However, scripts like this are a
public API in Spark, so I don't think we can remove this API randomly in a
minor release given our versioning policies."

Nothing ever materialized and the issue was closed as "Won't Fix".

Asking implies we require permission. We (hive-dev/PMC) need to come to an
agreement on whether others are allowed to release forked/modified code
carrying the name Hive in their code base.

For example, the user of Spark calls this method:

SparkSession.builder.appName("myapp").enableHiveSupport()
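
Spelled out, a minimal Scala sketch (the application name is illustrative):

    import org.apache.spark.sql.SparkSession

    // "Hive support" here enables Spark's bundled copy of Hive classes,
    // not an Apache Hive release.
    val spark = SparkSession.builder()
      .appName("myapp")
      .enableHiveSupport()
      .getOrCreate()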

Is the user being led to believe they are getting support of Apache Hive?

Re: [DISCUSS] Spark's fork of hive

Posted by Alan Gates <al...@gmail.com>.
I think the issues you point out fall into a couple of different buckets:

1) Spark forking Hive code.  As you point out, based on the license, there's nothing wrong with this (see naming concerns below).  I agree it's crappy technical practice because in the long term Hive and Spark will diverge and the Spark community will either give up on interoperability or spend more and more time maintaining it.  But if their MO is "we take the best of whatever you write and include it in Spark", then I think all we can do about it is (a) remember that imitation is the sincerest form of flattery; and (b) see what of theirs we can incorporate into Hive.

2) I agree that they should not call Hive what they incorporate into Spark.  In particular shipping maven jars with org.apache.hive that do not contain the same functionality as ours seems problematic.  IIRC the Hive community raised concerns about this before with the Spark community.  I don't recall the outcome.  But it would make sense to me to approach the Spark community and ask that they not do this.

As for them dissing on us in benchmarks, we all know you can set up Hive to run like a mule (use MR on text files) and people do it all the time to make their stuff look good.  I'm not sure what to do about that other than publish our own benchmarks showing what Hive can do.

Alan.

> On Mar 2, 2017, at 6:55 AM, Edward Capriolo <ed...@gmail.com> wrote:
>
> [original message quoted in full; trimmed]


Re: [DISCUSS] Spark's fork of hive

Posted by Edward Capriolo <ed...@gmail.com>.
On Thu, Mar 2, 2017 at 12:35 PM, Gopal Vijayaraghavan <go...@apache.org>
wrote:

> [Gopal's reply, including his full quote of the original message,
> trimmed; see his message below.]
>
Thank you for replying.

http://grokbase.com/t/hive/dev/15cjb3kjvn/using-the-hive-sql-parser-in-spark

"Under the Apache license, there's no actual restriction against a hostile
embrace-extend by copying hive's code verbatim as long as the fork retains
license notices."

There is a difference between managing a project the "Apache Way" and what
is ultimately allowable under the Apache License.

This exchange highlights the problem: they justify the fork by citing the
lack of any restriction on embrace-and-extend.

However:

"We do have a pretty comprehensive suite of Hive compatibility
tests (by using the Hive tests directly) to ensure SQL compatibility with
Hive"

 https://wiki.apache.org/hadoop/Defining%20Hadoop

  Some products have been released that have been described as "compatible"
  with Hadoop, even though parts of the Hadoop codebase have either been
  changed or replaced. The Apache™ Hadoop® developer team are not a
  standards body: they do not qualify such (derivative) works as compatible.
  Nor do they feel constrained by the requirements of external entities when
  changing the behavior of Apache Hadoop software or related Apache software.

They assert "compatibility" via their own "pretty comprehensive" suite.

This is not a one-off statement; I pointed this out in points 5 and 6
above. It happens repeatedly: when convenient, "Hive compatibility" is
asserted both on Apache lists and in marketing materials.

Re: [DISCUSS] Spark's fork of hive

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> Had the original Spark proposal implied they wished to fork portions of the hive
> code base, I would have considered it a hostile fork. (this is open to interpretation).

FYI, I did previously ask bluntly whether Spark intends to cut-and-paste Hive code into their repos & got an affirmative answer from rxin.

http://grokbase.com/t/hive/dev/15cjb3kjvn/using-the-hive-sql-parser-in-spark

> People have the right to fork it via the licence. We can not stop that.

Later, I did get a response that they never made a release with said copy-paste & that they deprecated the "HiveContext" object in Spark 2.0.

> than what Hive could handle at peak."
> 
>  (EC Note: How is this statement verifiable?)

Reading about Hive at Facebook, I feel like we've already solved those problems that were due to FB Corona + Hadoop-1 (or, 0.20 *shudder*) limitations.

Spark does not need to be limited by Corona, and the version of Hive being compared might not have YARN or Tez on its side.

Cheers,
Gopal

On 3/2/17, 8:25 PM, "Edward Capriolo" <ed...@gmail.com> wrote:

    [original message quoted in full; trimmed]