Posted to dev@tinkerpop.apache.org by Marc de Lignie <m....@xs4all.nl> on 2017/07/06 09:02:03 UTC

spark-yarn recipe

Hi Stephen,

I recently posted recipes on the gremlin and janusgraph users lists to 
configure the binary distributions to work with a spark-yarn cluster. I 
think it would be useful to have the TinkerPop recipe included in the Apache 
TinkerPop repo itself in the following way:

  - include the spark-yarn dependency in spark-gremlin

  - add the recipe to the docs so that it is actually run in the 
existing documentation environment at build time

In this way:

  - the recipe would be less clumsy for users to follow (no external deps)

  - the recipe would be maintained and still work after version upgrades

I do not have to remind you that many users have had problems with 
spark-yarn and that the ability to run OLAP queries on an existing 
cluster is one of the attractive features of TinkerPop.

This brings me to the question: do you see potential obstacles in 
accepting a PR along these lines? I will probably wait for some time 
until actually doing this, though, to have more opportunity to "eat my 
own dogfood" and see if changes are still required.

Cheers,   HadoopMarc



Re: spark-yarn recipe

Posted by Stephen Mallette <sp...@gmail.com>.
I did see that - I was wondering if anyone would try to convert that into
TinkerPop documentation of some sort. I'll save my less positive comments
for the end and first just say what you could do if everyone is into this
idea. You could add it to the "Implementation Recipes" subsection of the
"Recipes" document.

>  - include the spark-yarn dependency in spark-gremlin

I could be wrong, but I don't think you need to add that as a direct
dependency. If we don't need it for compilation it probably shouldn't be in
the pom.xml. If you just need extra jars to come with the plugin to the
console when you do:

:install org.apache.tinkerpop spark-gremlin 3.2.5

you can just add a manifest entry to spark-gremlin to suck in additional
jars as part of that.  Note that we already do this with spark-gremlin -
see:

https://github.com/apache/tinkerpop/blob/0d532aa91e0c9bc775c36d9572f5f816d323abb6/spark-gremlin/pom.xml#L406
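
For illustration, such a manifest entry looks roughly like the sketch below. The first coordinate reflects what I recall being in spark-gremlin's pom.xml; the spark-yarn coordinate and all version numbers are hypothetical additions, not tested:

```xml
<!-- Sketch of a Gremlin-Plugin-Dependencies manifest entry in spark-gremlin/pom.xml.
     The spark-yarn coordinate is a hypothetical addition; versions are illustrative. -->
<archive>
  <manifestEntries>
    <Gremlin-Plugin-Dependencies>
      org.apache.hadoop:hadoop-client:2.7.2;org.apache.spark:spark-yarn_2.10:1.6.1
    </Gremlin-Plugin-Dependencies>
  </manifestEntries>
</archive>
```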

dependencies are semi-colon separated, so you can just add more after that
entry. As for:

> do you see potential obstacles in accepting a PR along these lines?

Are there any other dependencies to add? Like, the blog post says you
tested on Hortonworks Data Platform sandbox - do we need that in the mix
too?

....and here's where i get sorta cringy as I alluded to at the start of
this......the only problem i'm concerned about is the one you posted:

> the recipe would be maintained and still work after version upgrades

that terrifies me. personally speaking, i'm terribly uninterested in
hunting down spark to the yarn to hadoop to the hortonworks to the cloudera
to the map-red-env.sh to the yarn-site.xml type of errors. it's not a nice
place at all. If that integration starts to fail for some reason our docs
will effectively be broken and someone is going to have to go down into
that ungodly hole of demons to unblock us and i'm scared of the dark.

on the flip side, i'm sensitive to users struggling with yarn stuff and
every time i see you solve a problem like that on the mailing list related
to that, i'm like "All hail the Tamer of Hadoop! Long live HadoopMarc!"
- so it seems like this is a need to some degree so it would be nice if we
could make it work somehow. Anyway - those are my thoughts on the matter.
Let's see what other people have to say.








Re: Re: spark-yarn recipe

Posted by Stephen Mallette <sp...@gmail.com>.
> First, I only intended to "recipe test in the doc environment" against
the vanilla Apache Hadoop

good!

>  Does this mean that spark-gremlin as a plugin from the gremlin console is
not really tested (but only as a module)?

I'm not sure about what you're asking. The spark-gremlin plugin is tested
through the docs directly at an integration level, but if you look at the
plugin code itself there isn't much to unit test - it's just imports and
bindings to the console or scriptengine. Obviously spark-gremlin as a
module has a pretty fat bag of unit/integration tests executed over it.

> And is the manifest entry at all necessary, since spark-gremlin depends
on hadoop-gremlin, which depends on hadoop-client?

I don't have a clear memory of this, but I recall the Spark dependency tree
being a zoo of madness. Why the enforcer plugin is not used on every maven
project is beyond me. I further have a hazy recollection that Grape was not
pulling all the dependencies required to submit jobs from the console. In any
case, that whirlwind of stuff probably made us add that entry to further
instruct Grape to grab extra stuff.

> That would speak in favor of including spark-yarn (with the proper
excludes) as a spark-gremlin dependency. It would also be consistent with
hadoop-yarn.jar hanging around already :-)

I'm not really well informed on Spark so I shouldn't hold up what others
want to do. Maybe you could add it as <optional> - I think Grape would pull
that with the plugin...you'd have to test I guess.
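
A minimal sketch of what that could look like in spark-gremlin's pom.xml - the version property and the exclusions are assumptions that would need testing, per Marc's point about "the proper excludes":

```xml
<!-- Hypothetical <optional> spark-yarn dependency for spark-gremlin/pom.xml;
     the version property and exclusions are illustrative, not verified. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-yarn_2.10</artifactId>
  <version>${spark.version}</version>
  <optional>true</optional>
  <exclusions>
    <!-- hadoop artifacts already come in via hadoop-gremlin -->
    <exclusion>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```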

>  I would first try the spark-yarn recipe from the documentation
environment to see how this works out. Then I can come back with more
specific questions.

Cool - Thanks HadoopMarc




Re: Re: spark-yarn recipe

Posted by Marc de Lignie <m....@xs4all.nl>.
Hi Stephen,

Thanks for your valuable comments, which certainly changed my view on 
this matter.

First, I only intended to "recipe test in the doc environment" against 
the vanilla Apache Hadoop; the commercial providers can cater for 
themselves. I was not clear on that one.

Secondly, I was not aware of the manifest entry you pointed at. Since 
all dependency convergence conflicts of the spark-gremlin module are 
managed away manually in the pom dependency section, I had not expected 
the spark-gremlin plugin to have a backdoor that reintroduces some of 
these excluded dependencies. Does this mean that spark-gremlin as a plugin 
from the gremlin console is not really tested (but only as a module)? And 
is the manifest entry at all necessary, since spark-gremlin depends on 
hadoop-gremlin, which depends on hadoop-client?  OK, sorry, too many 
questions; it works as it is and the hadoop deps are a jungle in 
general, as you note. Let's just keep this in the back of our minds.

Apart from my recipe question, it still would be nice to be able define 
a java project with just spark-gremlin and hadoop-gremlin dependencies 
and being able to connect to a yarn cluster. Implicitly, yarn support is 
in the spark-gremlin API because spark-gremlin accepts the 
spark.master=yarn-client property from the HadoopGraph. That would speak 
in favor of including spark-yarn (with the proper excludes) as a 
spark-gremlin dependency. It would also be consistent with 
hadoop-yarn.jar hanging around already :-)
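
For illustration, the kind of HadoopGraph properties file I have in mind would look roughly like this - the input location, memory setting and reader class are placeholders, not a tested configuration:

```
# Sketch of a HadoopGraph configuration targeting an existing yarn cluster;
# paths and tuning values are illustrative placeholders.
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat
gremlin.hadoop.inputLocation=tinkerpop-modern.kryo
gremlin.hadoop.outputLocation=output
# the one line that hands the OLAP job to yarn instead of a local Spark
spark.master=yarn-client
spark.executor.memory=1g
spark.serializer=org.apache.spark.serializer.KryoSerializer
```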

For now, your concerns are clear to me. If I want to proceed on this, I 
would first try the spark-yarn recipe from the documentation environment 
to see how this works out. Then I can come back with more specific 
questions.

Cheers,   Marc



-- 
Marc de Lignie