Posted to user@spark.apache.org by Anwar AliKhan <an...@gmail.com> on 2020/06/23 19:21:01 UTC

Where are all the jars gone?

Hi,

I prefer to do most of my projects in Python, and for that I use Jupyter.
I have been downloading the pre-built (compiled) version of Spark.

I do not normally like the source code version because the build process
makes me nervous. You know, with lines of stuff scrolling up the screen.
What am I going to do if a build fails? I am a user!

I decided to risk it, and it was only one mvn command to build (45 minutes
later).
Everything is great. Success.

I removed all JVMs except JDK 8 for the compilation.

I used JDK 8 so I know which libraries were linked in the build process.
I also used my local version of Maven, not the apt-installed version.

I used JDK 8 because if you go to this Scala site,
http://scala-ide.org/download/sdk.html, they say the requirement is JDK 8
for the IDE, even for Scala 2.12.
They don't say JDK 8 or higher, just JDK 8.

So anyway, once in a while I do Spark projects in Scala with Eclipse.

For that I don't use Maven or anything. I prefer to make use of the build path
and external jars. This way I know exactly which libraries I am linking to.

Creating a jar in Eclipse is straightforward for spark-submit.


Anyway, as you can see (below), I am pointing Jupyter to Spark with
findspark.init('/opt/spark').
That's OK; everything is fine.

With the pre-built version of Spark there is a jars directory, which I have
been using in Eclipse.



With my own compiled-from-source version there is no jars directory.


Where are all the jars gone?



I am not sure how findspark.init('/opt/spark') is locating the libraries,
unless it is finding them from Anaconda.


import findspark
findspark.init('/opt/spark')   # point findspark at the local Spark install

from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName('Titanic Data') \
    .getOrCreate()   # reuses an existing session if one is already running
<http://www.backbutton.co.uk/>

Re: Where are all the jars gone?

Posted by Anwar AliKhan <an...@gmail.com>.
I know I can arrive at the same result with this code,

      val range100 = spark.range(1, 101).agg(sum('id) as "sum").first.get(0)
      println(f"sum of range100 = $range100")

so I am not stuck;
I was just curious 😯 why the code breaks with the current linked
libraries.

spark.range(1,101).reduce(_+_)

spark-submit test

/opt/spark/spark-submit

spark.range(1,101).reduce(_+_)
<console>:24: error: overloaded method value reduce with alternatives:
  (func:
org.apache.spark.api.java.function.ReduceFunction[java.lang.Long])java.lang.Long
<and>
  (func: (java.lang.Long, java.lang.Long) => java.lang.Long)java.lang.Long
 cannot be applied to ((java.lang.Long, java.lang.Long) => scala.Long)
       spark.range(1,101).reduce(_+_)
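
Going by the error above, both reduce overloads want a function whose
arguments and result are java.lang.Long, while _ + _ on the boxed values
produces a scala.Long. One workaround that should compile under Scala 2.12
(a sketch, untested here, and assuming spark.implicits._ is in scope, as it
is automatically in spark-shell) is to convert to a Dataset of Scala Longs
before reducing:

import spark.implicits._   // provides the Encoder[Long] needed by .as[Long]

// reduce over Dataset[Long] instead of Dataset[java.lang.Long]
val total = spark.range(1, 101).as[Long].reduce(_ + _)
println(s"sum of range100 = $total")   // expected 5050
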
<http://www.backbutton.co.uk/>


On Wed, 24 Jun 2020, 19:54 Anwar AliKhan, <an...@gmail.com> wrote:

>
> I am using the method describe on this page for Scala development in
> eclipse.
>
> https://data-flair.training/blogs/create-spark-scala-project/
>
>
> in the middle of the page you will find
>
>
> *“y**ou will see lots of error due to missing libraries.*
> viii. Add Spark Libraries”
>
>
> Now that I have my own build I will be pointing to the jars (spark
> libraries)
>
> in directory /opt/spark/assembly/target/scala-2.12/jars
>
>
> This way I know exactly the jar libraries I am using to remove the
> formentioned errors.
>
>
> At the same time I am trying to setup a template environment as shown here
>
>
> https://medium.com/@faizanahemad/apache-spark-setup-with-gradle-scala-and-intellij-2eeb9f30c02a
>
>
> so that I can have variables sc and spark in the eclipse editor same you
> would have spark, sc variables in the spark-shell.
>
>
> I used the word trying because the following code is broken
>
>
> spark.range(1,101).reduce(_ + _)
>
> with latest spark.
>
>
> If I use the gradle method as described then the code does work because
> it is pulling the libraries from maven repository as stipulated in
> gradle.properties
> <https://github.com/faizanahemad/spark-gradle-template/blob/master/gradle.properties>
> .
>
>
> In my previous post I *forget* with maven pom.xml you can actually
> specify version number of jar you want to pull from maven repository using *mvn
> clean package *command.
>
>
> So even if I use maven with eclipse then any new libraries uploaded in
> maven repository by developers will have recent version numbers. So will
> not effect my project.
>
> Can you please tell me why the code spark.range(1,101).reduce(_ + _) is
> broken with latest spark ?
>
>
> <http://www.backbutton.co.uk/>
>
>
> On Wed, 24 Jun 2020, 17:07 Jeff Evans, <je...@gmail.com>
> wrote:
>
>> If I'm understanding this correctly, you are building Spark from source
>> and using the built artifacts (jars) in some other project.  Correct?  If
>> so, then why are you concerning yourself with the directory structure that
>> Spark, internally, uses when building its artifacts?  It should be a black
>> box to your application, entirely.  You would pick the profiles (ex: Scala
>> version, Hadoop version, etc.) you need, then the install phase of Maven
>> will take care of building the jars and putting them in your local Maven
>> repo.  After that, you can resolve them from your other project seamlessly
>> (simply by declaring the org/artifact/version).
>>
>> Maven artifacts are immutable, at least released versions in Maven
>> central.  If "someone" (unclear who you are talking about) is "swapping
>> out" jars in a Maven repo then they're doing something extremely strange
>> and broken, unless they're simply replacing snapshot versions, which is a different
>> beast entirely
>> <https://maven.apache.org/guides/getting-started/index.html#What_is_a_SNAPSHOT_version>
>> .
>>
>> On Wed, Jun 24, 2020 at 10:39 AM Anwar AliKhan <an...@gmail.com>
>> wrote:
>>
>>> THANKS
>>>
>>>
>>> It appears the directory containing the jars have been switched from
>>> download version to source version.
>>>
>>> In the download version it is just below parent directory called jars.
>>> level 1.
>>>
>>> In the git source version it is  4 levels down in the directory
>>>  /spark/assembly/target/scala-2.12/jars
>>>
>>> The issue I have with using maven is that the linking libraries can be
>>> changed at maven repository without my knowledge .
>>> So if an application compiled and worked previously could just break.
>>>
>>> It is not like when the developers make a change to the link libraries
>>> they run it by me first ,😢  they just upload it to maven repository with
>>> out asking me if their change
>>> Is going to impact my app.
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Wed, 24 Jun 2020, 16:07 ArtemisDev, <ar...@dtechspace.com> wrote:
>>>
>>>> If you are using Maven to manage your jar dependencies, the jar files
>>>> are located in the maven repository on your home directory.  It is usually
>>>> in the .m2 directory.
>>>>
>>>> Hope this helps.
>>>>
>>>> -ND
>>>> On 6/23/20 3:21 PM, Anwar AliKhan wrote:
>>>>
>>>> Hi,
>>>>
>>>> I prefer to do most of my projects in Python and for that I use Jupyter.
>>>> I have been downloading the compiled version of spark.
>>>>
>>>> I do not normally like the source code version because the build
>>>> process makes me nervous.
>>>> You know with lines of stuff   scrolling up the screen.
>>>> What am I am going to do if a build fails. I am a user!
>>>>
>>>> I decided to risk it and it was only one  mvn command to build. (45
>>>> minutes later)
>>>> Everything is great. Success.
>>>>
>>>> I removed all jvms except jdk8 for compilation.
>>>>
>>>> I used jdk8 so I know which libraries where linked in the build process.
>>>> I also used my local version of maven. Not the apt install version .
>>>>
>>>> I used jdk8 because if you go this scala site.
>>>>
>>>> http://scala-ide.org/download/sdk.html. they say requirement  jdk8 for
>>>> IDE
>>>>  even for scala12.
>>>> They don't say JDK 8 or higher ,  just jdk8.
>>>>
>>>> So anyway  once in a while I  do spark projects in scala with eclipse.
>>>>
>>>> For that I don't use maven or anything. I prefer to make use of build
>>>> path
>>>> And external jars. This way I know exactly which libraries I am linking
>>>> to.
>>>>
>>>> creating a jar in eclipse is straight forward for spark_submit.
>>>>
>>>>
>>>> Anyway  as you can see (below) I am pointing jupyter to find
>>>> spark.init('opt/spark').
>>>> That's OK everything is fine.
>>>>
>>>> With the compiled version of spark there is a jar directory which I
>>>> have been using in eclipse.
>>>>
>>>>
>>>>
>>>> With my own compiled from source version there is no jar directory.
>>>>
>>>>
>>>> Where are all the jars gone  ?.
>>>>
>>>>
>>>>
>>>> I am not sure how findspark.init('/opt/spark') is locating the
>>>> libraries unless it is finding them from
>>>> Anaconda.
>>>>
>>>>
>>>> import findspark
>>>> findspark.init('/opt/spark')
>>>> from pyspark.sql import SparkSession
>>>> spark = SparkSession \
>>>>     .builder \
>>>>     .appName('Titanic Data') \
>>>>     .getOrCreate()
>>>>
>>>>

Re: Where are all the jars gone?

Posted by Anwar AliKhan <an...@gmail.com>.
I am using the method described on this page for Scala development in
Eclipse.

https://data-flair.training/blogs/create-spark-scala-project/


In the middle of the page you will find:

"you will see lots of error due to missing libraries.
viii. Add Spark Libraries"


Now that I have my own build, I will be pointing to the jars (Spark
libraries) in the directory /opt/spark/assembly/target/scala-2.12/jars.


This way I know exactly which jar libraries I am using to remove the
aforementioned errors.


At the same time I am trying to set up a template environment, as shown here:

https://medium.com/@faizanahemad/apache-spark-setup-with-gradle-scala-and-intellij-2eeb9f30c02a


so that I can have the variables sc and spark available in the Eclipse
editor, the same as you have the spark and sc variables in the spark-shell.
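
A minimal sketch of the kind of entry point I have in mind (the object name
and the local[*] master are just placeholders, not taken from the template):

import org.apache.spark.sql.SparkSession

object SparkTemplate {
  def main(args: Array[String]): Unit = {
    // recreate the spark and sc handles that spark-shell gives you,
    // running locally inside the IDE
    val spark = SparkSession.builder()
      .appName("eclipse-template")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    println(sc.parallelize(1 to 100).sum())   // quick smoke test, prints 5050.0

    spark.stop()
  }
}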


I used the word trying because the following code is broken with the latest
Spark:

spark.range(1, 101).reduce(_ + _)


If I use the Gradle method as described, then the code does work, because it
is pulling the libraries from the Maven repository as stipulated in
gradle.properties
<https://github.com/faizanahemad/spark-gradle-template/blob/master/gradle.properties>.


In my previous post I forgot that with a Maven pom.xml you can actually
specify the version number of the jar you want to pull from the Maven
repository when using the mvn clean package command.


So even if I use Maven with Eclipse, any new libraries uploaded to the
Maven repository by developers will have newer version numbers, so they will
not affect my project.

Can you please tell me why the code spark.range(1, 101).reduce(_ + _) is
broken with the latest Spark?


<http://www.backbutton.co.uk/>


On Wed, 24 Jun 2020, 17:07 Jeff Evans, <je...@gmail.com>
wrote:


Re: Where are all the jars gone?

Posted by Jeff Evans <je...@gmail.com>.
If I'm understanding this correctly, you are building Spark from source and
using the built artifacts (jars) in some other project.  Correct?  If so,
then why are you concerning yourself with the directory structure that
Spark, internally, uses when building its artifacts?  It should be a black
box to your application, entirely.  You would pick the profiles (ex: Scala
version, Hadoop version, etc.) you need, then the install phase of Maven
will take care of building the jars and putting them in your local Maven
repo.  After that, you can resolve them from your other project seamlessly
(simply by declaring the org/artifact/version).
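
For illustration only, here is a hypothetical dependency declaration in
sbt's Scala DSL (the thread uses Maven and Gradle, but the same Maven
coordinates work from any of them; the version string is just a placeholder
for whatever version you built and installed):

libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.0" % "provided"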

Maven artifacts are immutable, at least released versions in Maven
Central.  If "someone" (unclear who you are talking about) is "swapping
out" jars in a Maven repo then they're doing something extremely strange
and broken, unless they're simply replacing snapshot versions, which is a
different beast entirely
<https://maven.apache.org/guides/getting-started/index.html#What_is_a_SNAPSHOT_version>.

On Wed, Jun 24, 2020 at 10:39 AM Anwar AliKhan <an...@gmail.com>
wrote:


Re: Where are all the jars gone?

Posted by Anwar AliKhan <an...@gmail.com>.
THANKS


It appears the directory containing the jars has moved between the
download version and the source version.

In the download version it is just below the parent directory, in a
directory called jars (one level down).

In the git source version it is four levels down, in the directory
 /spark/assembly/target/scala-2.12/jars

The issue I have with using Maven is that the linked libraries can be
changed in the Maven repository without my knowledge,
so an application that compiled and worked previously could just break.

It is not as if, when the developers make a change to the linked libraries,
they run it by me first 😢; they just upload it to the Maven repository
without asking me whether their change is going to impact my app.






On Wed, 24 Jun 2020, 16:07 ArtemisDev, <ar...@dtechspace.com> wrote:


Re: Where are all the jars gone?

Posted by ArtemisDev <ar...@dtechspace.com>.
If you are using Maven to manage your jar dependencies, the jar files
are located in the local Maven repository in your home directory. It is
usually the .m2 directory.

Hope this helps.

-ND
