Posted to user@spark.apache.org by Vikas Garg <sp...@gmail.com> on 2019/07/05 04:33:54 UTC

Learning Spark

Hi,

I am a new Spark learner. Can someone guide me on a strategy for
getting expertise in PySpark?

Thanks!!!

Re: Learning Spark

Posted by ayan guha <gu...@gmail.com>.
My best advice is to go through the docs and watch lots of demos/videos
from Spark committers.

On Fri, 5 Jul 2019 at 3:03 pm, Kurt Fehlhauer <kf...@gmail.com> wrote:

> Are you a data scientist or data engineer?
>
--
Best Regards,
Ayan Guha

Re: Learning Spark

Posted by "Alex A. Reda" <al...@gmail.com>.
Hello,

I second Gourav's point about the book "Spark: The Definitive Guide".
It is great for learning both Scala-based and Python-based Spark. But as
others mentioned, you will need to keep reading the documentation, as Spark
is still undergoing a lot of improvements. I list additional resources below,
no plug :)

- Excellent training on Spark 2 on Udemy by Jose Portilla. This one is on
PySpark; he also has a course on Scala. Not super advanced, but it covers
the basics very well.
https://www.udemy.com/apache-spark-with-python-big-data-with-pyspark-and-spark/



- Great book on Spark 2, "Learning PySpark" by Tomasz Drabas and Denny Lee.
https://www.packtpub.com/big-data-and-business-intelligence/learning-pyspark
(Read Chapters 1, 2, 4, and 6 to get immediate benefits.)

- Great book on Spark, "Spark: The Definitive Guide" by Bill Chambers and
Matei Zaharia - so far the best in the resource lineup for both Scala-based
and Python-based Spark.
https://www.amazon.com/Spark-Definitive-Guide-Processing-Simple/dp/1491912219/ref=sr_1_1?ie=UTF8&qid=1540567390&sr=8-1&keywords=spark+the+definitive+guide
(Parts I, II, and VI are the most important to get started.) Apparently they
have a new edition; I am referring to the 2017 edition.


- A bit dated now because Spark has evolved so much, but I like Jeffrey
Aven's book and style of writing too: "Sams Teach Yourself Apache Spark in
24 Hours"
<https://www.amazon.com/Apache-Spark-Hours-Teach-Yourself/dp/0672338513/ref=sr_1_1?crid=75O5XD7JSREF&keywords=apache+spark+in+24+hours%2C+sams+teach+yourself&qid=1562333740&s=gateway&sprefix=sams+teach++apache+spark%2Caps%2C156&sr=8-1>

In terms of actually learning, I would suggest practicing with code. In my
experience you are better off installing Spark on your local PC; I found
this a much better way of learning than using an enterprise cluster.
Depending on which route you take, if you decide to focus on PySpark,
learning scikit-learn will give you a lot of transferable skills.
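
A minimal sketch of what practicing on a local install can look like,
assuming PySpark was installed with `pip install pyspark` and a Java runtime
is available. The sample data, column names, and the helper names
can_run_spark and run_local_demo are made up for illustration; they do not
come from any of the books or courses above.

```python
import importlib.util
import os
import shutil


def can_run_spark() -> bool:
    """True if PySpark is importable and a Java runtime appears to be installed."""
    has_pyspark = importlib.util.find_spec("pyspark") is not None
    has_java = shutil.which("java") is not None or bool(os.environ.get("JAVA_HOME"))
    return has_pyspark and has_java


def run_local_demo():
    """Sum hypothetical scores per language on a Spark session running on this machine."""
    # Imported here so the file still loads on a machine without PySpark.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # "local[*]" runs Spark inside this process on all local cores; no cluster needed.
    spark = SparkSession.builder.master("local[*]").appName("learning-spark").getOrCreate()
    try:
        # Made-up sample data, for illustration only.
        df = spark.createDataFrame(
            [("scala", 3), ("python", 5), ("python", 2)],
            ["language", "score"],
        )
        totals = df.groupBy("language").agg(F.sum("score").alias("total")).orderBy("language")
        return [(row["language"], row["total"]) for row in totals.collect()]
    finally:
        # Always release the local Spark session, even if the query fails.
        spark.stop()


if __name__ == "__main__":
    if can_run_spark():
        print(run_local_demo())
    else:
        print("PySpark or Java not found; try `pip install pyspark` and install a JDK.")
```

Running the file starts Spark in local mode, builds a tiny DataFrame, and
prints one (language, total) pair per language. The same DataFrame code would
run unchanged against a cluster; only the master setting would differ.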

One final note: I am offering these suggestions from the perspective of a
data scientist.

Kind regards,

Alex Reda







On Fri, Jul 5, 2019 at 9:24 AM Gourav Sengupta <go...@gmail.com>
wrote:

> Okay, this is all something I would disagree with.
>
> Dr. Matei Zaharia created Spark.
> He and Bill Chambers recently wrote a book on Spark.
> He is still the main thinking power behind Spark (look at his research at
> Stanford).
> The name of the book is "Spark: The Definitive Guide"; it's the best ever
> book and introduction on Spark.
>
> I have been through a lot of documentation and at least 40 books on Spark,
> and nothing even comes close to this book. It also puts to rest many of the
> arguments around which language to choose.
>
> Thanks and Regards,
> Gourav Sengupta

Re: Learning Spark

Posted by Gourav Sengupta <go...@gmail.com>.
Okay, this is all something I would disagree with.

Dr. Matei Zaharia created Spark.
He and Bill Chambers recently wrote a book on Spark.
He is still the main thinking power behind Spark (look at his research at
Stanford).
The name of the book is "Spark: The Definitive Guide"; it's the best ever
book and introduction on Spark.

I have been through a lot of documentation and at least 40 books on Spark,
and nothing even comes close to this book. It also puts to rest many of the
arguments around which language to choose.

Thanks and Regards,
Gourav Sengupta

On Fri, Jul 5, 2019 at 11:55 AM Vikas Garg <sp...@gmail.com> wrote:

> Thanks!!!

Re: Learning Spark

Posted by Vikas Garg <sp...@gmail.com>.
Thanks!!!

On Fri, 5 Jul 2019 at 15:38, Chris Teoh <ch...@gmail.com> wrote:

> Scala is better suited to data engineering work. It also has better
> integration with other components like HBase, Kafka, etc.
>
> Python is great for data scientists as there are more data science
> libraries available in Python.

Re: Learning Spark

Posted by Chris Teoh <ch...@gmail.com>.
Scala is better suited to data engineering work. It also has better
integration with other components like HBase, Kafka, etc.

Python is great for data scientists as there are more data science
libraries available in Python.

On Fri., 5 Jul. 2019, 7:40 pm Vikas Garg, <sp...@gmail.com> wrote:

> Is there any disadvantage to using Python? I have gone through multiple
> articles which say that Python has advantages over Scala.
>
> Scala is super fast in comparison, but Python has more pre-built libraries
> and options for analytics.
>
> Still, should I go with Scala?

Re: Learning Spark

Posted by Vikas Garg <sp...@gmail.com>.
Is there any disadvantage to using Python? I have gone through multiple
articles which say that Python has advantages over Scala.

Scala is super fast in comparison, but Python has more pre-built libraries
and options for analytics.

Still, should I go with Scala?

On Fri, 5 Jul 2019 at 13:07, Kurt Fehlhauer <kf...@gmail.com> wrote:

> Since you are a data engineer, I would start by learning Scala. The parts
> of Scala you would need to learn are pretty basic. Start with the examples
> on the Spark website, which gives examples in multiple languages. Think of
> Scala as a typed version of Python. You will find that the error messages
> tend to be much more meaningful in Scala because that is the native
> language of Spark. If you don’t want to install the JVM and Scala, I
> highly recommend Databricks Community Edition as a place to start.

Re: Learning Spark

Posted by Kurt Fehlhauer <kf...@gmail.com>.
Since you are a data engineer, I would start by learning Scala. The parts of
Scala you would need to learn are pretty basic. Start with the examples on
the Spark website, which gives examples in multiple languages. Think of
Scala as a typed version of Python. You will find that the error messages
tend to be much more meaningful in Scala because that is the native
language of Spark. If you don’t want to install the JVM and Scala, I
highly recommend Databricks Community Edition as a place to start.

On Thu, Jul 4, 2019 at 11:22 PM Vikas Garg <sp...@gmail.com> wrote:

> I am currently working as a data engineer, mostly with Power BI and SSIS
> (an ETL tool). For learning purposes, I have set up PySpark and am able to
> run queries through Spark against a multi-node cluster database (I am using
> Vertica and will later move on to HDFS or SQL Server).
>
> I also have good knowledge of Python.

Re: Learning Spark

Posted by Vikas Garg <sp...@gmail.com>.
I am currently working as a data engineer, mostly with Power BI and SSIS
(an ETL tool). For learning purposes, I have set up PySpark and am able to
run queries through Spark against a multi-node cluster database (I am using
Vertica and will later move on to HDFS or SQL Server).

I also have good knowledge of Python.

On Fri, 5 Jul 2019 at 10:32, Kurt Fehlhauer <kf...@gmail.com> wrote:

> Are you a data scientist or data engineer?

Re: Learning Spark

Posted by Kurt Fehlhauer <kf...@gmail.com>.
Are you a data scientist or data engineer?


On Thu, Jul 4, 2019 at 10:34 PM Vikas Garg <sp...@gmail.com> wrote:

> Hi,
>
> I am a new Spark learner. Can someone guide me on a strategy for
> getting expertise in PySpark?
>
> Thanks!!!
>