Posted to user@spark.apache.org by Steve Pruitt <bp...@opentext.com> on 2018/05/21 12:24:40 UTC

testing frameworks

Hi,

Can anyone recommend testing frameworks suitable for Spark jobs?  Something that can be integrated into a CI tool would be great.

Thanks.


Re: testing frameworks

Posted by Spico Florin <sp...@gmail.com>.
Hello!
  Thank you very much for your helpful answer and for the very good work
done on spark-testing-base. I managed to perform unit testing with the
spark-testing-base library by following the provided article, and also took
inspiration from
https://github.com/holdenk/spark-testing-base/blob/master/src/test/1.3/java/com/holdenkarau/spark/testing/SampleJavaRDDTest.java


I had some concerns about how to compare the RDDs that come from a
DataFrame with those that come from the jsc().parallelize() method.

My test workflow is as follows:
1. Read the data from a Parquet file as a DataFrame
2. Convert the DataFrame with toJavaRDD()
3. Perform some mapping on the JavaRDD
4. Check whether the resulting mapped RDD is equal to the expected one
(retrieved from a text file)

I performed the above test with the following code snippet:

JavaRDD<MyCustomer> expected = jsc().parallelize(input_from_text_file);
SparkSession spark = SparkSession.builder().getOrCreate();

JavaRDD<Row> input =
    spark.read().parquet("src/test/resources/test_data.parquet").toJavaRDD();

JavaRDD<MyCustomer> result = MyDriver.convertToMyCustomerData(input);
JavaRDDComparisons.assertRDDEquals(expected, result);

The above test failed, even though the data is the same. By debugging the
code, I observed that the data that came from the DataFrame did not have
the same order as the data that came from jsc().parallelize(text_file).

So, I suppose that the issue came from the fact that the SparkSession and
jsc() don't share the same SparkContext (there is a warning about this when
running the program).

Therefore I came to the solution of using the same jsc() for both the
expected and the result RDDs. With this solution the assertion succeeded as
expected.

List<Row> df =
    spark.read().parquet("src/test/resources/test_data.parquet").toJavaRDD().collect();
JavaRDD<Row> input = jsc().parallelize(df);

JavaRDD<MyCustomer> result = MyDriver.convertToMyCustomerData(input);
JavaRDDComparisons.assertRDDEquals(expected, result);
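
A variant I am also considering, continuing from the snippet above, is to
collect and sort both sides and compare plain lists, so the assertion does
not depend on RDD ordering at all (sketch only, assuming MyCustomer
implements Comparable<MyCustomer> and a value-based equals()/hashCode()):

// Sketch: compare the two datasets as sorted lists instead of as RDDs, so
// partition ordering differences cannot make the assertion fail.
// Assumes java.util.* and org.junit.Assert.assertEquals are imported.
List<MyCustomer> expectedList = new ArrayList<>(expected.collect());
List<MyCustomer> resultList = new ArrayList<>(result.collect());

// Requires MyCustomer to implement Comparable<MyCustomer> (e.g. by customer id).
Collections.sort(expectedList);
Collections.sort(resultList);

assertEquals(expectedList, resultList);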


My questions are:
1. What is the best way to compare RDDs when one is built from a DataFrame
and the other is obtained via jsc().parallelize()?
2. Is the above solution a suitable one?

I look forward to your answers.

Regards,
  Florin







On Wed, May 30, 2018 at 3:11 PM, Holden Karau <ho...@pigscanfly.ca> wrote:

> So Jessie has an excellent blog post on how to use it with Java
> applications -
> http://www.jesse-anderson.com/2016/04/unit-testing-spark-with-java/
>
> On Wed, May 30, 2018 at 4:14 AM Spico Florin <sp...@gmail.com>
> wrote:
>
>> Hello!
>>   I'm also looking for unit testing spark Java application. I've seen the
>> great work done in  spark-testing-base but it seemed to me that I could
>> not use for Spark Java applications.
>> Only spark scala applications are supported?
>> Thanks.
>> Regards,
>>  Florin
>>
>> On Wed, May 23, 2018 at 8:07 AM, umargeek <um...@gmail.com>
>> wrote:
>>
>>> Hi Steve,
>>>
>>> you can try out pytest-spark plugin if your writing programs using
>>> pyspark
>>> ,please find below link for reference.
>>>
>>> https://github.com/malexer/pytest-spark
>>> <https://github.com/malexer/pytest-spark>
>>>
>>> Thanks,
>>> Umar
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>>
>>>
>> --
> Twitter: https://twitter.com/holdenkarau
>

Re: testing frameworks

Posted by Holden Karau <ho...@pigscanfly.ca>.
So Jesse has an excellent blog post on how to use it with Java
applications -
http://www.jesse-anderson.com/2016/04/unit-testing-spark-with-java/
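
If it helps, a bare-bones JUnit test against spark-testing-base from Java can
look roughly like the sketch below (assumes the spark-testing-base test
dependency is on the classpath and the Spark 2.x Java API; the word-splitting
logic is just a stand-in for your own code under test):

import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.junit.Test;

import com.holdenkarau.spark.testing.JavaRDDComparisons;
import com.holdenkarau.spark.testing.SharedJavaSparkContext;

// Extending SharedJavaSparkContext gives the test a shared jsc(), so a
// SparkContext is not started and stopped for every test method.
public class WordSplitTest extends SharedJavaSparkContext implements Serializable {

  @Test
  public void splitsLinesIntoWords() {
    JavaRDD<String> input =
        jsc().parallelize(Arrays.asList("hello world", "hello spark"));

    // Stand-in for the code under test: split each line on whitespace.
    JavaRDD<String> result =
        input.flatMap(line -> Arrays.asList(line.split(" ")).iterator());

    JavaRDD<String> expected =
        jsc().parallelize(Arrays.asList("hello", "world", "hello", "spark"));

    // Fails the test if the contents of the two RDDs differ.
    JavaRDDComparisons.assertRDDEquals(expected, result);
  }
}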

On Wed, May 30, 2018 at 4:14 AM Spico Florin <sp...@gmail.com> wrote:

> Hello!
>   I'm also looking for unit testing spark Java application. I've seen the
> great work done in  spark-testing-base but it seemed to me that I could
> not use for Spark Java applications.
> Only spark scala applications are supported?
> Thanks.
> Regards,
>  Florin
>
> On Wed, May 23, 2018 at 8:07 AM, umargeek <um...@gmail.com>
> wrote:
>
>> Hi Steve,
>>
>> you can try out pytest-spark plugin if your writing programs using pyspark
>> ,please find below link for reference.
>>
>> https://github.com/malexer/pytest-spark
>> <https://github.com/malexer/pytest-spark>
>>
>> Thanks,
>> Umar
>>
>>
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>
>>
> --
Twitter: https://twitter.com/holdenkarau

Re: testing frameworks

Posted by Spico Florin <sp...@gmail.com>.
Hello!
  I'm also looking into unit testing a Spark Java application. I've seen the
great work done in spark-testing-base, but it seemed to me that I could not
use it for Spark Java applications.
Are only Spark Scala applications supported?
Thanks.
Regards,
 Florin

On Wed, May 23, 2018 at 8:07 AM, umargeek <um...@gmail.com>
wrote:

> Hi Steve,
>
> you can try out pytest-spark plugin if your writing programs using pyspark
> ,please find below link for reference.
>
> https://github.com/malexer/pytest-spark
> <https://github.com/malexer/pytest-spark>
>
> Thanks,
> Umar
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>

Re: testing frameworks

Posted by umargeek <um...@gmail.com>.
Hi Steve,

You can try out the pytest-spark plugin if you're writing your programs in
PySpark; please find the link below for reference.

https://github.com/malexer/pytest-spark

Thanks,
Umar



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: testing frameworks

Posted by Marco Mistroni <mm...@gmail.com>.
Thanks, Hichame, I will follow up on that.

Is anyone on this list using the Python version of spark-testing-base? It
seems there's support for DataFrames...

Thanks in advance and regards,
 Marco

On Sun, Feb 3, 2019 at 9:58 PM Hichame El Khalfi <hi...@elkhalfi.com>
wrote:

> Hi,
> You can use pysparkling => https://github.com/svenkreiss/pysparkling
> This lib is useful in case you have RDD.
>
> Hope this helps,
>
> Hichame
>
> *From:* mmistroni@gmail.com
> *Sent:* February 3, 2019 4:42 PM
> *To:* radams217@gmail.com
> *Cc:* lalle@mapflat.com; bpruitt@opentext.com; user@spark.apache.org
> *Subject:* Re: testing frameworks
>
> Hi
>  sorry to resurrect this thread
> Any spark libraries for testing code in pyspark?  the github code above
> seems related to Scala
> following links in the original threads (and also LMGFY) i found out
> pytest-spark · PyPI <https://pypi.org/project/pytest-spark/>
>
> w/kindest regards
>  Marco
>
>
>
>
> On Tue, Jun 12, 2018 at 6:44 PM Ryan Adams <ra...@gmail.com> wrote:
>
>> We use spark testing base for unit testing.  These tests execute on a
>> very small amount of data that covers all paths the code can take (or most
>> paths anyway).
>>
>> https://github.com/holdenk/spark-testing-base
>>
>> For integration testing we use automated routines to ensure that
>> aggregate values match an aggregate baseline.
>>
>> Ryan
>>
>> Ryan Adams
>> radams217@gmail.com
>>
>> On Tue, Jun 12, 2018 at 11:51 AM, Lars Albertsson <la...@mapflat.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I wrote this answer to the same question a couple of years ago:
>>> https://www.mail-archive.com/user%40spark.apache.org/msg48032.html
>>>
>>> I have made a couple of presentations on the subject. Slides and video
>>> are linked on this page: http://www.mapflat.com/presentations/
>>>
>>> You can find more material in this list of resources:
>>> http://www.mapflat.com/lands/resources/reading-list
>>>
>>> Happy testing!
>>>
>>> Regards,
>>>
>>>
>>>
>>> Lars Albertsson
>>> Data engineering consultant
>>> www.mapflat.com
>>> https://twitter.com/lalleal
>>> +46 70 7687109
>>> Calendar: http://www.mapflat.com/calendar
>>>
>>>
>>> On Mon, May 21, 2018 at 2:24 PM, Steve Pruitt <bp...@opentext.com>
>>> wrote:
>>> > Hi,
>>> >
>>> >
>>> >
>>> > Can anyone recommend testing frameworks suitable for Spark jobs.
>>> Something
>>> > that can be integrated into a CI tool would be great.
>>> >
>>> >
>>> >
>>> > Thanks.
>>> >
>>> >
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>>
>>>
>>

Re: testing frameworks

Posted by Hichame El Khalfi <hi...@elkhalfi.com>.
Hi,
You can use pysparkling => https://github.com/svenkreiss/pysparkling
This lib is useful in case you have RDDs.

Hope this helps,

Hichame

From: mmistroni@gmail.com
Sent: February 3, 2019 4:42 PM
To: radams217@gmail.com
Cc: lalle@mapflat.com; bpruitt@opentext.com; user@spark.apache.org
Subject: Re: testing frameworks


Hi
 sorry to resurrect this thread
Any spark libraries for testing code in pyspark?  the github code above seems related to Scala
following links in the original threads (and also LMGFY) i found out
pytest-spark · PyPI <https://pypi.org/project/pytest-spark/>


w/kindest regards
 Marco




On Tue, Jun 12, 2018 at 6:44 PM Ryan Adams <ra...@gmail.com> wrote:
We use spark testing base for unit testing.  These tests execute on a very small amount of data that covers all paths the code can take (or most paths anyway).

https://github.com/holdenk/spark-testing-base

For integration testing we use automated routines to ensure that aggregate values match an aggregate baseline.

Ryan

Ryan Adams
radams217@gmail.com

On Tue, Jun 12, 2018 at 11:51 AM, Lars Albertsson <la...@mapflat.com> wrote:
Hi,

I wrote this answer to the same question a couple of years ago:
https://www.mail-archive.com/user%40spark.apache.org/msg48032.html

I have made a couple of presentations on the subject. Slides and video
are linked on this page: http://www.mapflat.com/presentations/

You can find more material in this list of resources:
http://www.mapflat.com/lands/resources/reading-list

Happy testing!

Regards,



Lars Albertsson
Data engineering consultant
www.mapflat.com
https://twitter.com/lalleal
+46 70 7687109
Calendar: http://www.mapflat.com/calendar


On Mon, May 21, 2018 at 2:24 PM, Steve Pruitt <bp...@opentext.com> wrote:
> Hi,
>
>
>
> Can anyone recommend testing frameworks suitable for Spark jobs.  Something
> that can be integrated into a CI tool would be great.
>
>
>
> Thanks.
>
>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org



Re: testing frameworks

Posted by Marco Mistroni <mm...@gmail.com>.
Hi,
 sorry to resurrect this thread.
Are there any Spark libraries for testing code in PySpark? The GitHub code
above seems related to Scala.
Following links in the original threads (and also LMGFY), I found
pytest-spark · PyPI <https://pypi.org/project/pytest-spark/>

w/kindest regards
 Marco




On Tue, Jun 12, 2018 at 6:44 PM Ryan Adams <ra...@gmail.com> wrote:

> We use spark testing base for unit testing.  These tests execute on a very
> small amount of data that covers all paths the code can take (or most paths
> anyway).
>
> https://github.com/holdenk/spark-testing-base
>
> For integration testing we use automated routines to ensure that aggregate
> values match an aggregate baseline.
>
> Ryan
>
> Ryan Adams
> radams217@gmail.com
>
> On Tue, Jun 12, 2018 at 11:51 AM, Lars Albertsson <la...@mapflat.com>
> wrote:
>
>> Hi,
>>
>> I wrote this answer to the same question a couple of years ago:
>> https://www.mail-archive.com/user%40spark.apache.org/msg48032.html
>>
>> I have made a couple of presentations on the subject. Slides and video
>> are linked on this page: http://www.mapflat.com/presentations/
>>
>> You can find more material in this list of resources:
>> http://www.mapflat.com/lands/resources/reading-list
>>
>> Happy testing!
>>
>> Regards,
>>
>>
>>
>> Lars Albertsson
>> Data engineering consultant
>> www.mapflat.com
>> https://twitter.com/lalleal
>> +46 70 7687109
>> Calendar: http://www.mapflat.com/calendar
>>
>>
>> On Mon, May 21, 2018 at 2:24 PM, Steve Pruitt <bp...@opentext.com>
>> wrote:
>> > Hi,
>> >
>> >
>> >
>> > Can anyone recommend testing frameworks suitable for Spark jobs.
>> Something
>> > that can be integrated into a CI tool would be great.
>> >
>> >
>> >
>> > Thanks.
>> >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>
>>
>

Re: testing frameworks

Posted by Ryan Adams <ra...@gmail.com>.
We use spark-testing-base for unit testing.  These tests execute on a very
small amount of data that covers all paths the code can take (or most
paths, anyway).

https://github.com/holdenk/spark-testing-base

For integration testing we use automated routines to ensure that aggregate
values match an aggregate baseline.
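
As a rough illustration of the kind of check those routines perform (a
sketch only; the output path, column name, and baseline value below are
made up for the example, and the revenue column is assumed to be a double):

import static org.junit.Assert.assertEquals;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;
import org.junit.Test;

public class DailyRevenueBaselineTest {

  @Test
  public void totalRevenueMatchesBaseline() {
    SparkSession spark = SparkSession.builder()
        .master("local[2]")
        .appName("aggregate-baseline-check")
        .getOrCreate();

    // Output written by the pipeline under test (illustrative path).
    Dataset<Row> output = spark.read().parquet("target/it-output/daily_revenue");

    // Aggregate the output and compare it to a known baseline value,
    // with a small tolerance for floating-point arithmetic.
    double totalRevenue =
        output.agg(functions.sum("revenue")).first().getDouble(0);
    assertEquals(1234567.89, totalRevenue, 0.01);

    spark.stop();
  }
}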

Ryan

Ryan Adams
radams217@gmail.com

On Tue, Jun 12, 2018 at 11:51 AM, Lars Albertsson <la...@mapflat.com> wrote:

> Hi,
>
> I wrote this answer to the same question a couple of years ago:
> https://www.mail-archive.com/user%40spark.apache.org/msg48032.html
>
> I have made a couple of presentations on the subject. Slides and video
> are linked on this page: http://www.mapflat.com/presentations/
>
> You can find more material in this list of resources:
> http://www.mapflat.com/lands/resources/reading-list
>
> Happy testing!
>
> Regards,
>
>
>
> Lars Albertsson
> Data engineering consultant
> www.mapflat.com
> https://twitter.com/lalleal
> +46 70 7687109
> Calendar: http://www.mapflat.com/calendar
>
>
> On Mon, May 21, 2018 at 2:24 PM, Steve Pruitt <bp...@opentext.com>
> wrote:
> > Hi,
> >
> >
> >
> > Can anyone recommend testing frameworks suitable for Spark jobs.
> Something
> > that can be integrated into a CI tool would be great.
> >
> >
> >
> > Thanks.
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>

Re: testing frameworks

Posted by Lars Albertsson <la...@mapflat.com>.
Hi,

I wrote this answer to the same question a couple of years ago:
https://www.mail-archive.com/user%40spark.apache.org/msg48032.html

I have made a couple of presentations on the subject. Slides and video
are linked on this page: http://www.mapflat.com/presentations/

You can find more material in this list of resources:
http://www.mapflat.com/lands/resources/reading-list

Happy testing!

Regards,



Lars Albertsson
Data engineering consultant
www.mapflat.com
https://twitter.com/lalleal
+46 70 7687109
Calendar: http://www.mapflat.com/calendar


On Mon, May 21, 2018 at 2:24 PM, Steve Pruitt <bp...@opentext.com> wrote:
> Hi,
>
>
>
> Can anyone recommend testing frameworks suitable for Spark jobs.  Something
> that can be integrated into a CI tool would be great.
>
>
>
> Thanks.
>
>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: [EXTERNAL] - Re: testing frameworks

Posted by Joel D <ga...@gmail.com>.
We’ve developed our own testing framework consisting of different areas of
checking, sometimes providing expected data and comparing it with the
resultant data from the data object.
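
One simple form that comparison can take is asserting that the resultant
DataFrame and the expected DataFrame contain the same rows (a sketch,
assuming both sides share the same schema; note that except() is a set
difference, so differing duplicate counts would not be caught):

import static org.junit.Assert.assertEquals;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class DatasetAssertions {

  // Two DataFrames pass this check if neither contains a row the other lacks.
  // except() ignores row ordering, which suits unordered Spark results.
  public static void assertSameRows(Dataset<Row> expected, Dataset<Row> result) {
    assertEquals(0L, expected.except(result).count());
    assertEquals(0L, result.except(expected).count());
  }
}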

Cheers.

On Tue, May 22, 2018 at 1:48 PM Steve Pruitt <bp...@opentext.com> wrote:

> Something more on the lines of integration I believe.  Run one or more
> Spark jobs and verify the output results.  If this makes sense.
>
>
>
> I am very new to the world of Spark.  We want to include pipeline testing
> from the get go.  I will check out spark-testing-base.
>
>
>
>
>
> Thanks.
>
>
>
> *From:* Holden Karau [mailto:holden@pigscanfly.ca]
> *Sent:* Monday, May 21, 2018 11:32 AM
> *To:* Steve Pruitt <bp...@opentext.com>
> *Cc:* user@spark.apache.org
> *Subject:* [EXTERNAL] - Re: testing frameworks
>
>
>
> So I’m biased as the author of spark-testing-base but I think it’s pretty
> ok. Are you looking for unit or integration or something else?
>
>
>
> On Mon, May 21, 2018 at 5:24 AM Steve Pruitt <bp...@opentext.com> wrote:
>
> Hi,
>
>
>
> Can anyone recommend testing frameworks suitable for Spark jobs.
> Something that can be integrated into a CI tool would be great.
>
>
>
> Thanks.
>
>
>
> --
>
> Twitter: https://twitter.com/holdenkarau
>

RE: [EXTERNAL] - Re: testing frameworks

Posted by Steve Pruitt <bp...@opentext.com>.
Something more along the lines of integration, I believe: run one or more Spark jobs and verify the output results, if that makes sense.

I am very new to the world of Spark.  We want to include pipeline testing from the get-go.  I will check out spark-testing-base.


Thanks.

From: Holden Karau [mailto:holden@pigscanfly.ca]
Sent: Monday, May 21, 2018 11:32 AM
To: Steve Pruitt <bp...@opentext.com>
Cc: user@spark.apache.org
Subject: [EXTERNAL] - Re: testing frameworks

So I’m biased as the author of spark-testing-base but I think it’s pretty ok. Are you looking for unit or integration or something else?

On Mon, May 21, 2018 at 5:24 AM Steve Pruitt <bp...@opentext.com> wrote:
Hi,

Can anyone recommend testing frameworks suitable for Spark jobs.  Something that can be integrated into a CI tool would be great.

Thanks.

--
Twitter: https://twitter.com/holdenkarau

Re: testing frameworks

Posted by Holden Karau <ho...@pigscanfly.ca>.
So I’m biased as the author of spark-testing-base but I think it’s pretty
ok. Are you looking for unit or integration or something else?

On Mon, May 21, 2018 at 5:24 AM Steve Pruitt <bp...@opentext.com> wrote:

> Hi,
>
>
>
> Can anyone recommend testing frameworks suitable for Spark jobs.
> Something that can be integrated into a CI tool would be great.
>
>
>
> Thanks.
>
>
>
-- 
Twitter: https://twitter.com/holdenkarau