Posted to user@spark.apache.org by Mich Talebzadeh <mi...@gmail.com> on 2020/05/20 10:58:48 UTC

Unit testing Spark/Scala code with Mockito

Hi,

I have a Spark job that reads an XML file from HDFS, processes it, and ports
the data to Hive tables: one good table and one exception table.

The code itself works fine. I need to create unit tests for it with Mockito
<https://www.vogella.com/tutorials/Mockito/article.html>. A unit
test should test functionality in isolation, and side effects from other
classes or the system should be eliminated where possible. There are
basically three classes:


   1. Class A reads the XML file and creates DF1 from it, plus DF2 on top of
   DF1. Test data for the XML file is already created.
   2. Class B reads DF2 and posts the correct data through a TempView and Spark
   SQL to the underlying Hive table.
   3. Class C reads DF2 and posts the exception data, again through a TempView
   and Spark SQL, to the underlying Hive exception table.

For the test cases covering Class B and Class C, I would like to know what
Mockito pattern should be used.

Thanks,

Mich




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

Re: Unit testing Spark/Scala code with Mockito

Posted by ZHANG Wei <we...@outlook.com>.
AFAICT, it depends on the testing goal: unit test, integration test or E2E
test.

For a unit test, you mostly test an individual class or its methods.
Mockito can help mock and verify the instances or methods it depends on.
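For the Class B / Class C question, one common shape is to put the Hive write behind a small trait so the test can substitute a test double. This is only a minimal sketch: `HiveWriter`, `GoodRowSink`, and the string row type are hypothetical names, not taken from the original code, and the Mockito calls are shown in a comment while a dependency-free hand-written stub plays the same role below.

```scala
// Hypothetical seam: the Hive write hides behind a trait so that the
// Class B logic can be tested without any running Hive service.
trait HiveWriter {
  def write(table: String, rows: Seq[String]): Unit
}

// Stand-in for Class B: routes the valid rows to the "good" table.
class GoodRowSink(writer: HiveWriter) {
  def post(rows: Seq[String]): Unit =
    writer.write("good_table", rows.filter(_.nonEmpty))
}

// With Mockito, the test double would come from the library instead:
//   val writer = org.mockito.Mockito.mock(classOf[HiveWriter])
//   new GoodRowSink(writer).post(Seq("a", "", "b"))
//   org.mockito.Mockito.verify(writer).write("good_table", Seq("a", "b"))
// Below, a hand-written recording stub demonstrates the same idea with
// no dependencies: it records each call for later inspection.
class RecordingWriter extends HiveWriter {
  var calls: List[(String, Seq[String])] = Nil
  def write(table: String, rows: Seq[String]): Unit =
    calls = calls :+ ((table, rows))
}

val writer = new RecordingWriter
new GoodRowSink(writer).post(Seq("a", "", "b"))
```

The point either way is the same: the test asserts on what was *passed to* the writer, never on the state of a real Hive table.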

For an integration test, some Spark testing helper methods can set up the
environment, such as `runInterpreter`[1] for running code in the REPL. The
data source can be mocked with `Seq(...).toDS()` or by reading a local
file; there is no need to access a Hive service.

For an E2E test, HDFS and Hive (normally, local mini versions) have to be
set up to serve the real operations from Spark.

Just my 2 cents.

-- 
Cheers,
-z
[1] https://github.com/apache/spark/blob/a06768ec4d5059d1037086fe5495e5d23cde514b/repl/src/test/scala/org/apache/spark/repl/ReplSuite.scala#L49

On Wed, 20 May 2020 15:36:06 +0100
Mich Talebzadeh <mi...@gmail.com> wrote:

> On a second note, with regard to Spark reads and writes: as I understand it,
> unit tests are not meant to test database connections. That should be done in
> integration tests, which check that all the parts work together. Unit tests
> are just meant to test the functional logic, not Spark's ability to read
> from a database.
> 
> I would have thought that if connectivity through a specific third-party
> tool (in my case, reading the XML file using the Databricks jar) is required,
> then this should be checked in the Read-Evaluate-Print-Loop (REPL)
> environment of the Spark shell, by writing some code to quickly establish
> whether the API successfully reads the XML file.
> 
> Does this assertion sound correct?
> 
> thanks,
> 
> Mich
> 

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Unit testing Spark/Scala code with Mockito

Posted by Mich Talebzadeh <mi...@gmail.com>.
On a second note, with regard to Spark reads and writes: as I understand it,
unit tests are not meant to test database connections. That should be done in
integration tests, which check that all the parts work together. Unit tests
are just meant to test the functional logic, not Spark's ability to read
from a database.

I would have thought that if connectivity through a specific third-party
tool (in my case, reading the XML file using the Databricks jar) is required,
then this should be checked in the Read-Evaluate-Print-Loop (REPL)
environment of the Spark shell, by writing some code to quickly establish
whether the API successfully reads the XML file.

Does this assertion sound correct?

thanks,

Mich



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw








