Posted to dev@spark.apache.org by Роман Ткаленко <tk...@gmail.com> on 2013/10/12 22:48:26 UTC

Test coverage of Spark

Hello.
I'm trying to dive into Spark's sources on a deeper-than-mere-glance level,
and I find that starting by writing unit tests is a good way to do it. So,
basically, I'm wondering if there are areas to which I could specifically
apply my enthusiasm, i.e. are there some parts with no or insufficient test
coverage for which I could write some tests?
I'm wondering as well about the state of the Apache-hosted JIRA for Spark - I
currently can't see any entries in there. Should I look for them in the GitHub
mirror or still in the previous JIRA instance at
http://spark-project.atlassian.net/?
Regards,
Roman.

Re: Test coverage of Spark

Posted by Matei Zaharia <ma...@gmail.com>.
Adding more tests to spark-perf is a good idea. It would be great if it covered some of the ML algorithms, for example. In addition, for correctness, the test suites in core can also be enhanced. In particular, I'd like to make sure we're testing all methods in the RDD API in all of Java, Python and Scala -- we recently found some methods that don't quite work in Java, for example.
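
As a rough illustration of checking API coverage across language bindings, one could start by diffing the public method names that each binding exposes. The sketch below is only that: ReferenceApi and OtherBinding are hypothetical stand-ins, not Spark's actual RDD classes, and comparing names says nothing about whether a method behaves correctly, so it complements real tests rather than replacing them.

```python
import inspect


def public_methods(cls):
    """Return the names of public callables defined on or inherited by cls."""
    return {
        name
        for name, member in inspect.getmembers(cls, callable)
        if not name.startswith("_")
    }


def missing_methods(reference, binding):
    """Names that the reference API exposes but the binding does not."""
    return sorted(public_methods(reference) - public_methods(binding))


# Hypothetical stand-ins for one API exposed through two language bindings.
class ReferenceApi:
    def map(self, f): ...
    def filter(self, f): ...
    def count(self): ...


class OtherBinding:
    def map(self, f): ...
    def count(self): ...


if __name__ == "__main__":
    # Lists the methods that still need a binding (and a test).
    print(missing_methods(ReferenceApi, OtherBinding))
```

A list like this could then serve as a checklist of methods that still need a test in each language.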

Our JIRA is currently still at https://spark-project.atlassian.net/secure/MyJiraHome.jspa but it's hopefully going to be imported into Apache really soon, so I'd recommend holding off on creating new issues for a bit to see if the import succeeds. This is the task to import it into Apache: https://issues.apache.org/jira/browse/INFRA-6419.

Matei

On Oct 12, 2013, at 2:44 PM, Christopher Nguyen <ct...@adatao.com> wrote:

> Perfect. This is a great start on what I'm looking for.
> 
> --
> Christopher T. Nguyen
> Co-founder & CEO, Adatao <http://adatao.com>
> linkedin.com/in/ctnguyen
> 


Re: Test coverage of Spark

Posted by Christopher Nguyen <ct...@adatao.com>.
Perfect. This is a great start on what I'm looking for.

--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen



On Sat, Oct 12, 2013 at 2:31 PM, Mark Hamstra <ma...@clearstorydata.com> wrote:

> There is also spark-perf <https://github.com/amplab/spark-perf>.

Re: Test coverage of Spark

Posted by Mark Hamstra <ma...@clearstorydata.com>.
There is also spark-perf <https://github.com/amplab/spark-perf>.


On Sat, Oct 12, 2013 at 2:22 PM, Christopher Nguyen <ct...@adatao.com> wrote:

> Roman, an area that I think (a) would have high impact and (b) is relatively
> poorly covered is performance analysis. I'm sure most teams are doing
> this internally at their respective companies, but there is no shared code
> base or shared wisdom about what we're finding and improving.
>
> For example, consider the task of loading a table from disk into memory with
> Shark. We're getting conflicting data about how much of this is CPU-bound
> vs. I/O-bound. Our effort to track this down should be shareable somehow, and
> would benefit from others' findings. Of course this depends on the
> particular configuration, but there is a lot of test harness code and scripts
> that can be shared. And individual findings, even if (especially if) they are
> conflicting, are very valuable if well documented.
>
> There is a Benchmark effort covered here
> https://amplab.cs.berkeley.edu/benchmark/, but it addresses a slightly
> different goal. You could consider this Perf-Analysis as part of that, or
> as its own effort.
>
> This may be more than you were looking to own, but given your stated
> enthusiasm :) I want to throw the idea out there.
>
> --
> Christopher T. Nguyen
> Co-founder & CEO, Adatao <http://adatao.com>
> linkedin.com/in/ctnguyen

Re: Test coverage of Spark

Posted by Christopher Nguyen <ct...@adatao.com>.
Roman, an area that I think (a) would have high impact and (b) is relatively
poorly covered is performance analysis. I'm sure most teams are doing
this internally at their respective companies, but there is no shared code
base or shared wisdom about what we're finding and improving.

For example, consider the task of loading a table from disk into memory with
Shark. We're getting conflicting data about how much of this is CPU-bound
vs. I/O-bound. Our effort to track this down should be shareable somehow, and
would benefit from others' findings. Of course this depends on the
particular configuration, but there is a lot of test harness code and scripts
that can be shared. And individual findings, even if (especially if) they are
conflicting, are very valuable if well documented.
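
As a minimal sketch of the kind of harness code that could be shared, the snippet below compares wall-clock time with process CPU time for a single task. A CPU fraction close to 1 suggests the task is CPU-bound, while a much lower value suggests it spends most of its time waiting, for example on I/O. The load_table function is a hypothetical stand-in for a real table load, and this measures only the local Python process, not a Shark or Spark cluster, so it illustrates the measurement idea rather than a definitive method.

```python
import os
import tempfile
import time


def profile(task):
    """Run task() once and return (wall_seconds, cpu_seconds, cpu_fraction)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    task()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    fraction = cpu / wall if wall > 0 else 0.0
    return wall, cpu, fraction


def load_table(path):
    """Hypothetical stand-in for a table load: read a file and parse each row."""
    with open(path) as f:
        return [line.rstrip("\n").split(",") for line in f]


if __name__ == "__main__":
    # Write a small throwaway CSV file to load.
    with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
        for i in range(200_000):
            f.write(f"{i},{i * 2},{i % 7}\n")
        path = f.name
    try:
        wall, cpu, fraction = profile(lambda: load_table(path))
        print(f"wall={wall:.3f}s cpu={cpu:.3f}s cpu_fraction={fraction:.2f}")
    finally:
        os.remove(path)
```

Recording these numbers together with the configuration they were measured on would make conflicting findings easier to compare.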

There is a Benchmark effort covered here
https://amplab.cs.berkeley.edu/benchmark/, but it addresses a slightly
different goal. You could consider this Perf-Analysis as part of that, or
as its own effort.

This may be more than you were looking to own, but given your stated
enthusiasm :) I want to throw the idea out there.

--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen



On Sat, Oct 12, 2013 at 1:48 PM, Роман Ткаленко <tk...@gmail.com> wrote:

> Hello.
> I'm trying to dive into Spark's sources on a deeper-than-mere-glance level,
> and I find that starting by writing unit tests is a good way to do it. So,
> basically, I'm wondering if there are areas to which I could specifically
> apply my enthusiasm, i.e. are there some parts with no or insufficient test
> coverage for which I could write some tests?
> I'm wondering as well about the state of the Apache-hosted JIRA for Spark - I
> currently can't see any entries in there. Should I look for them in the GitHub
> mirror or still in the previous JIRA instance at
> http://spark-project.atlassian.net/?
> Regards,
> Roman.
>