Posted to dev@spark.apache.org by Ryan Williams <ry...@gmail.com> on 2014/11/30 23:39:28 UTC

Spurious test failures, testing best practices

In the course of trying to make contributions to Spark, I have had a lot of
trouble running Spark's tests successfully. The main pain points I've
experienced are:

    1) frequent, spurious test failures
    2) high latency of running tests
    3) difficulty running specific tests in an iterative fashion

Here is an example series of failures that I encountered this weekend
(along with footnote links to the console output from each and
approximately how long each took):

- `./dev/run-tests` [1]: failure in BroadcastSuite that I've not seen
before.
- `mvn '-Dsuites=*BroadcastSuite*' test` [2]: same failure.
- `mvn '-Dsuites=*BroadcastSuite* Unpersisting' test` [3]: BroadcastSuite
passed, but scala compiler crashed on the "catalyst" project.
- `mvn clean`: some attempts to run earlier commands (that previously
didn't crash the compiler) all result in the same compiler crash. Previous
discussion on this list implies this can only be solved by a `mvn clean`
[4].
- `mvn '-Dsuites=*BroadcastSuite*' test` [5]: immediately post-clean,
BroadcastSuite can't run because assembly is not built.
- `./dev/run-tests` again [6]: pyspark tests fail, some messages about
version mismatches and python 2.6. The machine this ran on has python 2.7,
so I don't know what that's about.
- `./dev/run-tests` again [7]: "too many open files" errors in several
tests. `ulimit -a` shows a maximum of 4864 open files. Apparently this is
not enough, but only some of the time? I increased it to 8192 and tried
again.
- `./dev/run-tests` again [8]: same pyspark errors as before. This seems to
be the issue from SPARK-3867 [9], which was supposedly fixed on October 14;
not sure how I'm seeing it now. In any case, switched to Python 2.6 and
installed unittest2, and python/run-tests seems to be unblocked.
- `./dev/run-tests` again [10]: finally passes!

This was on a Spark checkout at ceb6281 (ToT Friday), with a few trivial
changes added on (that I wanted to test before sending out a PR), on a
MacBook running OS X Yosemite (10.10.1), Java 1.8 and Maven 3.2.3 [11].

Meanwhile, on a linux 2.6.32 / CentOS 6.4 machine, I tried similar commands
from the same repo state:

- `./dev/run-tests` [12]: YarnClusterSuite failure.
- `./dev/run-tests` [13]: same YarnClusterSuite failure. I know I've seen
this one before on this machine and am guessing it actually occurs every
time.
- `./dev/run-tests` [14]: to be sure, I reverted my changes, ran one more
time from ceb6281, and saw the same failure.

This was with java 1.7 and maven 3.2.3 [15]. In one final attempt to narrow
down the linux YarnClusterSuite failure, I ran `./dev/run-tests` on my mac,
from ceb6281, with java 1.7 (instead of 1.8, which the previous runs used),
and it passed [16], so the failure seems specific to my linux machine/arch.

At this point I believe that my changes don't break any tests (the
YarnClusterSuite failure on my linux presumably not being... "real"), and I
am ready to send out a PR. Whew!

However, reflecting on the 5 or 6 distinct failure-modes represented above:

- One of them ("too many open files") is something I can (and did,
hopefully) fix once and for all. It cost me about an hour this time (the
approximate running time of ./dev/run-tests) and a few hours other times when I didn't
fully understand/fix it. It doesn't happen deterministically (why?), but
does happen somewhat frequently to people, having been discussed on the
user list multiple times [17] and on SO [18]. Maybe some note in the
documentation advising people to check their ulimit makes sense?
- One of them (unittest2 must be installed for python 2.6) was supposedly
fixed upstream of the commits I tested here; I don't know why I'm still
running into it. This cost me a few hours of running `./dev/run-tests`
multiple times to see if it was transient, plus some time researching and
working around it.
- The original BroadcastSuite failure cost me a few hours and went away
before I'd even run `mvn clean`.
- A new incarnation of the sbt-compiler-crash phenomenon cost me a few
hours of running `./dev/run-tests` in different ways before deciding that,
as usual, there was no way around it and that I'd need to run `mvn clean`
and start running tests from scratch.
- The YarnClusterSuite failures on my linux box have cost me hours of
trying to figure out whether they're my fault. I've seen them many times
over the past weeks/months, plus or minus other failures that have come and
gone, and was especially befuddled by them when I was seeing a disjoint set
of reproducible failures on my mac [19] (the triaging of which involved
dozens of runs of `./dev/run-tests`).
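
The open-files limit mentioned above can be checked and raised per shell
session; a minimal sketch (the 8192 value is just the one that worked in the
runs above, not an official recommendation):

```shell
# Show the current soft limit on open file descriptors for this shell
ulimit -n

# Show the hard limit, which caps how far an unprivileged user can raise it
ulimit -H -n

# Attempt to raise the soft limit for this session; harmless if it fails
ulimit -n 8192 2>/dev/null || echo "could not raise limit (hard limit may be lower)"
```

Note that this only affects the current shell and its children; a persistent
fix means editing the system's limits configuration.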

While I'm interested in digging into each of these issues, I also want to
discuss the frequency with which I've run into issues like these. This is
unfortunately not the first time in recent months that I've spent days
playing spurious-test-failure whack-a-mole with a 60-90min dev/run-tests
iteration time, which is no fun! So I am wondering/thinking:

- Do other people experience this level of flakiness from spark tests?
- Do other people bother running dev/run-tests locally, or just let Jenkins
do it during the CR process?
- Needing to run a full assembly post-clean just to continue running one
specific test case feels especially wasteful, and the failure output when
naively attempting to run a specific test without having built an assembly
jar is not always clear about what the issue is or how to fix it; even the
fact that certain tests require "building the world" is not something I
would have expected, and has cost me hours of confusion.
    - Should a person running spark tests assume that they must build an
assembly JAR before running anything?
    - Are there some proper "unit" tests that are actually self-contained /
able to be run without building an assembly jar?
    - Can we better document/demarcate which tests have which dependencies?
    - Is there something finer-grained than building an assembly JAR that
is sufficient in some cases?
        - If so, can we document that?
        - If not, can we move to a world of finer-grained dependencies for
some of these?
- Leaving all of these spurious failures aside, the process of assembling
and testing a new JAR is not a quick one (40 and 60 mins for me typically,
respectively). I would guess that there are dozens (hundreds?) of people
who build a Spark assembly from various ToTs on any given day, and who all
wait on the exact same compilation / assembly steps to occur. Expanding on
the recent work to publish nightly snapshots [20], can we do a better job
caching/sharing compilation artifacts at a more granular level (pre-built
assembly JARs at each SHA? pre-built JARs per-maven-module, per-SHA? more
granular maven modules, plus the previous two?), or otherwise save some of
the considerable amount of redundant compilation work that I had to do over
the course of my odyssey this weekend?
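
To make the assembly dependency concrete, here is a guarded sketch of the
build-then-iterate-on-one-suite workflow described above (the mvn commands are
the ones already quoted in this thread; the guards just make the sketch safe
to run outside a Spark checkout):

```shell
# These commands assume a Spark source checkout with Maven on the PATH;
# outside that environment the sketch only prints a note.
if command -v mvn >/dev/null 2>&1 && [ -f pom.xml ]; then
  # Build everything once (including the assembly some suites require),
  # skipping tests
  mvn -DskipTests package
  # Then iterate on a single suite by ScalaTest name pattern
  mvn '-Dsuites=*BroadcastSuite*' test
else
  echo "mvn or a Spark checkout not available; commands shown for illustration"
fi
```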

Ramping up on most projects involves some amount of supplementing the
documentation with trial and error to figure out what to run, which
"errors" are real errors and which can be ignored, etc., but navigating
that minefield on Spark has proved especially challenging and
time-consuming for me. Some of that comes directly from scala's relatively
slow compilation times and immature build-tooling ecosystem, but that is
the world we live in and it would be nice if Spark took the alleviation of
the resulting pain more seriously, as one of the more interesting and
well-known large scala projects around right now. The official
documentation around how to build different subsets of the codebase is
somewhat sparse [21], and there have been many mixed [22] accounts [23] on
this mailing list about preferred ways to build on mvn vs. sbt (none of
which has made it into official documentation, as far as I've seen).
Expecting new contributors to piece together all of this received
folk-wisdom about how to build/test in a sane way by trawling mailing list
archives seems suboptimal.

Thanks for reading, looking forward to hearing your ideas!

-Ryan

P.S. Is "best practice" for emailing this list to not incorporate any HTML
in the body? It seems like all of the archives I've seen strip it out, but
other people have used it and gmail displays it.


[1]
https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/484c2fb8bc0efa0e39d142087eefa9c3d5292ea3/dev%20run-tests:%20fail
(57 mins)
[2]
https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/ce264e469be3641f061eabd10beb1d71ac243991/mvn%20test:%20fail
(6 mins)
[3]
https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/6bc76c67aeef9c57ddd9fb2ba260fb4189dbb927/mvn%20test%20case:%20pass%20test,%20fail%20subsequent%20compile
(4 mins)
[4]
https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0CCUQFjAB&url=http%3A%2F%2Fapache-spark-user-list.1001560.n3.nabble.com%2Fscalac-crash-when-compiling-DataTypeConversions-scala-td17083.html&ei=aRF6VJrpNKr-iAKDgYGYBQ&usg=AFQjCNHjM9m__Hrumh-ecOsSE00-JkjKBQ&sig2=zDeSqOgs02AXJXj78w5I9g&bvm=bv.80642063,d.cGE&cad=rja
[5]
https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/4ab0bd6e76d9fc5745eb4b45cdf13195d10efaa2/mvn%20test,%20post%20clean,%20need%20dependencies%20built
[6]
https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/f4c7e6fc8c301f869b00598c7b541dac243fb51e/dev%20run-tests,%20post%20clean
(50 mins)
[7]
https://gist.github.com/ryan-williams/57f8bfc9328447fc5b97#file-dev-run-tests-failure-too-many-files-open-then-hang-L5260
(1hr)
[8] https://gist.github.com/ryan-williams/d0164194ad5de03f6e3f (1hr)
[9] https://issues.apache.org/jira/browse/SPARK-3867
[10] https://gist.github.com/ryan-williams/735adf543124c99647cc
[11] https://gist.github.com/ryan-williams/8d149bbcd0c6689ad564
[12]
https://gist.github.com/ryan-williams/07df5c583c9481fe1c14#file-gistfile1-txt-L853
(~90 mins)
[13]
https://gist.github.com/ryan-williams/718f6324af358819b496#file-gistfile1-txt-L852
(91 mins)
[14]
https://gist.github.com/ryan-williams/c06c1f4aa0b16f160965#file-gistfile1-txt-L854
[15] https://gist.github.com/ryan-williams/f8d410b5b9f082039c73
[16] https://gist.github.com/ryan-williams/2e94f55c9287938cf745
[17]
http://apache-spark-user-list.1001560.n3.nabble.com/quot-Too-many-open-files-quot-exception-on-reduceByKey-td2462.html
[18]
http://stackoverflow.com/questions/25707629/why-does-spark-job-fail-with-too-many-open-files
[19] https://issues.apache.org/jira/browse/SPARK-4002
[20] https://issues.apache.org/jira/browse/SPARK-4542
[21]
https://spark.apache.org/docs/latest/building-with-maven.html#spark-tests-in-maven
[22] https://www.mail-archive.com/dev@spark.apache.org/msg06443.html
[23]
http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3CCAOhmDzeUNhuCr41B7KRPTEwMn4cga_2TNpZrWqQB8REekokxzg@mail.gmail.com%3E

Re: Spurious test failures, testing best practices

Posted by Imran Rashid <im...@therashids.com>.
I agree we should separate out the integration tests so it's easy for devs
to just run the other fast tests locally.  I opened a jira for it

https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-4746
On Nov 30, 2014 3:08 PM, "Matei Zaharia" <ma...@gmail.com> wrote:

> Hi Ryan,
>
> As a tip (and maybe this isn't documented well), I normally use SBT for
> development to avoid the slow build process, and use its interactive
> console to run only specific tests. The nice advantage is that SBT can keep
> the Scala compiler loaded and JITed across builds, making it faster to
> iterate. To use it, you can do the following:
>
> - Start the SBT interactive console with sbt/sbt
> - Build your assembly by running the "assembly" target in the assembly
> project: assembly/assembly
> - Run all the tests in one module: core/test
> - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite (this
> also supports tab completion)
>
> Running all the tests does take a while, and I usually just rely on
> Jenkins for that once I've run the tests for the things I believed my patch
> could break. But this is because some of them are integration tests (e.g.
> DistributedSuite, which creates multi-process mini-clusters). Many of the
> individual suites run fast without requiring this, however, so you can pick
> the ones you want. Perhaps we should find a way to tag them so people can
> do a "quick-test" that skips the integration ones.
>
> The assembly builds are annoying but they only take about a minute for me
> on a MacBook Pro with SBT warmed up. The assembly is actually only required
> for some of the "integration" tests (which launch new processes), but I'd
> recommend doing it all the time anyway since it would be very confusing to
> run those with an old assembly. The Scala compiler crash issue can also be
> a problem, but I don't see it very often with SBT. If it happens, I exit
> SBT and do sbt clean.
>
> Anyway, this is useful feedback and I think we should try to improve some
> of these suites, but hopefully you can also try the faster SBT process. At
> the end of the day, if we want integration tests, the whole test process
> will take an hour, but most of the developers I know leave that to Jenkins
> and only run individual tests locally before submitting a patch.
>
> Matei
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>
>
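
Matei's SBT workflow above can also be driven in batch mode from the shell; a
guarded sketch (assuming a Spark checkout of that era, where `sbt/sbt` was the
launcher script):

```shell
# Outside a Spark checkout this only prints a note. Inside one, each batch
# invocation pays JVM startup cost, which is why the interactive console
# (just running sbt/sbt and typing commands at its prompt) iterates faster:
# the compiler stays loaded and JITed across builds.
if [ -x sbt/sbt ]; then
  sbt/sbt assembly/assembly                               # build the assembly jar
  sbt/sbt "core/test-only org.apache.spark.rdd.RDDSuite"  # run one suite
else
  echo "sbt/sbt launcher not found; commands shown for illustration"
fi
```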

Re: Spurious test failures, testing best practices

Posted by Ryan Williams <ry...@gmail.com>.
Thanks Marcelo, "this is just how Maven works (unfortunately)" answers my
question.

Another related question: I tried to use `mvn scala:cc` and discovered that
it only seems to scan the src/main and src/test directories (according to its
docs <http://scala-tools.org/mvnsites/maven-scala-plugin/usage_cc.html>),
and so can only be run from within submodules, not from the root directory.

I'll add a note about this to building-spark.html unless there is a way to
do it for all modules / from the root directory that I've missed. Let me
know!
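
For reference, the continuous-compilation setup being discussed: per the
plugin docs cited above, `scala:cc` watches only the current module's src/main
and src/test, hence the cd into a submodule (a guarded sketch; `core` as the
example module):

```shell
# Outside a Spark checkout this only prints a note. scala:cc runs until
# interrupted, recompiling whenever a watched source file changes.
if command -v mvn >/dev/null 2>&1 && [ -f core/pom.xml ]; then
  cd core        # must be run from within a submodule, not the root
  mvn scala:cc   # watches src/main and src/test of this module only
else
  echo "mvn or a Spark checkout not available; command shown for illustration"
fi
```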





Re: Spurious test failures, testing best practices

Posted by Marcelo Vanzin <va...@cloudera.com>.
On Tue, Dec 2, 2014 at 4:40 PM, Ryan Williams
<ry...@gmail.com> wrote:
>> But you only need to compile the others once.
>
> once... every time I rebase off master, or am obliged to `mvn clean` by some
> other build-correctness bug, as I said before. In my experience this works
> out to a few times per week.

No, you only need to do it if something upstream from core changed (i.e.,
spark-parent, network/common or network/shuffle) in an incompatible
way. Otherwise, you can rebase and just recompile / retest core,
without having to install everything else. I do this kind of thing all
the time. If you have to do "mvn clean" often, you're probably doing
something wrong somewhere else.

I understand where you're coming from, but the way you're thinking is
just not how Maven works. I too find it annoying that Maven requires lots
of things to be "installed" before you can use them, when they're all
part of the same project. But well, that's the way things are.

-- 
Marcelo



Re: Spurious test failures, testing best practices

Posted by Ryan Williams <ry...@gmail.com>.
On Tue Dec 02 2014 at 4:46:20 PM Marcelo Vanzin <va...@cloudera.com> wrote:

> On Tue, Dec 2, 2014 at 3:39 PM, Ryan Williams
> <ry...@gmail.com> wrote:
> > Marcelo: by my count, there are 19 maven modules in the codebase. I am
> > typically only concerned with "core" (and therefore its two dependencies
> as
> > well, `network/{shuffle,common}`).
>
> But you only need to compile the others once.


once... every time I rebase off master, or am obliged to `mvn clean` by
some other build-correctness bug, as I said before. In my experience this
works out to a few times per week.


> Once you've established
> the baseline, you can just compile / test "core" to your heart's
> desire.


I understand that this is a workflow that does what I want as a side effect
of doing 3-5x more work (depending on whether you count [number of modules
built] or [lines of scala/java compiled]), none of the extra work being
useful to me (more on that below).


> Core tests won't even run until you build the assembly anyway,
> since some of them require the assembly to be present.


The tests you refer to are exactly the ones that I'd like to let Jenkins
run from here on out, per advice I was given elsewhere in this thread and
due to the myriad unpleasantries I've encountered in trying to run them
myself.


>
> Also, even if you work in core - I'd say especially if you work in
> core - you should still, at some point, compile and test everything
> else that depends on it.
>

Last response applies.


>
> So, do this ONCE:
>

again, s/ONCE/several times a week/, in my experience.


>
>   mvn install -DskipTests
>
> Then do this as many times as you want:
>
>   mvn -pl spark-core_2.10 something
>
> That doesn't seem too bad to me. (Be aware of the "assembly" comment
> above, since testing spark-core means you may have to rebuild the
> assembly from time to time, if your changes affect those tests.)
>
> > re: Marcelo's comment about "missing the 'spark-parent' project", I saw
> that
> > error message too and tried to ascertain what it could mean. Why would
> > `network/shuffle` need something from the parent project?
>
> The "spark-parent" project is the main pom that defines dependencies
> and their version, along with lots of build plugins and
> configurations. It's needed by all modules to compile correctly.
>

- I understand the parent POM has that information.

- I don't understand why Maven would feel that it is unable to compile the
`network/shuffle` module without having first compiled, packaged, and
installed 17 modules (19 minus `network/shuffle` and its dependency
`network/common`) that are not transitive dependencies of `network/shuffle`.

- I am trying to understand whether my failure to get Maven to compile
`network/shuffle` stems from my not knowing the correct incantation to feed
to Maven or from Maven's having a different (and seemingly worse) model for
how it handles module dependencies than I expected.



>
> --
> Marcelo
>

Re: Spurious test failures, testing best practices

Posted by Marcelo Vanzin <va...@cloudera.com>.
On Tue, Dec 2, 2014 at 3:39 PM, Ryan Williams
<ry...@gmail.com> wrote:
> Marcelo: by my count, there are 19 maven modules in the codebase. I am
> typically only concerned with "core" (and therefore its two dependencies as
> well, `network/{shuffle,common}`).

But you only need to compile the others once. Once you've established
the baseline, you can just compile / test "core" to your heart's
desire. Core tests won't even run until you build the assembly anyway,
since some of them require the assembly to be present.

Also, even if you work in core - I'd say especially if you work in
core - you should still, at some point, compile and test everything
else that depends on it.

So, do this ONCE:

  mvn install -DskipTests

Then do this as many times as you want:

  mvn -pl spark-core_2.10 something

That doesn't seem too bad to me. (Be aware of the "assembly" comment
above, since testing spark-core means you may have to rebuild the
assembly from time to time, if your changes affect those tests.)

> re: Marcelo's comment about "missing the 'spark-parent' project", I saw that
> error message too and tried to ascertain what it could mean. Why would
> `network/shuffle` need something from the parent project?

The "spark-parent" project is the main pom that defines dependencies
and their version, along with lots of build plugins and
configurations. It's needed by all modules to compile correctly.
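For illustration, the reference each module carries to spark-parent looks
roughly like this (values are representative of a 1.3.0-SNAPSHOT checkout,
not copied verbatim from any particular module; `relativePath` varies with
the module's depth in the tree):

```xml
<!-- Illustrative <parent> block from a module POM. Maven must resolve
     this POM (from the relativePath or the local repo) before it can
     even begin building the module, which is why a missing spark-parent
     install fails the build. -->
<parent>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-parent</artifactId>
  <version>1.3.0-SNAPSHOT</version>
  <relativePath>../pom.xml</relativePath>
</parent>
```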

-- 
Marcelo

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: Spurious test failures, testing best practices

Posted by Ryan Williams <ry...@gmail.com>.
Marcelo: by my count, there are 19 maven modules in the codebase. I am
typically only concerned with "core" (and therefore its two dependencies as
well, `network/{shuffle,common}`).

The `mvn package` workflow (and its sbt equivalent) that most people
apparently use involves (for me) compiling+packaging 16 other modules that
I don't care about; I pay this cost whenever I rebase off of master or
encounter the sbt-compiler-crash bug, among other possible scenarios.

Compiling one module (after building/installing its dependencies) seems
like the sort of thing that should be possible, and I don't see why my
previously-documented attempt is failing.

re: Marcelo's comment about "missing the 'spark-parent' project", I saw
that error message too and tried to ascertain what it could mean. Why would
`network/shuffle` need something from the parent project? AFAICT
`network/common` has the same references to the parent project as
`network/shuffle` (namely just a <parent> block in its POM), and yet I can
`mvn install -pl` the former but not the latter. Why would this be? One
difference is that `network/shuffle` has a dependency on another module,
while `network/common` does not.

Does Maven not let you build modules that depend on *any* other modules
without building *all* modules, or is there a way to do this that we've not
found yet?
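(One candidate incantation I have not verified: Maven's `-am`/`--also-make`
flag, which is documented to pull a selected module's in-reactor
dependencies into the same build.)

```shell
# Sketch: build network/shuffle plus whatever reactor modules it depends
# on (network/common and the parent POM), without building all 19 modules.
mvn install -DskipTests -pl network/shuffle -am
```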

Patrick: per my response to Marcelo above, I am trying to avoid having to
compile and package a bunch of stuff I am not using, which both `mvn
package` and `mvn install` on the parent project do.





On Tue Dec 02 2014 at 3:45:48 PM Marcelo Vanzin <va...@cloudera.com> wrote:

> On Tue, Dec 2, 2014 at 2:40 PM, Ryan Williams
> <ry...@gmail.com> wrote:
> > Following on Mark's Maven examples, here is another related issue I'm
> > having:
> >
> > I'd like to compile just the `core` module after a `mvn clean`, without
> > building an assembly JAR first. Is this possible?
>
> Out of curiosity, may I ask why? What's the problem with running "mvn
> install -DskipTests" first (or "package" instead of "install",
> although I generally do the latter)?
>
> You can probably do what you want if you manually build / install all
> the needed dependencies first; you found two, but it seems you're also
> missing the "spark-parent" project (which is the top-level pom). That
> sounds like a lot of trouble though, for not any gains that I can
> see... after the first build you should be able to do what you want
> easily.
>
> --
> Marcelo
>

Re: Spurious test failures, testing best practices

Posted by Marcelo Vanzin <va...@cloudera.com>.
On Tue, Dec 2, 2014 at 2:40 PM, Ryan Williams
<ry...@gmail.com> wrote:
> Following on Mark's Maven examples, here is another related issue I'm
> having:
>
> I'd like to compile just the `core` module after a `mvn clean`, without
> building an assembly JAR first. Is this possible?

Out of curiosity, may I ask why? What's the problem with running "mvn
install -DskipTests" first (or "package" instead of "install",
although I generally do the latter)?

You can probably do what you want if you manually build / install all
the needed dependencies first; you found two, but it seems you're also
missing the "spark-parent" project (which is the top-level pom). That
sounds like a lot of trouble though, for not any gains that I can
see... after the first build you should be able to do what you want
easily.

-- 
Marcelo



Re: Spurious test failures, testing best practices

Posted by Patrick Wendell <pw...@gmail.com>.
Hey Ryan,

What if you run a single "mvn install" to install all libraries
locally - then can you "mvn compile -pl core"? I think this may be the
only way to make it work.

- Patrick

On Tue, Dec 2, 2014 at 2:40 PM, Ryan Williams
<ry...@gmail.com> wrote:
> Following on Mark's Maven examples, here is another related issue I'm
> having:
>
> I'd like to compile just the `core` module after a `mvn clean`, without
> building an assembly JAR first. Is this possible?
>
> Attempting to do it myself, the steps I performed were:
>
> - `mvn compile -pl core`: fails because `core` depends on `network/common`
> and `network/shuffle`, neither of which is installed in my local maven
> cache (and which don't exist in central Maven repositories, I guess? I
> thought Spark is publishing snapshot releases?)
>
> - `network/shuffle` also depends on `network/common`, so I'll `mvn install`
> the latter first: `mvn install -DskipTests -pl network/common`. That
> succeeds, and I see a newly built 1.3.0-SNAPSHOT jar in my local maven
> repository.
>
> - However, `mvn install -DskipTests -pl network/shuffle` subsequently
> fails, seemingly due to not finding network/common. Here's
> <https://gist.github.com/ryan-williams/1711189e7d0af558738d> a sample full
> output from running `mvn install -X -U -DskipTests -pl network/shuffle`
> from such a state (the -U was to get around a previous failure based on
> having cached a failed lookup of network-common-1.3.0-SNAPSHOT).
>
> - Thinking maven might be special-casing "-SNAPSHOT" versions, I tried
> replacing "1.3.0-SNAPSHOT" with "1.3.0.1" globally and repeating these
> steps, but the error seems to be the same
> <https://gist.github.com/ryan-williams/37fcdd14dd92fa562dbe>.
>
> Any ideas?
>
> Thanks,
>
> -Ryan
>
> On Sun Nov 30 2014 at 6:37:28 PM Mark Hamstra <ma...@clearstorydata.com>
> wrote:
>
>> >
>> > - Start the SBT interactive console with sbt/sbt
>> > - Build your assembly by running the "assembly" target in the assembly
>> > project: assembly/assembly
>> > - Run all the tests in one module: core/test
>> > - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite
>> (this
>> > also supports tab completion)
>>
>>
>> The equivalent using Maven:
>>
>> - Start zinc
>> - Build your assembly using the mvn "package" or "install" target
>> ("install" is actually the equivalent of SBT's "publishLocal") -- this step
>> is the first step in
>> http://spark.apache.org/docs/latest/building-with-maven.html#spark-tests-in-maven
>> - Run all the tests in one module: mvn -pl core test
>> - Run a specific suite: mvn -pl core
>> -DwildcardSuites=org.apache.spark.rdd.RDDSuite test (the -pl option isn't
>> strictly necessary if you don't mind waiting for Maven to scan through all
>> the other sub-projects only to do nothing; and, of course, it needs to be
>> something other than "core" if the test you want to run is in another
>> sub-project.)
>>
>> You also typically want to carry along in each subsequent step any relevant
>> command line options you added in the "package"/"install" step.
>>
>> On Sun, Nov 30, 2014 at 3:06 PM, Matei Zaharia <ma...@gmail.com>
>> wrote:
>>
>> > Hi Ryan,
>> >
>> > As a tip (and maybe this isn't documented well), I normally use SBT for
>> > development to avoid the slow build process, and use its interactive
>> > console to run only specific tests. The nice advantage is that SBT can
>> keep
>> > the Scala compiler loaded and JITed across builds, making it faster to
>> > iterate. To use it, you can do the following:
>> >
>> > - Start the SBT interactive console with sbt/sbt
>> > - Build your assembly by running the "assembly" target in the assembly
>> > project: assembly/assembly
>> > - Run all the tests in one module: core/test
>> > - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite
>> (this
>> > also supports tab completion)
>> >
>> > Running all the tests does take a while, and I usually just rely on
>> > Jenkins for that once I've run the tests for the things I believed my
>> patch
>> > could break. But this is because some of them are integration tests (e.g.
>> > DistributedSuite, which creates multi-process mini-clusters). Many of the
>> > individual suites run fast without requiring this, however, so you can
>> pick
>> > the ones you want. Perhaps we should find a way to tag them so people
>> can
>> > do a "quick-test" that skips the integration ones.
>> >
>> > The assembly builds are annoying but they only take about a minute for me
>> > on a MacBook Pro with SBT warmed up. The assembly is actually only
>> required
>> > for some of the "integration" tests (which launch new processes), but I'd
>> > recommend doing it all the time anyway since it would be very confusing
>> to
>> > run those with an old assembly. The Scala compiler crash issue can also
>> be
>> > a problem, but I don't see it very often with SBT. If it happens, I exit
>> > SBT and do sbt clean.
>> >
>> > Anyway, this is useful feedback and I think we should try to improve some
>> > of these suites, but hopefully you can also try the faster SBT process.
>> At
>> > the end of the day, if we want integration tests, the whole test process
>> > will take an hour, but most of the developers I know leave that to
>> Jenkins
>> > and only run individual tests locally before submitting a patch.
>> >
>> > Matei
>> >
>> >
>> > > On Nov 30, 2014, at 2:39 PM, Ryan Williams <
>> > ryan.blake.williams@gmail.com> wrote:
>> > >
>> > > In the course of trying to make contributions to Spark, I have had a
>> lot
>> > of
>> > > trouble running Spark's tests successfully. The main pain points I've
>> > > experienced are:
>> > >
>> > >    1) frequent, spurious test failures
>> > >    2) high latency of running tests
>> > >    3) difficulty running specific tests in an iterative fashion
>> > >
>> > > Here is an example series of failures that I encountered this weekend
>> > > (along with footnote links to the console output from each and
>> > > approximately how long each took):
>> > >
>> > > - `./dev/run-tests` [1]: failure in BroadcastSuite that I've not seen
>> > > before.
>> > > - `mvn '-Dsuites=*BroadcastSuite*' test` [2]: same failure.
>> > > - `mvn '-Dsuites=*BroadcastSuite* Unpersisting' test` [3]:
>> BroadcastSuite
>> > > passed, but scala compiler crashed on the "catalyst" project.
>> > > - `mvn clean`: some attempts to run earlier commands (that previously
>> > > didn't crash the compiler) all result in the same compiler crash.
>> > Previous
>> > > discussion on this list implies this can only be solved by a `mvn
>> clean`
>> > > [4].
>> > > - `mvn '-Dsuites=*BroadcastSuite*' test` [5]: immediately post-clean,
>> > > BroadcastSuite can't run because assembly is not built.
>> > > - `./dev/run-tests` again [6]: pyspark tests fail, some messages about
>> > > version mismatches and python 2.6. The machine this ran on has python
>> > 2.7,
>> > > so I don't know what that's about.
>> > > - `./dev/run-tests` again [7]: "too many open files" errors in several
>> > > tests. `ulimit -a` shows a maximum of 4864 open files. Apparently this
>> is
>> > > not enough, but only some of the time? I increased it to 8192 and tried
>> > > again.
>> > > - `./dev/run-tests` again [8]: same pyspark errors as before. This
>> seems
>> > to
>> > > be the issue from SPARK-3867 [9], which was supposedly fixed on October
>> > 14;
>> > > not sure how I'm seeing it now. In any case, switched to Python 2.6 and
>> > > installed unittest2, and python/run-tests seems to be unblocked.
>> > > - `./dev/run-tests` again [10]: finally passes!
>> > >
>> > > This was on a spark checkout at ceb6281 (ToT Friday), with a few
>> trivial
>> > > changes added on (that I wanted to test before sending out a PR), on a
>> > > macbook running OSX Yosemite (10.10.1), java 1.8 and mvn 3.2.3 [11].
>> > >
>> > > Meanwhile, on a linux 2.6.32 / CentOS 6.4 machine, I tried similar
>> > commands
>> > > from the same repo state:
>> > >
>> > > - `./dev/run-tests` [12]: YarnClusterSuite failure.
>> > > - `./dev/run-tests` [13]: same YarnClusterSuite failure. I know I've
>> seen
>> > > this one before on this machine and am guessing it actually occurs
>> every
>> > > time.
>> > > - `./dev/run-tests` [14]: to be sure, I reverted my changes, ran one
>> more
>> > > time from ceb6281, and saw the same failure.
>> > >
>> > > This was with java 1.7 and maven 3.2.3 [15]. In one final attempt to
>> > narrow
>> > > down the linux YarnClusterSuite failure, I ran `./dev/run-tests` on my
>> > mac,
>> > > from ceb6281, with java 1.7 (instead of 1.8, which the previous runs
>> > used),
>> > > and it passed [16], so the failure seems specific to my linux
>> > machine/arch.
>> > >
>> > > At this point I believe that my changes don't break any tests (the
>> > > YarnClusterSuite failure on my linux presumably not being... "real"),
>> > and I
>> > > am ready to send out a PR. Whew!
>> > >
>> > > However, reflecting on the 5 or 6 distinct failure-modes represented
>> > above:
>> > >
>> > > - One of them (too many files open), is something I can (and did,
>> > > hopefully) fix once and for all. It cost me an ~hour this time
>> > (approximate
>> > > time of running ./dev/run-tests) and a few hours other times when I
>> > didn't
>> > > fully understand/fix it. It doesn't happen deterministically (why?),
>> but
>> > > does happen somewhat frequently to people, having been discussed on the
>> > > user list multiple times [17] and on SO [18]. Maybe some note in the
>> > > documentation advising people to check their ulimit makes sense?
>> > > - One of them (unittest2 must be installed for python 2.6) was
>> supposedly
>> > > fixed upstream of the commits I tested here; I don't know why I'm still
>> > > running into it. This cost me a few hours of running `./dev/run-tests`
>> > > multiple times to see if it was transient, plus some time researching
>> and
>> > > working around it.
>> > > - The original BroadcastSuite failure cost me a few hours and went away
>> > > before I'd even run `mvn clean`.
>> > > - A new incarnation of the sbt-compiler-crash phenomenon cost me a few
>> > > hours of running `./dev/run-tests` in different ways before deciding
>> > that,
>> > > as usual, there was no way around it and that I'd need to run `mvn
>> clean`
>> > > and start running tests from scratch.
>> > > - The YarnClusterSuite failures on my linux box have cost me hours of
>> > > trying to figure out whether they're my fault. I've seen them many
>> times
>> > > over the past weeks/months, plus or minus other failures that have come
>> > and
>> > > gone, and was especially befuddled by them when I was seeing a disjoint
>> > set
>> > > of reproducible failures on my mac [19] (the triaging of which involved
>> > > dozens of runs of `./dev/run-tests`).
>> > >
>> > > While I'm interested in digging into each of these issues, I also want
>> to
>> > > discuss the frequency with which I've run into issues like these. This
>> is
>> > > unfortunately not the first time in recent months that I've spent days
>> > > playing spurious-test-failure whack-a-mole with a 60-90min
>> dev/run-tests
>> > > iteration time, which is no fun! So I am wondering/thinking:
>> > >
>> > > - Do other people experience this level of flakiness from spark tests?
>> > > - Do other people bother running dev/run-tests locally, or just let
>> > Jenkins
>> > > do it during the CR process?
>> > > - Needing to run a full assembly post-clean just to continue running
>> one
>> > > specific test case feels especially wasteful, and the failure output
>> when
>> > > naively attempting to run a specific test without having built an
>> > assembly
>> > > jar is not always clear about what the issue is or how to fix it; even
>> > the
>> > > fact that certain tests require "building the world" is not something I
>> > > would have expected, and has cost me hours of confusion.
>> > >    - Should a person running spark tests assume that they must build an
>> > > assembly JAR before running anything?
>> > >    - Are there some proper "unit" tests that are actually
>> self-contained
>> > /
>> > > able to be run without building an assembly jar?
>> > >    - Can we better document/demarcate which tests have which
>> > dependencies?
>> > >    - Is there something finer-grained than building an assembly JAR
>> that
>> > > is sufficient in some cases?
>> > >        - If so, can we document that?
>> > >        - If not, can we move to a world of finer-grained dependencies
>> for
>> > > some of these?
>> > > - Leaving all of these spurious failures aside, the process of
>> assembling
>> > > and testing a new JAR is not a quick one (40 and 60 mins for me
>> > typically,
>> > > respectively). I would guess that there are dozens (hundreds?) of
>> people
>> > > who build a Spark assembly from various ToTs on any given day, and who
>> > all
>> > > wait on the exact same compilation / assembly steps to occur. Expanding
>> > on
>> > > the recent work to publish nightly snapshots [20], can we do a better
>> job
>> > > caching/sharing compilation artifacts at a more granular level
>> (pre-built
>> > > assembly JARs at each SHA? pre-built JARs per-maven-module, per-SHA?
>> more
>> > > granular maven modules, plus the previous two?), or otherwise save some
>> > of
>> > > the considerable amount of redundant compilation work that I had to do
>> > over
>> > > the course of my odyssey this weekend?
>> > >
>> > > Ramping up on most projects involves some amount of supplementing the
>> > > documentation with trial and error to figure out what to run, which
>> > > "errors" are real errors and which can be ignored, etc., but navigating
>> > > that minefield on Spark has proved especially challenging and
>> > > time-consuming for me. Some of that comes directly from scala's
>> > relatively
>> > > slow compilation times and immature build-tooling ecosystem, but that
>> is
>> > > the world we live in and it would be nice if Spark took the alleviation
>> > of
>> > > the resulting pain more seriously, as one of the more interesting and
>> > > well-known large scala projects around right now. The official
>> > > documentation around how to build different subsets of the codebase is
>> > > somewhat sparse [21], and there have been many mixed [22] accounts [23]
>> > on
>> > > this mailing list about preferred ways to build on mvn vs. sbt (none of
>> > > which has made it into official documentation, as far as I've seen).
>> > > Expecting new contributors to piece together all of this received
>> > > folk-wisdom about how to build/test in a sane way by trawling mailing
>> > list
>> > > archives seems suboptimal.
>> > >
>> > > Thanks for reading, looking forward to hearing your ideas!
>> > >
>> > > -Ryan
>> > >
>> > > P.S. Is "best practice" for emailing this list to not incorporate any
>> > HTML
>> > > in the body? It seems like all of the archives I've seen strip it out,
>> > but
>> > > other people have used it and gmail displays it.
>> > >
>> > >
>> > > [1]
>> > > https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/484c2fb8bc0efa0e39d142087eefa9c3d5292ea3/dev%20run-tests:%20fail
>> > > (57 mins)
>> > > [2]
>> > > https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/ce264e469be3641f061eabd10beb1d71ac243991/mvn%20test:%20fail
>> > > (6 mins)
>> > > [3]
>> > > https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/6bc76c67aeef9c57ddd9fb2ba260fb4189dbb927/mvn%20test%20case:%20pass%20test,%20fail%20subsequent%20compile
>> > > (4 mins)
>> > > [4]
>> > > https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0CCUQFjAB&url=http%3A%2F%2Fapache-spark-user-list.1001560.n3.nabble.com%2Fscalac-crash-when-compiling-DataTypeConversions-scala-td17083.html&ei=aRF6VJrpNKr-iAKDgYGYBQ&usg=AFQjCNHjM9m__Hrumh-ecOsSE00-JkjKBQ&sig2=zDeSqOgs02AXJXj78w5I9g&bvm=bv.80642063,d.cGE&cad=rja
>> > > [5]
>> > > https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/4ab0bd6e76d9fc5745eb4b45cdf13195d10efaa2/mvn%20test,%20post%20clean,%20need%20dependencies%20built
>> > > [6]
>> > > https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/f4c7e6fc8c301f869b00598c7b541dac243fb51e/dev%20run-tests,%20post%20clean
>> > > (50 mins)
>> > > [7]
>> > > https://gist.github.com/ryan-williams/57f8bfc9328447fc5b97#file-dev-run-tests-failure-too-many-files-open-then-hang-L5260
>> > > (1hr)
>> > > [8] https://gist.github.com/ryan-williams/d0164194ad5de03f6e3f (1hr)
>> > > [9] https://issues.apache.org/jira/browse/SPARK-3867
>> > > [10] https://gist.github.com/ryan-williams/735adf543124c99647cc
>> > > [11] https://gist.github.com/ryan-williams/8d149bbcd0c6689ad564
>> > > [12]
>> > > https://gist.github.com/ryan-williams/07df5c583c9481fe1c14#file-gistfile1-txt-L853
>> > > (~90 mins)
>> > > [13]
>> > > https://gist.github.com/ryan-williams/718f6324af358819b496#file-gistfile1-txt-L852
>> > > (91 mins)
>> > > [14]
>> > > https://gist.github.com/ryan-williams/c06c1f4aa0b16f160965#file-gistfile1-txt-L854
>> > > [15] https://gist.github.com/ryan-williams/f8d410b5b9f082039c73
>> > > [16] https://gist.github.com/ryan-williams/2e94f55c9287938cf745
>> > > [17]
>> > > http://apache-spark-user-list.1001560.n3.nabble.com/quot-Too-many-open-files-quot-exception-on-reduceByKey-td2462.html
>> > > [18]
>> > > http://stackoverflow.com/questions/25707629/why-does-spark-job-fail-with-too-many-open-files
>> > > [19] https://issues.apache.org/jira/browse/SPARK-4002
>> > > [20] https://issues.apache.org/jira/browse/SPARK-4542
>> > > [21]
>> > > https://spark.apache.org/docs/latest/building-with-maven.html#spark-tests-in-maven
>> > > [22] https://www.mail-archive.com/dev@spark.apache.org/msg06443.html
>> > > [23]
>> > > http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3CCAOhmDzeUNhuCr41B7KRPTEwMn4cga_2TNpZrWqQB8REekokxzg@mail.gmail.com%3E
>> >
>> >
>> >
>> >
>>



Re: Spurious test failures, testing best practices

Posted by Ryan Williams <ry...@gmail.com>.
Following on Mark's Maven examples, here is another related issue I'm
having:

I'd like to compile just the `core` module after a `mvn clean`, without
building an assembly JAR first. Is this possible?

Attempting to do it myself, the steps I performed were:

- `mvn compile -pl core`: fails because `core` depends on `network/common`
and `network/shuffle`, neither of which is installed in my local maven
cache (and which don't exist in central Maven repositories, I guess? I
thought Spark is publishing snapshot releases?)

- `network/shuffle` also depends on `network/common`, so I'll `mvn install`
the latter first: `mvn install -DskipTests -pl network/common`. That
succeeds, and I see a newly built 1.3.0-SNAPSHOT jar in my local maven
repository.

- However, `mvn install -DskipTests -pl network/shuffle` subsequently
fails, seemingly due to not finding network/common. Here's
<https://gist.github.com/ryan-williams/1711189e7d0af558738d> a sample full
output from running `mvn install -X -U -DskipTests -pl network/shuffle`
from such a state (the -U was to get around a previous failure based on
having cached a failed lookup of network-common-1.3.0-SNAPSHOT).

- Thinking maven might be special-casing "-SNAPSHOT" versions, I tried
replacing "1.3.0-SNAPSHOT" with "1.3.0.1" globally and repeating these
steps, but the error seems to be the same
<https://gist.github.com/ryan-williams/37fcdd14dd92fa562dbe>.

Any ideas?

Thanks,

-Ryan

On Sun Nov 30 2014 at 6:37:28 PM Mark Hamstra <ma...@clearstorydata.com>
wrote:

> >
> > - Start the SBT interactive console with sbt/sbt
> > - Build your assembly by running the "assembly" target in the assembly
> > project: assembly/assembly
> > - Run all the tests in one module: core/test
> > - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite
> (this
> > also supports tab completion)
>
>
> The equivalent using Maven:
>
> - Start zinc
> - Build your assembly using the mvn "package" or "install" target
> ("install" is actually the equivalent of SBT's "publishLocal") -- this step
> is the first step in
> http://spark.apache.org/docs/latest/building-with-maven.html#spark-tests-in-maven
> - Run all the tests in one module: mvn -pl core test
> - Run a specific suite: mvn -pl core
> -DwildcardSuites=org.apache.spark.rdd.RDDSuite test (the -pl option isn't
> strictly necessary if you don't mind waiting for Maven to scan through all
> the other sub-projects only to do nothing; and, of course, it needs to be
> something other than "core" if the test you want to run is in another
> sub-project.)
>
> You also typically want to carry along in each subsequent step any relevant
> command line options you added in the "package"/"install" step.
>
> On Sun, Nov 30, 2014 at 3:06 PM, Matei Zaharia <ma...@gmail.com>
> wrote:
>
> > Hi Ryan,
> >
> > As a tip (and maybe this isn't documented well), I normally use SBT for
> > development to avoid the slow build process, and use its interactive
> > console to run only specific tests. The nice advantage is that SBT can
> keep
> > the Scala compiler loaded and JITed across builds, making it faster to
> > iterate. To use it, you can do the following:
> >
> > - Start the SBT interactive console with sbt/sbt
> > - Build your assembly by running the "assembly" target in the assembly
> > project: assembly/assembly
> > - Run all the tests in one module: core/test
> > - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite
> (this
> > also supports tab completion)
> >
> > Running all the tests does take a while, and I usually just rely on
> > Jenkins for that once I've run the tests for the things I believed my
> patch
> > could break. But this is because some of them are integration tests (e.g.
> > DistributedSuite, which creates multi-process mini-clusters). Many of the
> > individual suites run fast without requiring this, however, so you can
> pick
> > the ones you want. Perhaps we should find a way to tag them so people
> can
> > do a "quick-test" that skips the integration ones.
> >
> > The assembly builds are annoying but they only take about a minute for me
> > on a MacBook Pro with SBT warmed up. The assembly is actually only
> required
> > for some of the "integration" tests (which launch new processes), but I'd
> > recommend doing it all the time anyway since it would be very confusing
> to
> > run those with an old assembly. The Scala compiler crash issue can also
> be
> > a problem, but I don't see it very often with SBT. If it happens, I exit
> > SBT and do sbt clean.
> >
> > Anyway, this is useful feedback and I think we should try to improve some
> > of these suites, but hopefully you can also try the faster SBT process.
> At
> > the end of the day, if we want integration tests, the whole test process
> > will take an hour, but most of the developers I know leave that to
> Jenkins
> > and only run individual tests locally before submitting a patch.
> >
> > Matei
> >
> >
> > > On Nov 30, 2014, at 2:39 PM, Ryan Williams <
> > ryan.blake.williams@gmail.com> wrote:
> > >
> > > In the course of trying to make contributions to Spark, I have had a
> lot
> > of
> > > trouble running Spark's tests successfully. The main pain points I've
> > > experienced are:
> > >
> > >    1) frequent, spurious test failures
> > >    2) high latency of running tests
> > >    3) difficulty running specific tests in an iterative fashion
> > >
> > > Here is an example series of failures that I encountered this weekend
> > > (along with footnote links to the console output from each and
> > > approximately how long each took):
> > >
> > > - `./dev/run-tests` [1]: failure in BroadcastSuite that I've not seen
> > > before.
> > > - `mvn '-Dsuites=*BroadcastSuite*' test` [2]: same failure.
> > > - `mvn '-Dsuites=*BroadcastSuite* Unpersisting' test` [3]:
> BroadcastSuite
> > > passed, but scala compiler crashed on the "catalyst" project.
> > > - `mvn clean`: some attempts to run earlier commands (that previously
> > > didn't crash the compiler) all result in the same compiler crash.
> > Previous
> > > discussion on this list implies this can only be solved by a `mvn
> clean`
> > > [4].
> > > - `mvn '-Dsuites=*BroadcastSuite*' test` [5]: immediately post-clean,
> > > BroadcastSuite can't run because assembly is not built.
> > > - `./dev/run-tests` again [6]: pyspark tests fail, some messages about
> > > version mismatches and python 2.6. The machine this ran on has python
> > 2.7,

Re: Spurious test failures, testing best practices

Posted by Mark Hamstra <ma...@clearstorydata.com>.
>
> - Start the SBT interactive console with sbt/sbt
> - Build your assembly by running the "assembly" target in the assembly
> project: assembly/assembly
> - Run all the tests in one module: core/test
> - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite (this
> also supports tab completion)


The equivalent using Maven:

- Start zinc
- Build your assembly using the mvn "package" or "install" target
("install" is actually the equivalent of SBT's "publishLocal") -- this is
the first step described in
http://spark.apache.org/docs/latest/building-with-maven.html#spark-tests-in-maven
- Run all the tests in one module: mvn -pl core test
- Run a specific suite: mvn -pl core
-DwildcardSuites=org.apache.spark.rdd.RDDSuite test (the -pl option isn't
strictly necessary if you don't mind waiting for Maven to scan through all
the other sub-projects only to do nothing; and, of course, it needs to be
something other than "core" if the test you want to run is in another
sub-project.)

You also typically want to carry along in each subsequent step any relevant
command line options you added in the "package"/"install" step.
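
To make the correspondence between the two workflows explicit, here is a
small side-by-side sketch (not authoritative -- the suite name is just an
example, and exact invocations may differ by Spark version). Each function
echoes its command line rather than running it, so the sketch is
self-contained:

```shell
#!/bin/sh
# Side-by-side sketch of the SBT and Maven invocations described above.
# Assumption: a Spark checkout of this era; SUITE is an example suite name.
SUITE=org.apache.spark.rdd.RDDSuite

sbt_assembly() { echo 'sbt/sbt assembly/assembly'; }
sbt_module()   { echo 'sbt/sbt core/test'; }
sbt_suite()    { echo "sbt/sbt \"core/test-only $SUITE\""; }

# "install" is roughly SBT's publishLocal; skip tests during the build step.
mvn_assembly() { echo 'mvn -DskipTests install'; }
mvn_module()   { echo 'mvn -pl core test'; }
mvn_suite()    { echo "mvn -pl core -DwildcardSuites=$SUITE test"; }

sbt_assembly; mvn_assembly   # build the assembly first
sbt_module;   mvn_module     # all tests in one module
sbt_suite;    mvn_suite      # a single suite
```

Each function just prints the command to copy/paste; drop the echo wrappers
to run them directly.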

On Sun, Nov 30, 2014 at 3:06 PM, Matei Zaharia <ma...@gmail.com>
wrote:

> Hi Ryan,
>
> As a tip (and maybe this isn't documented well), I normally use SBT for
> development to avoid the slow build process, and use its interactive
> console to run only specific tests. The nice advantage is that SBT can keep
> the Scala compiler loaded and JITed across builds, making it faster to
> iterate. To use it, you can do the following:
>
> - Start the SBT interactive console with sbt/sbt
> - Build your assembly by running the "assembly" target in the assembly
> project: assembly/assembly
> - Run all the tests in one module: core/test
> - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite (this
> also supports tab completion)
>
> Running all the tests does take a while, and I usually just rely on
> Jenkins for that once I've run the tests for the things I believed my patch
> could break. But this is because some of them are integration tests (e.g.
> DistributedSuite, which creates multi-process mini-clusters). Many of the
> individual suites run fast without requiring this, however, so you can pick
> the ones you want. Perhaps we should find a way to tag them so people can
> do a "quick-test" that skips the integration ones.
>
> The assembly builds are annoying but they only take about a minute for me
> on a MacBook Pro with SBT warmed up. The assembly is actually only required
> for some of the "integration" tests (which launch new processes), but I'd
> recommend doing it all the time anyway since it would be very confusing to
> run those with an old assembly. The Scala compiler crash issue can also be
> a problem, but I don't see it very often with SBT. If it happens, I exit
> SBT and do sbt clean.
>
> Anyway, this is useful feedback and I think we should try to improve some
> of these suites, but hopefully you can also try the faster SBT process. At
> the end of the day, if we want integration tests, the whole test process
> will take an hour, but most of the developers I know leave that to Jenkins
> and only run individual tests locally before submitting a patch.
>
> Matei
>
>

Re: Spurious test failures, testing best practices

Posted by Ryan Williams <ry...@gmail.com>.
Thanks Mark, most of those commands are ones I've been using (and used in
my original post), except for "Start zinc". I now see the section about it
on the "unpublished" building-spark
<https://github.com/apache/spark/blob/master/docs/building-spark.md#speeding-up-compilation-with-zinc>
page and will try using it.

Even so, finding those commands took a nontrivial amount of trial and
error, and I've not seen them well documented outside of this list (your
and Matei's emails, like previous emails to this list, each have more info
about building/testing with Maven and SBT, respectively, than building-spark
<https://github.com/apache/spark/blob/master/docs/building-spark.md#spark-tests-in-maven>
does). The per-suite invocation can still require an assembly in some cases
("without warning", from my perspective, since I haven't read up on the
names of all Spark integration tests), spurious failures still abound,
there's no good way to run only the things that a given change could
actually have broken, etc.

Anyway, hopefully zinc brings me to the world of ~minute iteration times
that have been reported on this thread.
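
For anyone else trying this, a minimal sketch of the zinc warm-up step
(flags are based on the standalone zinc 0.3.x distribution and the install
path is an assumption -- adjust for however you installed it):

```shell
#!/bin/sh
# Start zinc once per session so subsequent Maven compiles reuse a warm,
# JITed Scala compiler instead of paying JVM/compiler startup each time.
zinc_ready() {
  if command -v zinc >/dev/null 2>&1; then
    # -status exits non-zero when no server is running; start one then.
    zinc -status >/dev/null 2>&1 || zinc -start
    echo "yes"
  else
    echo "no zinc on PATH"
  fi
}
zinc_ready
# With zinc running, per-suite iteration looks like:
#   mvn -pl core -DwildcardSuites=org.apache.spark.rdd.RDDSuite test
```

The function degrades gracefully when zinc isn't installed, so it's safe to
drop into a setup script.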


On Sun Nov 30 2014 at 6:53:57 PM Ryan Williams <
ryan.blake.williams@gmail.com> wrote:

> Thanks Nicholas, glad to hear that some of this info will be pushed to the
> main site soon, but this brings up yet another point of confusion that I've
> struggled with, namely whether the documentation on github or that on
> spark.apache.org should be considered the primary reference for people
> seeking to learn about best practices for developing Spark.
>
> Trying to read docs starting from
> https://github.com/apache/spark/blob/master/docs/index.md right now, I
> find that all of the links to other parts of the documentation are broken:
> they point to relative paths that end in ".html", which will work when
> published on the docs-site, but that would have to end in ".md" if a person
> was to be able to navigate them on github.
>
> So expecting people to use the up-to-date docs on github (where all
> internal URLs 404 and the main github README suggests that the "latest
> Spark documentation" can be found on the actually-months-old docs-site
> <https://github.com/apache/spark#online-documentation>) is not a good
> solution. On the other hand, consulting months-old docs on the site is also
> problematic, as this thread and your last email have borne out.  The result
> is that there is no good place on the internet to learn about the most
> up-to-date best practices for using/developing Spark.
>
> Why not build http://spark.apache.org/docs/latest/ nightly (or every
> commit) off of what's in github, rather than having that URL point to the
> last release's docs (up to ~3 months old)? This way, casual users who want
> the docs for the released version they happen to be using (which is already
> frequently != "/latest" today, for many Spark users) can (still) find them
> at http://spark.apache.org/docs/X.Y.Z, and the github README can safely
> point people to a site (/latest) that actually has up-to-date docs that
> reflect ToT and whose links work.
>
> If there are concerns about existing semantics around "/latest" URLs being
> broken, some new URL could be used, like
> http://spark.apache.org/docs/snapshot/, but given that everything under
> http://spark.apache.org/docs/latest/ is in a state of
> planned-backwards-incompatible-changes every ~3mos, that doesn't sound like
> that serious an issue to me; anyone sending around permanent links to
> things under /latest is already going to have those links break / not make
> sense in the near future.
>
>
> On Sun Nov 30 2014 at 5:24:33 PM Nicholas Chammas <
> nicholas.chammas@gmail.com> wrote:
>
>>
>>    - currently the docs only contain information about building with
>>    maven,
>>    and even then don’t cover many important cases
>>
>>  All other points aside, I just want to point out that the docs document
>> how to use both Maven and SBT, and clearly state
>> <https://github.com/apache/spark/blob/master/docs/building-spark.md#building-with-sbt>
>> that Maven is the “build of reference” while SBT may be preferable for
>> day-to-day development.
>>
>> I believe the main reason most people miss this documentation is that,
>> though it’s up-to-date on GitHub, it has’t been published yet to the docs
>> site. It should go out with the 1.2 release.
>>
>> Improvements to the documentation on building Spark belong here:
>> https://github.com/apache/spark/blob/master/docs/building-spark.md
>>
>> If there are clear recommendations that come out of this thread but are
>> not in that doc, they should be added in there. Other, less important
>> details may possibly be better suited for the Contributing to Spark
>> <https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark>
>> guide.
>>
>> Nick
>> ​
>>
>> On Sun Nov 30 2014 at 6:50:55 PM Patrick Wendell <pw...@gmail.com>
>> wrote:
>>
>>> Hey Ryan,
>>>
>>> A few more things here. You should feel free to send patches to
>>> Jenkins to test them, since this is the reference environment in which
>>> we regularly run tests. This is the normal workflow for most
>>> developers and we spend a lot of effort provisioning/maintaining a
>>> very large jenkins cluster to allow developers access this resource. A
>>> common development approach is to locally run tests that you've added
>>> in a patch, then send it to jenkins for the full run, and then try to
>>> debug locally if you see specific unanticipated test failures.
>>>
>>> One challenge we have is that given the proliferation of OS versions,
>>> Java versions, Python versions, ulimits, etc. there is a combinatorial
>>> number of environments in which tests could be run. It is very hard in
>>> some cases to figure out post-hoc why a given test is not working in a
>>> specific environment. I think a good solution here would be to use a
>>> standardized docker container for running Spark tests and asking folks
>>> to use that locally if they are trying to run all of the hundreds of
>>> Spark tests.
>>>
>>> Another solution would be to mock out every system interaction in
>>> Spark's tests including e.g. filesystem interactions to try and reduce
>>> variance across environments. However, that seems difficult.
>>>
>>> As the number of developers of Spark increases, it's definitely a good
>>> idea for us to invest in developer infrastructure including things
>>> like snapshot releases, better documentation, etc. Thanks for bringing
>>> this up as a pain point.
>>>
>>> - Patrick
>>>
>>>
>>> On Sun, Nov 30, 2014 at 3:35 PM, Ryan Williams
>>> <ry...@gmail.com> wrote:
>>> > thanks for the info, Matei and Brennon. I will try to switch my
>>> workflow to
>>> > using sbt. Other potential action items:
>>> >
>>> > - currently the docs only contain information about building with
>>> maven,
>>> > and even then don't cover many important cases, as I described in my
>>> > previous email. If SBT is as much better as you've described then that
>>> > should be made much more obvious. Wasn't it the case recently that
>>> there
>>> > was only a page about building with SBT, and not one about building
>>> with
>>> > maven? Clearer messaging around this needs to exist in the
>>> documentation,
>>> > not just on the mailing list, imho.
>>> >
>>> > - +1 to better distinguishing between unit and integration tests,
>>> having
>>> > separate scripts for each, improving documentation around common
>>> workflows,
>>> > expectations of brittleness with each kind of test, advisability of
>>> just
>>> > relying on Jenkins for certain kinds of tests to not waste too much
>>> time,
>>> > etc. Things like the compiler crash should be discussed in the
>>> > documentation, not just in the mailing list archives, if new
>>> contributors
>>> > are likely to run into them through no fault of their own.
>>> >
>>> > - What is the algorithm you use to decide what tests you might have
>>> broken?
>>> > Can we codify it in some scripts that other people can use?
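One hypothetical starting point for such a script: derive, from the files a patch touches, the modules whose test suites are worth running locally. The module mapping below is a deliberate simplification and `changed_modules` is an invented name; in a real checkout the file list would come from something like `git diff --name-only master`, but it is supplied inline here so the sketch is self-contained:

```shell
# Map changed file paths (one per line on stdin) to the top-level Spark
# modules containing them, treating sql/* subprojects as separate modules.
changed_modules() {
  awk -F/ '{ m = $1; if (m == "sql") m = m "/" $2; print m }' | sort -u
}

printf '%s\n' \
  'core/src/main/scala/org/apache/spark/rdd/RDD.scala' \
  'sql/catalyst/src/main/scala/Expr.scala' \
  'core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala' \
  | changed_modules
# prints:
# core
# sql/catalyst
```

Each emitted module name could then be handed to `mvn -pl <module> test` or `sbt <module>/test`.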
>>> >
>>> >
>>> >
>>> > On Sun Nov 30 2014 at 4:06:41 PM Matei Zaharia <matei.zaharia@gmail.com>
>>> > wrote:
>>> >
>>> >> Hi Ryan,
>>> >>
>>> >> As a tip (and maybe this isn't documented well), I normally use SBT
>>> for
>>> >> development to avoid the slow build process, and use its interactive
>>> >> console to run only specific tests. The nice advantage is that SBT
>>> can keep
>>> >> the Scala compiler loaded and JITed across builds, making it faster to
>>> >> iterate. To use it, you can do the following:
>>> >>
>>> >> - Start the SBT interactive console with sbt/sbt
>>> >> - Build your assembly by running the "assembly" target in the assembly
>>> >> project: assembly/assembly
>>> >> - Run all the tests in one module: core/test
>>> >> - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite
>>> (this
>>> >> also supports tab completion)
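Put together, a session in a warmed-up SBT console might look like the transcript below (a sketch, not verbatim output; the `~` prefix is SBT's triggered execution, which re-runs the command whenever a source file changes and fits the iterative workflow being described):

```text
$ sbt/sbt
> assembly/assembly
> core/test
> core/test-only org.apache.spark.rdd.RDDSuite
> ~core/test-only org.apache.spark.rdd.RDDSuite   # re-runs on every save
```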
>>> >>
>>> >> Running all the tests does take a while, and I usually just rely on
>>> >> Jenkins for that once I've run the tests for the things I believed my
>>> patch
>>> >> could break. But this is because some of them are integration tests
>>> (e.g.
>>> >> DistributedSuite, which creates multi-process mini-clusters). Many of
>>> the
>>> >> individual suites run fast without requiring this, however, so you
>>> can pick
>>> >> the ones you want. Perhaps we should find a way to tag them so
>>> people  can
>>> >> do a "quick-test" that skips the integration ones.
>>> >>
>>> >> The assembly builds are annoying but they only take about a minute
>>> for me
>>> >> on a MacBook Pro with SBT warmed up. The assembly is actually only
>>> required
>>> >> for some of the "integration" tests (which launch new processes), but
>>> I'd
>>> >> recommend doing it all the time anyway since it would be very
>>> confusing to
>>> >> run those with an old assembly. The Scala compiler crash issue can
>>> also be
>>> >> a problem, but I don't see it very often with SBT. If it happens, I
>>> exit
>>> >> SBT and do sbt clean.
>>> >>
>>> >> Anyway, this is useful feedback and I think we should try to improve
>>> some
>>> >> of these suites, but hopefully you can also try the faster SBT
>>> process. At
>>> >> the end of the day, if we want integration tests, the whole test
>>> process
>>> >> will take an hour, but most of the developers I know leave that to
>>> Jenkins
>>> >> and only run individual tests locally before submitting a patch.
>>> >>
>>> >> Matei
>>> >>
>>> >>
>>> >> > On Nov 30, 2014, at 2:39 PM, Ryan Williams <
>>> >> ryan.blake.williams@gmail.com> wrote:
>>> >> >
>>> >> > In the course of trying to make contributions to Spark, I have had
>>> a lot
>>> >> of
>>> >> > trouble running Spark's tests successfully. The main pain points
>>> I've
>>> >> > experienced are:
>>> >> >
>>> >> >    1) frequent, spurious test failures
>>> >> >    2) high latency of running tests
>>> >> >    3) difficulty running specific tests in an iterative fashion
>>> >> >
>>> >> > Here is an example series of failures that I encountered this
>>> weekend
>>> >> > (along with footnote links to the console output from each and
>>> >> > approximately how long each took):
>>> >> >
>>> >> > - `./dev/run-tests` [1]: failure in BroadcastSuite that I've not
>>> seen
>>> >> > before.
>>> >> > - `mvn '-Dsuites=*BroadcastSuite*' test` [2]: same failure.
>>> >> > - `mvn '-Dsuites=*BroadcastSuite* Unpersisting' test` [3]:
>>> BroadcastSuite
>>> >> > passed, but scala compiler crashed on the "catalyst" project.
>>> >> > - `mvn clean`: some attempts to run earlier commands (that
>>> previously
>>> >> > didn't crash the compiler) all result in the same compiler crash.
>>> >> Previous
>>> >> > discussion on this list implies this can only be solved by a `mvn
>>> clean`
>>> >> > [4].
>>> >> > - `mvn '-Dsuites=*BroadcastSuite*' test` [5]: immediately
>>> post-clean,
>>> >> > BroadcastSuite can't run because assembly is not built.
>>> >> > - `./dev/run-tests` again [6]: pyspark tests fail, some messages
>>> about
>>> >> > version mismatches and python 2.6. The machine this ran on has
>>> python
>>> >> 2.7,
>>> >> > so I don't know what that's about.
>>> >> > - `./dev/run-tests` again [7]: "too many open files" errors in
>>> several
>>> >> > tests. `ulimit -a` shows a maximum of 4864 open files. Apparently
>>> this is
>>> >> > not enough, but only some of the time? I increased it to 8192 and
>>> tried
>>> >> > again.
>>> >> > - `./dev/run-tests` again [8]: same pyspark errors as before. This
>>> seems
>>> >> to
>>> >> > be the issue from SPARK-3867 [9], which was supposedly fixed on
>>> October
>>> >> 14;
>>> >> > not sure how I'm seeing it now. In any case, switched to Python 2.6
>>> and
>>> >> > installed unittest2, and python/run-tests seems to be unblocked.
>>> >> > - `./dev/run-tests` again [10]: finally passes!
>>> >> >
>>> >> > This was on a spark checkout at ceb6281 (ToT Friday), with a few
>>> trivial
>>> >> > changes added on (that I wanted to test before sending out a PR),
>>> on a
>>> >> > macbook running OSX Yosemite (10.10.1), java 1.8 and mvn 3.2.3 [11].
>>> >> >
>>> >> > Meanwhile, on a linux 2.6.32 / CentOS 6.4 machine, I tried similar
>>> >> commands
>>> >> > from the same repo state:
>>> >> >
>>> >> > - `./dev/run-tests` [12]: YarnClusterSuite failure.
>>> >> > - `./dev/run-tests` [13]: same YarnClusterSuite failure. I know
>>> I've seen
>>> >> > this one before on this machine and am guessing it actually occurs
>>> every
>>> >> > time.
>>> >> > - `./dev/run-tests` [14]: to be sure, I reverted my changes, ran
>>> one more
>>> >> > time from ceb6281, and saw the same failure.
>>> >> >
>>> >> > This was with java 1.7 and maven 3.2.3 [15]. In one final attempt to
>>> >> narrow
>>> >> > down the linux YarnClusterSuite failure, I ran `./dev/run-tests` on
>>> my
>>> >> mac,
>>> >> > from ceb6281, with java 1.7 (instead of 1.8, which the previous runs
>>> >> used),
>>> >> > and it passed [16], so the failure seems specific to my linux
>>> >> machine/arch.
>>> >> >
>>> >> > At this point I believe that my changes don't break any tests (the
>>> >> > YarnClusterSuite failure on my linux presumably not being...
>>> "real"),
>>> >> and I
>>> >> > am ready to send out a PR. Whew!
>>> >> >
>>> >> > However, reflecting on the 5 or 6 distinct failure-modes represented
>>> >> above:
>>> >> >
>>> >> > - One of them (too many files open) is something I can (and did,
>>> >> > hopefully) fix once and for all. It cost me an ~hour this time
>>> >> (approximate
>>> >> > time of running ./dev/run-tests) and a few hours other times when I
>>> >> didn't
>>> >> > fully understand/fix it. It doesn't happen deterministically
>>> (why?), but
>>> >> > does happen somewhat frequently to people, having been discussed on
>>> the
>>> >> > user list multiple times [17] and on SO [18]. Maybe some note in the
>>> >> > documentation advising people to check their ulimit makes sense?
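For anyone hitting the same wall, a quick sketch of the check and the session-local workaround (8192 is simply the value that worked above, not a recommendation):

```shell
# Inspect the soft and hard limits on open file descriptors.
soft=$(ulimit -Sn)
hard=$(ulimit -Hn)
echo "open files: soft=$soft hard=$hard"

# Raise the soft limit for this shell session only (cannot exceed the hard
# limit); a persistent fix belongs in limits.conf or the shell profile.
if [ "$hard" = "unlimited" ] || [ "$hard" -ge 8192 ]; then
  ulimit -Sn 8192
fi
echo "soft limit now: $(ulimit -Sn)"
```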
>>> >> > - One of them (unittest2 must be installed for python 2.6) was
>>> supposedly
>>> >> > fixed upstream of the commits I tested here; I don't know why I'm
>>> still
>>> >> > running into it. This cost me a few hours of running
>>> `./dev/run-tests`
>>> >> > multiple times to see if it was transient, plus some time
>>> researching and
>>> >> > working around it.
>>> >> > - The original BroadcastSuite failure cost me a few hours and went
>>> away
>>> >> > before I'd even run `mvn clean`.
>>> >> > - A new incarnation of the sbt-compiler-crash phenomenon cost me a
>>> few
>>> >> > hours of running `./dev/run-tests` in different ways before deciding
>>> >> that,
>>> >> > as usual, there was no way around it and that I'd need to run `mvn
>>> clean`
>>> >> > and start running tests from scratch.
>>> >> > - The YarnClusterSuite failures on my linux box have cost me hours
>>> of
>>> >> > trying to figure out whether they're my fault. I've seen them many
>>> times
>>> >> > over the past weeks/months, plus or minus other failures that have
>>> come
>>> >> and
>>> >> > gone, and was especially befuddled by them when I was seeing a
>>> disjoint
>>> >> set
>>> >> > of reproducible failures on my mac [19] (the triaging of which
>>> involved
>>> >> > dozens of runs of `./dev/run-tests`).
>>> >> >
>>> >> > While I'm interested in digging into each of these issues, I also
>>> want to
>>> >> > discuss the frequency with which I've run into issues like these.
>>> This is
>>> >> > unfortunately not the first time in recent months that I've spent
>>> days
>>> >> > playing spurious-test-failure whack-a-mole with a 60-90min
>>> dev/run-tests
>>> >> > iteration time, which is no fun! So I am wondering/thinking:
>>> >> >
>>> >> > - Do other people experience this level of flakiness from spark
>>> tests?
>>> >> > - Do other people bother running dev/run-tests locally, or just let
>>> >> Jenkins
>>> >> > do it during the CR process?
>>> >> > - Needing to run a full assembly post-clean just to continue
>>> running one
>>> >> > specific test case feels especially wasteful, and the failure
>>> output when
>>> >> > naively attempting to run a specific test without having built an
>>> >> assembly
>>> >> > jar is not always clear about what the issue is or how to fix it;
>>> even
>>> >> the
>>> >> > fact that certain tests require "building the world" is not
>>> something I
>>> >> > would have expected, and has cost me hours of confusion.
>>> >> >    - Should a person running spark tests assume that they must
>>> build an
>>> >> > assembly JAR before running anything?
>>> >> >    - Are there some proper "unit" tests that are actually
>>> self-contained
>>> >> /
>>> >> > able to be run without building an assembly jar?
>>> >> >    - Can we better document/demarcate which tests have which
>>> >> dependencies?
>>> >> >    - Is there something finer-grained than building an assembly JAR
>>> that
>>> >> > is sufficient in some cases?
>>> >> >        - If so, can we document that?
>>> >> >        - If not, can we move to a world of finer-grained
>>> dependencies for
>>> >> > some of these?
>>> >> > - Leaving all of these spurious failures aside, the process of
>>> assembling
>>> >> > and testing a new JAR is not a quick one (40 and 60 mins for me
>>> >> typically,
>>> >> > respectively). I would guess that there are dozens (hundreds?) of
>>> people
>>> >> > who build a Spark assembly from various ToTs on any given day, and
>>> who
>>> >> all
>>> >> > wait on the exact same compilation / assembly steps to occur.
>>> Expanding
>>> >> on
>>> >> > the recent work to publish nightly snapshots [20], can we do a
>>> better job
>>> >> > caching/sharing compilation artifacts at a more granular level
>>> (pre-built
>>> >> > assembly JARs at each SHA? pre-built JARs per-maven-module,
>>> per-SHA? more
>>> >> > granular maven modules, plus the previous two?), or otherwise save
>>> some
>>> >> of
>>> >> > the considerable amount of redundant compilation work that I had to
>>> do
>>> >> over
>>> >> > the course of my odyssey this weekend?
>>> >> >
>>> >> > Ramping up on most projects involves some amount of supplementing
>>> the
>>> >> > documentation with trial and error to figure out what to run, which
>>> >> > "errors" are real errors and which can be ignored, etc., but
>>> navigating
>>> >> > that minefield on Spark has proved especially challenging and
>>> >> > time-consuming for me. Some of that comes directly from scala's
>>> >> relatively
>>> >> > slow compilation times and immature build-tooling ecosystem, but
>>> that is
>>> >> > the world we live in and it would be nice if Spark took the
>>> alleviation
>>> >> of
>>> >> > the resulting pain more seriously, as one of the more interesting
>>> and
>>> >> > well-known large scala projects around right now. The official
>>> >> > documentation around how to build different subsets of the codebase
>>> is
>>> >> > somewhat sparse [21], and there have been many mixed [22] accounts
>>> [23]
>>> >> on
>>> >> > this mailing list about preferred ways to build on mvn vs. sbt
>>> (none of
>>> >> > which has made it into official documentation, as far as I've seen).
>>> >> > Expecting new contributors to piece together all of this received
>>> >> > folk-wisdom about how to build/test in a sane way by trawling
>>> mailing
>>> >> list
>>> >> > archives seems suboptimal.
>>> >> >
>>> >> > Thanks for reading, looking forward to hearing your ideas!
>>> >> >
>>> >> > -Ryan
>>> >> >
>>> >> > P.S. Is "best practice" for emailing this list to not incorporate
>>> any
>>> >> HTML
>>> >> > in the body? It seems like all of the archives I've seen strip it
>>> out,
>>> >> but
>>> >> > other people have used it and gmail displays it.
>>> >> >
>>> >> >
>>> >> > [1] https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/484c2fb8bc0efa0e39d142087eefa9c3d5292ea3/dev%20run-tests:%20fail (57 mins)
>>> >> > [2] https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/ce264e469be3641f061eabd10beb1d71ac243991/mvn%20test:%20fail (6 mins)
>>> >> > [3] https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/6bc76c67aeef9c57ddd9fb2ba260fb4189dbb927/mvn%20test%20case:%20pass%20test,%20fail%20subsequent%20compile (4 mins)
>>> >> > [4] https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0CCUQFjAB&url=http%3A%2F%2Fapache-spark-user-list.1001560.n3.nabble.com%2Fscalac-crash-when-compiling-DataTypeConversions-scala-td17083.html&ei=aRF6VJrpNKr-iAKDgYGYBQ&usg=AFQjCNHjM9m__Hrumh-ecOsSE00-JkjKBQ&sig2=zDeSqOgs02AXJXj78w5I9g&bvm=bv.80642063,d.cGE&cad=rja
>>> >> > [5] https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/4ab0bd6e76d9fc5745eb4b45cdf13195d10efaa2/mvn%20test,%20post%20clean,%20need%20dependencies%20built
>>> >> > [6] https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/f4c7e6fc8c301f869b00598c7b541dac243fb51e/dev%20run-tests,%20post%20clean (50 mins)
>>> >> > [7] https://gist.github.com/ryan-williams/57f8bfc9328447fc5b97#file-dev-run-tests-failure-too-many-files-open-then-hang-L5260 (1hr)
>>> >> > [8] https://gist.github.com/ryan-williams/d0164194ad5de03f6e3f (1hr)
>>> >> > [9] https://issues.apache.org/jira/browse/SPARK-3867
>>> >> > [10] https://gist.github.com/ryan-williams/735adf543124c99647cc
>>> >> > [11] https://gist.github.com/ryan-williams/8d149bbcd0c6689ad564
>>> >> > [12] https://gist.github.com/ryan-williams/07df5c583c9481fe1c14#file-gistfile1-txt-L853 (~90 mins)
>>> >> > [13] https://gist.github.com/ryan-williams/718f6324af358819b496#file-gistfile1-txt-L852 (91 mins)
>>> >> > [14] https://gist.github.com/ryan-williams/c06c1f4aa0b16f160965#file-gistfile1-txt-L854
>>> >> > [15] https://gist.github.com/ryan-williams/f8d410b5b9f082039c73
>>> >> > [16] https://gist.github.com/ryan-williams/2e94f55c9287938cf745
>>> >> > [17] http://apache-spark-user-list.1001560.n3.nabble.com/quot-Too-many-open-files-quot-exception-on-reduceByKey-td2462.html
>>> >> > [18] http://stackoverflow.com/questions/25707629/why-does-spark-job-fail-with-too-many-open-files
>>> >> > [19] https://issues.apache.org/jira/browse/SPARK-4002
>>> >> > [20] https://issues.apache.org/jira/browse/SPARK-4542
>>> >> > [21] https://spark.apache.org/docs/latest/building-with-maven.html#spark-tests-in-maven
>>> >> > [22] https://www.mail-archive.com/dev@spark.apache.org/msg06443.html
>>> >> > [23] http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3CCAOhmDzeUNhuCr41B7KRPTEwMn4cga_2TNpZrWqQB8REekokxzg@mail.gmail.com%3E
>>> >>
>>> >>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>
>>>

Re: Spurious test failures, testing best practices

Posted by Ryan Williams <ry...@gmail.com>.
Thanks Patrick, great to hear that docs-snapshots-via-jenkins is already
JIRA'd; you can interpret some of this thread as a gigantic +1 from me on
prioritizing that, which it looks like you are doing :)

I do understand the limitations of the "github vs. official site" status
quo; I was mostly responding to a perceived implication that I should have
been getting advice on building and testing Spark from the GitHub .md files
instead of from /latest. I agree that neither one works very well
currently, and that docs-snapshots-via-jenkins is the right solution. Per
my other email, leaving /latest as-is sounds reasonable, as long as jenkins
is putting the latest docs *somewhere*.

On Sun Nov 30 2014 at 7:19:33 PM Patrick Wendell <pw...@gmail.com> wrote:

> Btw - the documentation on github represents the source code of our
> docs, which is versioned with each release. Unfortunately github will
> always try to render ".md" files so it could look to a passerby like
> this is supposed to represent published docs. This is a feature
> limitation of github, AFAIK we cannot disable it.
>
> The official published docs are associated with each release and
> available on the apache.org website. I think "/latest" is a common
> convention for referring to the latest *published release* docs, so
> probably we can't change that (the audience for /latest is orders of
> magnitude larger than for snapshot docs). However we could just add
> /snapshot and publish docs there.
>
> - Patrick
>
> On Sun, Nov 30, 2014 at 6:15 PM, Patrick Wendell <pw...@gmail.com>
> wrote:
> > Hey Ryan,
> >
> > The existing JIRA also covers publishing nightly docs:
> > https://issues.apache.org/jira/browse/SPARK-1517
> >
> > - Patrick
> >
> > On Sun, Nov 30, 2014 at 5:53 PM, Ryan Williams
> > <ry...@gmail.com> wrote:
> >> Thanks Nicholas, glad to hear that some of this info will be pushed to
> the
> >> main site soon, but this brings up yet another point of confusion that
> I've
> >> struggled with, namely whether the documentation on github or that on
> >> spark.apache.org should be considered the primary reference for people
> >> seeking to learn about best practices for developing Spark.
> >>
> >> Trying to read docs starting from
> >> https://github.com/apache/spark/blob/master/docs/index.md right now, I
> find
> >> that all of the links to other parts of the documentation are broken:
> they
> >> point to relative paths that end in ".html", which will work when
> published
> >> on the docs-site, but that would have to end in ".md" if a person was
> to be
> >> able to navigate them on github.
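The mismatch is mechanical enough that it could be bridged either way; as an illustration only (not an existing tool), here is a filter that rewrites relative `.html` links in the markdown source to `.md` for on-GitHub navigation, leaving absolute URLs alone:

```shell
# Rewrite relative markdown links ending in .html (with an optional #anchor)
# to .md; absolute URLs contain ':' and are deliberately left untouched.
rewrite_links() {
  sed -E 's/\(([^)(:]+)\.html(#[^)]*)?\)/(\1.md\2)/g'
}

printf '%s\n' \
  'See [Building Spark](building-spark.html#building-with-sbt).' \
  '[Docs site](http://spark.apache.org/docs/latest/index.html) is absolute.' \
  | rewrite_links
# prints:
# See [Building Spark](building-spark.md#building-with-sbt).
# [Docs site](http://spark.apache.org/docs/latest/index.html) is absolute.
```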
> >>
> >> So expecting people to use the up-to-date docs on github (where all
> >> internal URLs 404 and the main github README suggests that the "latest
> >> Spark documentation" can be found on the actually-months-old docs-site
> >> <https://github.com/apache/spark#online-documentation>) is not a good
> >> solution. On the other hand, consulting months-old docs on the site is
> also
> >> problematic, as this thread and your last email have borne out.  The
> result
> >> is that there is no good place on the internet to learn about the most
> >> up-to-date best practices for using/developing Spark.
> >>
> >> Why not build http://spark.apache.org/docs/latest/ nightly (or every
> >> commit) off of what's in github, rather than having that URL point to
> the
> >> last release's docs (up to ~3 months old)? This way, casual users who
> want
> >> the docs for the released version they happen to be using (which is
> already
> >> frequently != "/latest" today, for many Spark users) can (still) find
> them
> >> at http://spark.apache.org/docs/X.Y.Z, and the github README can safely
> >> point people to a site (/latest) that actually has up-to-date docs that
> >> reflect ToT and whose links work.
> >>
> >> If there are concerns about existing semantics around "/latest" URLs
> being
> >> broken, some new URL could be used, like
> >> http://spark.apache.org/docs/snapshot/, but given that everything under
> >> http://spark.apache.org/docs/latest/ is in a state of
> >> planned-backwards-incompatible-changes every ~3mos, that doesn't sound
> like
> >> that serious an issue to me; anyone sending around permanent links to
> >> things under /latest is already going to have those links break / not
> make
> >> sense in the near future.
> >>
> >>
> >> On Sun Nov 30 2014 at 5:24:33 PM Nicholas Chammas <
> >> nicholas.chammas@gmail.com> wrote:
> >>
> >>>
> >>>    - currently the docs only contain information about building with
> >>>    maven,
> >>>    and even then don't cover many important cases
> >>>
> >>>  All other points aside, I just want to point out that the docs
> document
> >>> both how to use Maven and SBT and clearly state
> >>> <https://github.com/apache/spark/blob/master/docs/building-spark.md#building-with-sbt>
> >>> that Maven is the "build of reference" while SBT may be preferable for
> >>> day-to-day development.
> >>>
> >>> I believe the main reason most people miss this documentation is that,
> >>> though it's up-to-date on GitHub, it hasn't been published yet to the
> docs
> >>> site. It should go out with the 1.2 release.
> >>>
> >>> Improvements to the documentation on building Spark belong here:
> >>> https://github.com/apache/spark/blob/master/docs/building-spark.md
> >>>
> >>> If there are clear recommendations that come out of this thread but are
> >>> not in that doc, they should be added in there. Other, less important
> >>> details may possibly be better suited for the Contributing to Spark
> >>> <https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark>
> >>> guide.
> >>>
> >>> Nick
> >>>
> >>>
> >>> On Sun Nov 30 2014 at 6:50:55 PM Patrick Wendell <pw...@gmail.com>
> >>> wrote:
> >>>
> >>>> Hey Ryan,
> >>>>
> >>>> A few more things here. You should feel free to send patches to
> >>>> Jenkins to test them, since this is the reference environment in which
> >>>> we regularly run tests. This is the normal workflow for most
> >>>> developers and we spend a lot of effort provisioning/maintaining a
> >>>> very large jenkins cluster to allow developers access this resource. A
> >>>> common development approach is to locally run tests that you've added
> >>>> in a patch, then send it to jenkins for the full run, and then try to
> >>>> debug locally if you see specific unanticipated test failures.
> >>>>
> >>>> - Patrick
> many
> >>>> times
> >>>> >> > over the past weeks/months, plus or minus other failures that
> have
> >>>> come
> >>>> >> and
> >>>> >> > gone, and was especially befuddled by them when I was seeing a
> >>>> disjoint
> >>>> >> set
> >>>> >> > of reproducible failures on my mac [19] (the triaging of which
> >>>> involved
> >>>> >> > dozens of runs of `./dev/run-tests`).
> >>>> >> >
> >>>> >> > While I'm interested in digging into each of these issues, I also
> >>>> want to
> >>>> >> > discuss the frequency with which I've run into issues like these.
> >>>> This is
> >>>> >> > unfortunately not the first time in recent months that I've spent
> >>>> days
> >>>> >> > playing spurious-test-failure whack-a-mole with a 60-90min
> >>>> dev/run-tests
> >>>> >> > iteration time, which is no fun! So I am wondering/thinking:
> >>>> >> >
> >>>> >> > - Do other people experience this level of flakiness from spark
> >>>> tests?
> >>>> >> > - Do other people bother running dev/run-tests locally, or just
> let
> >>>> >> Jenkins
> >>>> >> > do it during the CR process?
> >>>> >> > - Needing to run a full assembly post-clean just to continue
> running
> >>>> one
> >>>> >> > specific test case feels especially wasteful, and the failure
> output
> >>>> when
> >>>> >> > naively attempting to run a specific test without having built an
> >>>> >> assembly
> >>>> >> > jar is not always clear about what the issue is or how to fix it;
> >>>> even
> >>>> >> the
> >>>> >> > fact that certain tests require "building the world" is not
> >>>> something I
> >>>> >> > would have expected, and has cost me hours of confusion.
> >>>> >> >    - Should a person running spark tests assume that they must
> build
> >>>> an
> >>>> >> > assembly JAR before running anything?
> >>>> >> >    - Are there some proper "unit" tests that are actually
> >>>> self-contained
> >>>> >> /
> >>>> >> > able to be run without building an assembly jar?
> >>>> >> >    - Can we better document/demarcate which tests have which
> >>>> >> dependencies?
> >>>> >> >    - Is there something finer-grained than building an assembly
> JAR
> >>>> that
> >>>> >> > is sufficient in some cases?
> >>>> >> >        - If so, can we document that?
> >>>> >> >        - If not, can we move to a world of finer-grained
> >>>> dependencies for
> >>>> >> > some of these?
> >>>> >> > - Leaving all of these spurious failures aside, the process of
> >>>> assembling
> >>>> >> > and testing a new JAR is not a quick one (40 and 60 mins for me
> >>>> >> typically,
> >>>> >> > respectively). I would guess that there are dozens (hundreds?) of
> >>>> people
> >>>> >> > who build a Spark assembly from various ToTs on any given day,
> and
> >>>> who
> >>>> >> all
> >>>> >> > wait on the exact same compilation / assembly steps to occur.
> >>>> Expanding
> >>>> >> on
> >>>> >> > the recent work to publish nightly snapshots [20], can we do a
> >>>> better job
> >>>> >> > caching/sharing compilation artifacts at a more granular level
> >>>> (pre-built
> >>>> >> > assembly JARs at each SHA? pre-built JARs per-maven-module,
> per-SHA?
> >>>> more
> >>>> >> > granular maven modules, plus the previous two?), or otherwise
> save
> >>>> some
> >>>> >> of
> >>>> >> > the considerable amount of redundant compilation work that I had
> to
> >>>> do
> >>>> >> over
> >>>> >> > the course of my odyssey this weekend?
> >>>> >> >
> >>>> >> > Ramping up on most projects involves some amount of
> supplementing the
> >>>> >> > documentation with trial and error to figure out what to run,
> which
> >>>> >> > "errors" are real errors and which can be ignored, etc., but
> >>>> navigating
> >>>> >> > that minefield on Spark has proved especially challenging and
> >>>> >> > time-consuming for me. Some of that comes directly from scala's
> >>>> >> relatively
> >>>> >> > slow compilation times and immature build-tooling ecosystem, but
> >>>> that is
> >>>> >> > the world we live in and it would be nice if Spark took the
> >>>> alleviation
> >>>> >> of
> >>>> >> > the resulting pain more seriously, as one of the more
> interesting and
> >>>> >> > well-known large scala projects around right now. The official
> >>>> >> > documentation around how to build different subsets of the
> codebase
> >>>> is
> >>>> >> > somewhat sparse [21], and there have been many mixed [22]
> accounts
> >>>> [23]
> >>>> >> on
> >>>> >> > this mailing list about preferred ways to build on mvn vs. sbt
> (none
> >>>> of
> >>>> >> > which has made it into official documentation, as far as I've
> seen).
> >>>> >> > Expecting new contributors to piece together all of this received
> >>>> >> > folk-wisdom about how to build/test in a sane way by trawling
> mailing
> >>>> >> list
> >>>> >> > archives seems suboptimal.
> >>>> >> >
> >>>> >> > Thanks for reading, looking forward to hearing your ideas!
> >>>> >> >
> >>>> >> > -Ryan
> >>>> >> >
> >>>> >> > P.S. Is "best practice" for emailing this list to not
> incorporate any
> >>>> >> HTML
> >>>> >> > in the body? It seems like all of the archives I've seen strip it
> >>>> out,
> >>>> >> but
> >>>> >> > other people have used it and gmail displays it.
> >>>> >> >
> >>>> >> >
> >>>> >> > [1]
> >>>> >> > https://gist.githubusercontent.com/ryan-williams/
> >>>> 8a162367c4dc157d2479/
> >>>> >> raw/484c2fb8bc0efa0e39d142087eefa9c3d5292ea3/dev%20run-tests:%
> 20fail
> >>>> >> > (57 mins)
> >>>> >> > [2]
> >>>> >> > https://gist.githubusercontent.com/ryan-williams/
> >>>> 8a162367c4dc157d2479/
> >>>> >> raw/ce264e469be3641f061eabd10beb1d71ac243991/mvn%20test:%20fail
> >>>> >> > (6 mins)
> >>>> >> > [3]
> >>>> >> > https://gist.githubusercontent.com/ryan-williams/
> >>>> 8a162367c4dc157d2479/
> >>>> >> raw/6bc76c67aeef9c57ddd9fb2ba260fb4189dbb927/mvn%20test%20case:%
> >>>> >> 20pass%20test,%20fail%20subsequent%20compile
> >>>> >> > (4 mins)
> >>>> >> > [4]
> >>>> >> > https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&
> >>>> >> cd=2&ved=0CCUQFjAB&url=http%3A%2F%2Fapache-spark-user-
> >>>> >> list.1001560.n3.nabble.com%2Fscalac-crash-when-compiling-
> >>>> >> DataTypeConversions-scala-td17083.html&ei=aRF6VJrpNKr-
> >>>> >> iAKDgYGYBQ&usg=AFQjCNHjM9m__Hrumh-ecOsSE00-JkjKBQ&sig2=
> >>>> >> zDeSqOgs02AXJXj78w5I9g&bvm=bv.80642063,d.cGE&cad=rja
> >>>> >> > [5]
> >>>> >> > https://gist.githubusercontent.com/ryan-williams/
> >>>> 8a162367c4dc157d2479/
> >>>> >> raw/4ab0bd6e76d9fc5745eb4b45cdf13195d10efaa2/mvn%20test,%20post%
> >>>> >> 20clean,%20need%20dependencies%20built
> >>>> >> > [6]
> >>>> >> > https://gist.githubusercontent.com/ryan-williams/
> >>>> 8a162367c4dc157d2479/
> >>>> >> raw/f4c7e6fc8c301f869b00598c7b541dac243fb51e/dev%20run-tests,%
> >>>> >> 20post%20clean
> >>>> >> > (50 mins)
> >>>> >> > [7]
> >>>> >> > https://gist.github.com/ryan-williams/57f8bfc9328447fc5b97#
> >>>> >> file-dev-run-tests-failure-too-many-files-open-then-hang-L5260
> >>>> >> > (1hr)
> >>>> >> > [8] https://gist.github.com/ryan-williams/d0164194ad5de03f6e3f
> (1hr)
> >>>> >> > [9] https://issues.apache.org/jira/browse/SPARK-3867
> >>>> >> > [10] https://gist.github.com/ryan-williams/735adf543124c99647cc
> >>>> >> > [11] https://gist.github.com/ryan-williams/8d149bbcd0c6689ad564
> >>>> >> > [12]
> >>>> >> > https://gist.github.com/ryan-williams/07df5c583c9481fe1c14#
> >>>> >> file-gistfile1-txt-L853
> >>>> >> > (~90 mins)
> >>>> >> > [13]
> >>>> >> > https://gist.github.com/ryan-williams/718f6324af358819b496#
> >>>> >> file-gistfile1-txt-L852
> >>>> >> > (91 mins)
> >>>> >> > [14]
> >>>> >> > https://gist.github.com/ryan-williams/c06c1f4aa0b16f160965#
> >>>> >> file-gistfile1-txt-L854
> >>>> >> > [15] https://gist.github.com/ryan-williams/f8d410b5b9f082039c73
> >>>> >> > [16] https://gist.github.com/ryan-williams/2e94f55c9287938cf745
> >>>> >> > [17]
> >>>> >> > http://apache-spark-user-list.1001560.n3.nabble.com/quot-
> >>>> >> Too-many-open-files-quot-exception-on-reduceByKey-td2462.html
> >>>> >> > [18]
> >>>> >> > http://stackoverflow.com/questions/25707629/why-does-
> >>>> >> spark-job-fail-with-too-many-open-files
> >>>> >> > [19] https://issues.apache.org/jira/browse/SPARK-4002
> >>>> >> > [20] https://issues.apache.org/jira/browse/SPARK-4542
> >>>> >> > [21]
> >>>> >> > https://spark.apache.org/docs/latest/building-with-maven.
> >>>> >> html#spark-tests-in-maven
> >>>> >> > [22] https://www.mail-archive.com/dev@spark.apache.org/msg06443.
> html
> >>>> >> > [23]
> >>>> >> > http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%
> >>>> >> 3CCAOhmDzeUNhuCr41B7KRPTEwMn4cga_2TNpZrWqQB8REekokxzg@mail.
> gmail.com
> >>>> %3E
> >>>> >>
> >>>> >>
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >>>> For additional commands, e-mail: dev-help@spark.apache.org
> >>>>
> >>>>
>
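The ulimit fix mentioned in the quoted message above can be applied per-shell before a test run; a minimal sketch (8192 is the value that worked in those runs, not a universal floor):

```shell
# Inspect and raise the max-open-files limit for the current shell session.
ulimit -n          # show the current soft limit (4864 in the runs above)
ulimit -n 8192     # raise it; cannot exceed the hard limit (see: ulimit -Hn)
ulimit -n          # verify the new limit took effect
```

Making this persist across sessions requires setting it in the shell profile or (on Linux) in /etc/security/limits.conf, which varies by OS.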

Re: Spurious test failures, testing best practices

Posted by Patrick Wendell <pw...@gmail.com>.
Btw - the documentation on github represents the source code of our
docs, which is versioned with each release. Unfortunately github will
always try to render ".md" files so it could look to a passerby like
this is supposed to represent published docs. This is a feature
limitation of github, AFAIK we cannot disable it.

The official published docs are associated with each release and
available on the apache.org website. I think "/latest" is a common
convention for referring to the latest *published release* docs, so
probably we can't change that (the audience for /latest is orders of
magnitude larger than for snapshot docs). However we could just add
/snapshot and publish docs there.

- Patrick

On Sun, Nov 30, 2014 at 6:15 PM, Patrick Wendell <pw...@gmail.com> wrote:
> Hey Ryan,
>
> The existing JIRA also covers publishing nightly docs:
> https://issues.apache.org/jira/browse/SPARK-1517
>
> - Patrick
>
> On Sun, Nov 30, 2014 at 5:53 PM, Ryan Williams
> <ry...@gmail.com> wrote:
>> Thanks Nicholas, glad to hear that some of this info will be pushed to the
>> main site soon, but this brings up yet another point of confusion that I've
>> struggled with, namely whether the documentation on github or that on
>> spark.apache.org should be considered the primary reference for people
>> seeking to learn about best practices for developing Spark.
>>
>> Trying to read docs starting from
>> https://github.com/apache/spark/blob/master/docs/index.md right now, I find
>> that all of the links to other parts of the documentation are broken: they
>> point to relative paths that end in ".html", which will work when published
>> on the docs-site, but that would have to end in ".md" if a person was to be
>> able to navigate them on github.
>>
>> So expecting people to use the up-to-date docs on github (where all
>> internal URLs 404 and the main github README suggests that the "latest
>> Spark documentation" can be found on the actually-months-old docs-site
>> <https://github.com/apache/spark#online-documentation>) is not a good
>> solution. On the other hand, consulting months-old docs on the site is also
>> problematic, as this thread and your last email have borne out.  The result
>> is that there is no good place on the internet to learn about the most
>> up-to-date best practices for using/developing Spark.
>>
>> Why not build http://spark.apache.org/docs/latest/ nightly (or every
>> commit) off of what's in github, rather than having that URL point to the
>> last release's docs (up to ~3 months old)? This way, casual users who want
>> the docs for the released version they happen to be using (which is already
>> frequently != "/latest" today, for many Spark users) can (still) find them
>> at http://spark.apache.org/docs/X.Y.Z, and the github README can safely
>> point people to a site (/latest) that actually has up-to-date docs that
>> reflect ToT and whose links work.
>>
>> If there are concerns about existing semantics around "/latest" URLs being
>> broken, some new URL could be used, like
>> http://spark.apache.org/docs/snapshot/, but given that everything under
>> http://spark.apache.org/docs/latest/ is in a state of
>> planned-backwards-incompatible-changes every ~3mos, that doesn't sound like
>> that serious an issue to me; anyone sending around permanent links to
>> things under /latest is already going to have those links break / not make
>> sense in the near future.
>>
>>
>> On Sun Nov 30 2014 at 5:24:33 PM Nicholas Chammas <
>> nicholas.chammas@gmail.com> wrote:
>>
>>>
>>>    - currently the docs only contain information about building with
>>>    maven, and even then don't cover many important cases
>>>
>>>  All other points aside, I just want to point out that the docs document
>>> both how to use Maven and SBT and clearly state
>>> <https://github.com/apache/spark/blob/master/docs/building-spark.md#building-with-sbt>
>>> that Maven is the "build of reference" while SBT may be preferable for
>>> day-to-day development.
>>>
>>> I believe the main reason most people miss this documentation is that,
>>> though it's up-to-date on GitHub, it hasn't been published yet to the docs
>>> site. It should go out with the 1.2 release.
>>>
>>> Improvements to the documentation on building Spark belong here:
>>> https://github.com/apache/spark/blob/master/docs/building-spark.md
>>>
>>> If there are clear recommendations that come out of this thread but are
>>> not in that doc, they should be added in there. Other, less important
>>> details may possibly be better suited for the Contributing to Spark
>>> <https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark>
>>> guide.
>>>
>>> Nick
>>>
>>>
>>> On Sun Nov 30 2014 at 6:50:55 PM Patrick Wendell <pw...@gmail.com>
>>> wrote:
>>>
>>>> Hey Ryan,
>>>>
>>>> A few more things here. You should feel free to send patches to
>>>> Jenkins to test them, since this is the reference environment in which
>>>> we regularly run tests. This is the normal workflow for most
>>>> developers and we spend a lot of effort provisioning/maintaining a
>>>> very large jenkins cluster to allow developers access to this resource. A
>>>> common development approach is to locally run tests that you've added
>>>> in a patch, then send it to jenkins for the full run, and then try to
>>>> debug locally if you see specific unanticipated test failures.
>>>>
>>>> One challenge we have is that given the proliferation of OS versions,
>>>> Java versions, Python versions, ulimits, etc. there is a combinatorial
>>>> number of environments in which tests could be run. It is very hard in
>>>> some cases to figure out post-hoc why a given test is not working in a
>>>> specific environment. I think a good solution here would be to use a
>>>> standardized docker container for running Spark tests and asking folks
>>>> to use that locally if they are trying to run all of the hundreds of
>>>> Spark tests.
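A sketch of what such a standardized container workflow could look like. Everything here is hypothetical: the image name, base OS, and packages are illustrative, not an existing Spark artifact.

```shell
# Hypothetical: build a pinned test environment once (base image and
# package names are placeholders for whatever the reference env needs).
docker build -t spark-test-env - <<'EOF'
FROM centos:6
RUN yum install -y java-1.7.0-openjdk-devel git
EOF

# ...then run the full suite inside it, with the source tree mounted and
# the open-files limit raised to match the reference environment.
docker run --rm -v "$PWD":/spark -w /spark \
  --ulimit nofile=8192:8192 spark-test-env ./dev/run-tests
```

The point is that OS, JDK, Python, and ulimits are all fixed by the image, so "works on Jenkins but not on my machine" failures become reproducible.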
>>>>
>>>> Another solution would be to mock out every system interaction in
>>>> Spark's tests including e.g. filesystem interactions to try and reduce
>>>> variance across environments. However, that seems difficult.
>>>>
>>>> As the number of developers of Spark increases, it's definitely a good
>>>> idea for us to invest in developer infrastructure including things
>>>> like snapshot releases, better documentation, etc. Thanks for bringing
>>>> this up as a pain point.
>>>>
>>>> - Patrick
>>>>
>>>>
>>>> On Sun, Nov 30, 2014 at 3:35 PM, Ryan Williams
>>>> <ry...@gmail.com> wrote:
>>>> > thanks for the info, Matei and Brennon. I will try to switch my
>>>> > workflow to using sbt. Other potential action items:
>>>> >
>>>> > - currently the docs only contain information about building with
>>>> > maven, and even then don't cover many important cases, as I described
>>>> > in my previous email. If SBT is as much better as you've described then
>>>> > that should be made much more obvious. Wasn't it the case recently that
>>>> > there was only a page about building with SBT, and not one about
>>>> > building with maven? Clearer messaging around this needs to exist in
>>>> > the documentation, not just on the mailing list, imho.
>>>> >
>>>> > - +1 to better distinguishing between unit and integration tests,
>>>> > having separate scripts for each, improving documentation around common
>>>> > workflows, expectations of brittleness with each kind of test,
>>>> > advisability of just relying on Jenkins for certain kinds of tests to
>>>> > not waste too much time, etc. Things like the compiler crash should be
>>>> > discussed in the documentation, not just in the mailing list archives,
>>>> > if new contributors are likely to run into them through no fault of
>>>> > their own.
>>>> >
>>>> > - What is the algorithm you use to decide what tests you might have
>>>> > broken? Can we codify it in some scripts that other people can use?
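One rough way to codify "what might I have broken?" as a script. This is a sketch only: the module list and diff base are illustrative, and real cross-module dependency handling would need more care than maven's -am flag alone provides.

```shell
# List top-level directories touched relative to the base branch, keep the
# ones that are known maven modules, and run only those modules' tests.
# The module whitelist below is illustrative, not exhaustive.
modules=$(git diff --name-only origin/master 2>/dev/null \
  | cut -d/ -f1 | sort -u \
  | grep -E '^(core|sql|streaming|mllib|graphx|yarn)$' \
  | tr '\n' ',' | sed 's/,$//')
if [ -n "$modules" ]; then
  # -am also builds (and tests) the modules these depend on
  mvn -pl "$modules" -am test
else
  echo "no module-mapped changes detected"
fi
```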
>>>> >
>>>> >
>>>> >
>>>> > On Sun Nov 30 2014 at 4:06:41 PM Matei Zaharia
>>>> > <matei.zaharia@gmail.com> wrote:
>>>> >
>>>> >> Hi Ryan,
>>>> >>
>>>> >> As a tip (and maybe this isn't documented well), I normally use SBT
>>>> >> for development to avoid the slow build process, and use its
>>>> >> interactive console to run only specific tests. The nice advantage is
>>>> >> that SBT can keep the Scala compiler loaded and JITed across builds,
>>>> >> making it faster to iterate. To use it, you can do the following:
>>>> >>
>>>> >> - Start the SBT interactive console with sbt/sbt
>>>> >> - Build your assembly by running the "assembly" target in the assembly
>>>> >> project: assembly/assembly
>>>> >> - Run all the tests in one module: core/test
>>>> >> - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite
>>>> >> (this also supports tab completion)
>>>> >>
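Spelled out as a terminal session, the SBT workflow Matei describes looks roughly like this (sbt/sbt was the launcher script in the Spark tree at the time; target names may differ across versions):

```shell
# Launch the interactive console once and keep it warm across runs;
# the JVM and Scala compiler stay loaded between commands.
sbt/sbt

# Then, at the sbt prompt:
#   assembly/assembly                              # build the assembly
#   core/test                                      # all tests in the core module
#   core/test-only org.apache.spark.rdd.RDDSuite   # one suite (tab-completes)
```

The same targets can be run non-interactively (e.g. `sbt/sbt "core/test-only org.apache.spark.rdd.RDDSuite"`), but each invocation then pays the JVM startup and compiler-warmup cost the console avoids.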
>>>> >> Running all the tests does take a while, and I usually just rely on
>>>> >> Jenkins for that once I've run the tests for the things I believed my
>>>> >> patch could break. But this is because some of them are integration
>>>> >> tests (e.g. DistributedSuite, which creates multi-process
>>>> >> mini-clusters). Many of the individual suites run fast without
>>>> >> requiring this, however, so you can pick the ones you want. Perhaps we
>>>> >> should find a way to tag them so people can do a "quick-test" that
>>>> >> skips the integration ones.
>>>> >>
>>>> >> The assembly builds are annoying but they only take about a minute for
>>>> >> me on a MacBook Pro with SBT warmed up. The assembly is actually only
>>>> >> required for some of the "integration" tests (which launch new
>>>> >> processes), but I'd recommend doing it all the time anyway since it
>>>> >> would be very confusing to run those with an old assembly. The Scala
>>>> >> compiler crash issue can also be a problem, but I don't see it very
>>>> >> often with SBT. If it happens, I exit SBT and do sbt clean.
>>>> >>
>>>> >> Anyway, this is useful feedback and I think we should try to improve
>>>> >> some of these suites, but hopefully you can also try the faster SBT
>>>> >> process. At the end of the day, if we want integration tests, the
>>>> >> whole test process will take an hour, but most of the developers I
>>>> >> know leave that to Jenkins and only run individual tests locally
>>>> >> before submitting a patch.
>>>> >>
>>>> >> Matei
>>>> >>
>>>> >> file-gistfile1-txt-L854
>>>> >> > [15] https://gist.github.com/ryan-williams/f8d410b5b9f082039c73
>>>> >> > [16] https://gist.github.com/ryan-williams/2e94f55c9287938cf745
>>>> >> > [17]
>>>> >> > http://apache-spark-user-list.1001560.n3.nabble.com/quot-
>>>> >> Too-many-open-files-quot-exception-on-reduceByKey-td2462.html
>>>> >> > [18]
>>>> >> > http://stackoverflow.com/questions/25707629/why-does-
>>>> >> spark-job-fail-with-too-many-open-files
>>>> >> > [19] https://issues.apache.org/jira/browse/SPARK-4002
>>>> >> > [20] https://issues.apache.org/jira/browse/SPARK-4542
>>>> >> > [21]
>>>> >> > https://spark.apache.org/docs/latest/building-with-maven.
>>>> >> html#spark-tests-in-maven
>>>> >> > [22] https://www.mail-archive.com/dev@spark.apache.org/msg06443.html
>>>> >> > [23]
>>>> >> > http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%
>>>> >> 3CCAOhmDzeUNhuCr41B7KRPTEwMn4cga_2TNpZrWqQB8REekokxzg@mail.gmail.com
>>>> %3E
>>>> >>
>>>> >>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>
>>>>



Re: Spurious test failures, testing best practices

Posted by Patrick Wendell <pw...@gmail.com>.
Hey Ryan,

The existing JIRA also covers publishing nightly docs:
https://issues.apache.org/jira/browse/SPARK-1517

- Patrick

On Sun, Nov 30, 2014 at 5:53 PM, Ryan Williams
<ry...@gmail.com> wrote:
> Thanks Nicholas, glad to hear that some of this info will be pushed to the
> main site soon, but this brings up yet another point of confusion that I've
> struggled with, namely whether the documentation on github or that on
> spark.apache.org should be considered the primary reference for people
> seeking to learn about best practices for developing Spark.
>
> Trying to read docs starting from
> https://github.com/apache/spark/blob/master/docs/index.md right now, I find
> that all of the links to other parts of the documentation are broken: they
> point to relative paths that end in ".html", which will work when published
> on the docs-site, but that would have to end in ".md" if a person was to be
> able to navigate them on github.
>
> So expecting people to use the up-to-date docs on github (where all
> internal URLs 404 and the main github README suggests that the "latest
> Spark documentation" can be found on the actually-months-old docs-site
> <https://github.com/apache/spark#online-documentation>) is not a good
> solution. On the other hand, consulting months-old docs on the site is also
> problematic, as this thread and your last email have borne out.  The result
> is that there is no good place on the internet to learn about the most
> up-to-date best practices for using/developing Spark.
>
> Why not build http://spark.apache.org/docs/latest/ nightly (or every
> commit) off of what's in github, rather than having that URL point to the
> last release's docs (up to ~3 months old)? This way, casual users who want
> the docs for the released version they happen to be using (which is already
> frequently != "/latest" today, for many Spark users) can (still) find them
> at http://spark.apache.org/docs/X.Y.Z, and the github README can safely
> point people to a site (/latest) that actually has up-to-date docs that
> reflect ToT and whose links work.
>
> If there are concerns about existing semantics around "/latest" URLs being
> broken, some new URL could be used, like
> http://spark.apache.org/docs/snapshot/, but given that everything under
> http://spark.apache.org/docs/latest/ is in a state of
> planned-backwards-incompatible-changes every ~3mos, that doesn't sound like
> that serious an issue to me; anyone sending around permanent links to
> things under /latest is already going to have those links break / not make
> sense in the near future.
>
>
> On Sun Nov 30 2014 at 5:24:33 PM Nicholas Chammas <
> nicholas.chammas@gmail.com> wrote:
>
>>
>>    - currently the docs only contain information about building with
>>    maven,
>>    and even then don't cover many important cases
>>
>>  All other points aside, I just want to point out that the docs document
>> both how to use Maven and SBT and clearly state
>> <https://github.com/apache/spark/blob/master/docs/building-spark.md#building-with-sbt>
>> that Maven is the "build of reference" while SBT may be preferable for
>> day-to-day development.
>>
>> I believe the main reason most people miss this documentation is that,
>> though it's up-to-date on GitHub, it hasn't been published yet to the docs
>> site. It should go out with the 1.2 release.
>>
>> Improvements to the documentation on building Spark belong here:
>> https://github.com/apache/spark/blob/master/docs/building-spark.md
>>
>> If there are clear recommendations that come out of this thread but are
>> not in that doc, they should be added in there. Other, less important
>> details may possibly be better suited for the Contributing to Spark
>> <https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark>
>> guide.
>>
>> Nick
>>
>>
>> On Sun Nov 30 2014 at 6:50:55 PM Patrick Wendell <pw...@gmail.com>
>> wrote:
>>
>>> Hey Ryan,
>>>
>>> A few more things here. You should feel free to send patches to
>>> Jenkins to test them, since this is the reference environment in which
>>> we regularly run tests. This is the normal workflow for most
>>> developers and we spend a lot of effort provisioning/maintaining a
>>> very large jenkins cluster to allow developers access to this resource. A
>>> common development approach is to locally run tests that you've added
>>> in a patch, then send it to jenkins for the full run, and then try to
>>> debug locally if you see specific unanticipated test failures.
>>>
>>> One challenge we have is that given the proliferation of OS versions,
>>> Java versions, Python versions, ulimits, etc. there is a combinatorial
>>> number of environments in which tests could be run. It is very hard in
>>> some cases to figure out post-hoc why a given test is not working in a
>>> specific environment. I think a good solution here would be to use a
>>> standardized docker container for running Spark tests and asking folks
>>> to use that locally if they are trying to run all of the hundreds of
>>> Spark tests.
>>>
>>> Another solution would be to mock out every system interaction in
>>> Spark's tests including e.g. filesystem interactions to try and reduce
>>> variance across environments. However, that seems difficult.
>>>
>>> As the number of developers of Spark increases, it's definitely a good
>>> idea for us to invest in developer infrastructure including things
>>> like snapshot releases, better documentation, etc. Thanks for bringing
>>> this up as a pain point.
>>>
>>> - Patrick
>>>
>>>
>>> On Sun, Nov 30, 2014 at 3:35 PM, Ryan Williams
>>> <ry...@gmail.com> wrote:
>>> > thanks for the info, Matei and Brennon. I will try to switch my
>>> workflow to
>>> > using sbt. Other potential action items:
>>> >
>>> > - currently the docs only contain information about building with maven,
>>> > and even then don't cover many important cases, as I described in my
>>> > previous email. If SBT is as much better as you've described then that
>>> > should be made much more obvious. Wasn't it the case recently that there
>>> > was only a page about building with SBT, and not one about building with
>>> > maven? Clearer messaging around this needs to exist in the
>>> documentation,
>>> > not just on the mailing list, imho.
>>> >
>>> > - +1 to better distinguishing between unit and integration tests, having
>>> > separate scripts for each, improving documentation around common
>>> workflows,
>>> > expectations of brittleness with each kind of test, advisability of just
>>> > relying on Jenkins for certain kinds of tests to not waste too much
>>> time,
>>> > etc. Things like the compiler crash should be discussed in the
>>> > documentation, not just in the mailing list archives, if new
>>> contributors
>>> > are likely to run into them through no fault of their own.
>>> >
>>> > - What is the algorithm you use to decide what tests you might have
>>> broken?
>>> > Can we codify it in some scripts that other people can use?
>>> >
>>> >
>>> >
>>> > On Sun Nov 30 2014 at 4:06:41 PM Matei Zaharia <matei.zaharia@gmail.com
>>> >
>>> > wrote:
>>> >
>>> >> Hi Ryan,
>>> >>
>>> >> As a tip (and maybe this isn't documented well), I normally use SBT for
>>> >> development to avoid the slow build process, and use its interactive
>>> >> console to run only specific tests. The nice advantage is that SBT can
>>> keep
>>> >> the Scala compiler loaded and JITed across builds, making it faster to
>>> >> iterate. To use it, you can do the following:
>>> >>
>>> >> - Start the SBT interactive console with sbt/sbt
>>> >> - Build your assembly by running the "assembly" target in the assembly
>>> >> project: assembly/assembly
>>> >> - Run all the tests in one module: core/test
>>> >> - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite
>>> (this
>>> >> also supports tab completion)
>>> >>
>>> >> Running all the tests does take a while, and I usually just rely on
>>> >> Jenkins for that once I've run the tests for the things I believed my
>>> patch
>>> >> could break. But this is because some of them are integration tests
>>> (e.g.
>>> >> DistributedSuite, which creates multi-process mini-clusters). Many of
>>> the
>>> >> individual suites run fast without requiring this, however, so you can
>>> pick
>>> >> the ones you want. Perhaps we should find a way to tag them so people
>>> can
>>> >> do a "quick-test" that skips the integration ones.
>>> >>
>>> >> The assembly builds are annoying but they only take about a minute for
>>> me
>>> >> on a MacBook Pro with SBT warmed up. The assembly is actually only
>>> required
>>> >> for some of the "integration" tests (which launch new processes), but
>>> I'd
>>> >> recommend doing it all the time anyway since it would be very
>>> confusing to
>>> >> run those with an old assembly. The Scala compiler crash issue can
>>> also be
>>> >> a problem, but I don't see it very often with SBT. If it happens, I
>>> exit
>>> >> SBT and do sbt clean.
>>> >>
>>> >> Anyway, this is useful feedback and I think we should try to improve
>>> some
>>> >> of these suites, but hopefully you can also try the faster SBT
>>> process. At
>>> >> the end of the day, if we want integration tests, the whole test
>>> process
>>> >> will take an hour, but most of the developers I know leave that to
>>> Jenkins
>>> >> and only run individual tests locally before submitting a patch.
>>> >>
>>> >> Matei
>>> >>
>>> >>
>>> >> > On Nov 30, 2014, at 2:39 PM, Ryan Williams <
>>> >> ryan.blake.williams@gmail.com> wrote:
>>> >> >
>>> >> > In the course of trying to make contributions to Spark, I have had a
>>> lot
>>> >> of
>>> >> > trouble running Spark's tests successfully. The main pain points I've
>>> >> > experienced are:
>>> >> >
>>> >> >    1) frequent, spurious test failures
>>> >> >    2) high latency of running tests
>>> >> >    3) difficulty running specific tests in an iterative fashion
>>> >> >
>>> >> > Here is an example series of failures that I encountered this weekend
>>> >> > (along with footnote links to the console output from each and
>>> >> > approximately how long each took):
>>> >> >
>>> >> > - `./dev/run-tests` [1]: failure in BroadcastSuite that I've not seen
>>> >> > before.
>>> >> > - `mvn '-Dsuites=*BroadcastSuite*' test` [2]: same failure.
>>> >> > - `mvn '-Dsuites=*BroadcastSuite* Unpersisting' test` [3]:
>>> BroadcastSuite
>>> >> > passed, but scala compiler crashed on the "catalyst" project.
>>> >> > - `mvn clean`: some attempts to run earlier commands (that previously
>>> >> > didn't crash the compiler) all result in the same compiler crash.
>>> >> Previous
>>> >> > discussion on this list implies this can only be solved by a `mvn
>>> clean`
>>> >> > [4].
>>> >> > - `mvn '-Dsuites=*BroadcastSuite*' test` [5]: immediately post-clean,
>>> >> > BroadcastSuite can't run because assembly is not built.
>>> >> > - `./dev/run-tests` again [6]: pyspark tests fail, some messages
>>> about
>>> >> > version mismatches and python 2.6. The machine this ran on has python
>>> >> 2.7,
>>> >> > so I don't know what that's about.
>>> >> > - `./dev/run-tests` again [7]: "too many open files" errors in
>>> several
>>> >> > tests. `ulimit -a` shows a maximum of 4864 open files. Apparently
>>> this is
>>> >> > not enough, but only some of the time? I increased it to 8192 and
>>> tried
>>> >> > again.
>>> >> > - `./dev/run-tests` again [8]: same pyspark errors as before. This
>>> seems
>>> >> to
>>> >> > be the issue from SPARK-3867 [9], which was supposedly fixed on
>>> October
>>> >> 14;
>>> >> > not sure how I'm seeing it now. In any case, switched to Python 2.6
>>> and
>>> >> > installed unittest2, and python/run-tests seems to be unblocked.
>>> >> > - `./dev/run-tests` again [10]: finally passes!
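The Python 2.6 / unittest2 workaround described above is detectable up front. A minimal, purely illustrative sketch of such a preflight check (the function name and message are invented, not part of python/run-tests):

```python
import sys

def check_test_python():
    """Sketch of a preflight check for python/run-tests: on Python 2.6 the
    backported unittest2 package is needed (per the SPARK-3867 discussion);
    on 2.7+ the stdlib unittest suffices."""
    if sys.version_info[:2] == (2, 6):
        try:
            import unittest2  # noqa: F401 -- only checking availability
        except ImportError:
            return "install unittest2 before running python/run-tests"
    return "ok"
```

Running something like this before kicking off the suite would turn an hour-late pyspark failure into an immediate, actionable message.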
>>> >> >
>>> >> > This was on a spark checkout at ceb6281 (ToT Friday), with a few
>>> trivial
>>> >> > changes added on (that I wanted to test before sending out a PR), on
>>> a
>>> >> > macbook running OSX Yosemite (10.10.1), java 1.8 and mvn 3.2.3 [11].
>>> >> >
>>> >> > Meanwhile, on a linux 2.6.32 / CentOS 6.4 machine, I tried similar
>>> >> commands
>>> >> > from the same repo state:
>>> >> >
>>> >> > - `./dev/run-tests` [12]: YarnClusterSuite failure.
>>> >> > - `./dev/run-tests` [13]: same YarnClusterSuite failure. I know I've
>>> seen
>>> >> > this one before on this machine and am guessing it actually occurs
>>> every
>>> >> > time.
>>> >> > - `./dev/run-tests` [14]: to be sure, I reverted my changes, ran one
>>> more
>>> >> > time from ceb6281, and saw the same failure.
>>> >> >
>>> >> > This was with java 1.7 and maven 3.2.3 [15]. In one final attempt to
>>> >> narrow
>>> >> > down the linux YarnClusterSuite failure, I ran `./dev/run-tests` on
>>> my
>>> >> mac,
>>> >> > from ceb6281, with java 1.7 (instead of 1.8, which the previous runs
>>> >> used),
>>> >> > and it passed [16], so the failure seems specific to my linux
>>> >> machine/arch.
>>> >> >
>>> >> > At this point I believe that my changes don't break any tests (the
>>> >> > YarnClusterSuite failure on my linux presumably not being... "real"),
>>> >> and I
>>> >> > am ready to send out a PR. Whew!
>>> >> >
>>> >> > However, reflecting on the 5 or 6 distinct failure-modes represented
>>> >> above:
>>> >> >
>>> >> > - One of them (too many files open), is something I can (and did,
>>> >> > hopefully) fix once and for all. It cost me an ~hour this time
>>> >> (approximate
>>> >> > time of running ./dev/run-tests) and a few hours other times when I
>>> >> didn't
>>> >> > fully understand/fix it. It doesn't happen deterministically (why?),
>>> but
>>> >> > does happen somewhat frequently to people, having been discussed on
>>> the
>>> >> > user list multiple times [17] and on SO [18]. Maybe some note in the
>>> >> > documentation advising people to check their ulimit makes sense?
>>> >> > - One of them (unittest2 must be installed for python 2.6) was
>>> supposedly
>>> >> > fixed upstream of the commits I tested here; I don't know why I'm
>>> still
>>> >> > running into it. This cost me a few hours of running
>>> `./dev/run-tests`
>>> >> > multiple times to see if it was transient, plus some time
>>> researching and
>>> >> > working around it.
>>> >> > - The original BroadcastSuite failure cost me a few hours and went
>>> away
>>> >> > before I'd even run `mvn clean`.
>>> >> > - A new incarnation of the sbt-compiler-crash phenomenon cost me a
>>> few
>>> >> > hours of running `./dev/run-tests` in different ways before deciding
>>> >> that,
>>> >> > as usual, there was no way around it and that I'd need to run `mvn
>>> clean`
>>> >> > and start running tests from scratch.
>>> >> > - The YarnClusterSuite failures on my linux box have cost me hours of
>>> >> > trying to figure out whether they're my fault. I've seen them many
>>> times
>>> >> > over the past weeks/months, plus or minus other failures that have
>>> come
>>> >> and
>>> >> > gone, and was especially befuddled by them when I was seeing a
>>> disjoint
>>> >> set
>>> >> > of reproducible failures on my mac [19] (the triaging of which
>>> involved
>>> >> > dozens of runs of `./dev/run-tests`).
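Of the failure modes above, the open-files one at least is checkable programmatically, which a doc note could point to. A minimal sketch using Python's stdlib resource module (the 8192 default is just the value that happened to work here, not a recommendation):

```python
import resource

def ensure_open_file_limit(minimum=8192):
    """Raise the soft RLIMIT_NOFILE toward `minimum` (capped at the hard
    limit) and return the resulting soft limit, so a test runner can warn
    before hitting 'too many open files' mid-suite."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft < minimum:
        # Cannot exceed the hard limit without privileges.
        target = minimum if hard == resource.RLIM_INFINITY else min(minimum, hard)
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        soft = target
    return soft
```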
>>> >> >
>>> >> > While I'm interested in digging into each of these issues, I also
>>> want to
>>> >> > discuss the frequency with which I've run into issues like these.
>>> This is
>>> >> > unfortunately not the first time in recent months that I've spent
>>> days
>>> >> > playing spurious-test-failure whack-a-mole with a 60-90min
>>> dev/run-tests
>>> >> > iteration time, which is no fun! So I am wondering/thinking:
>>> >> >
>>> >> > - Do other people experience this level of flakiness from spark
>>> tests?
>>> >> > - Do other people bother running dev/run-tests locally, or just let
>>> >> Jenkins
>>> >> > do it during the CR process?
>>> >> > - Needing to run a full assembly post-clean just to continue running
>>> one
>>> >> > specific test case feels especially wasteful, and the failure output
>>> when
>>> >> > naively attempting to run a specific test without having built an
>>> >> assembly
>>> >> > jar is not always clear about what the issue is or how to fix it;
>>> even
>>> >> the
>>> >> > fact that certain tests require "building the world" is not
>>> something I
>>> >> > would have expected, and has cost me hours of confusion.
>>> >> >    - Should a person running spark tests assume that they must build
>>> an
>>> >> > assembly JAR before running anything?
>>> >> >    - Are there some proper "unit" tests that are actually
>>> self-contained
>>> >> /
>>> >> > able to be run without building an assembly jar?
>>> >> >    - Can we better document/demarcate which tests have which
>>> >> dependencies?
>>> >> >    - Is there something finer-grained than building an assembly JAR
>>> that
>>> >> > is sufficient in some cases?
>>> >> >        - If so, can we document that?
>>> >> >        - If not, can we move to a world of finer-grained
>>> dependencies for
>>> >> > some of these?
>>> >> > - Leaving all of these spurious failures aside, the process of
>>> assembling
>>> >> > and testing a new JAR is not a quick one (40 and 60 mins for me
>>> >> typically,
>>> >> > respectively). I would guess that there are dozens (hundreds?) of
>>> people
>>> >> > who build a Spark assembly from various ToTs on any given day, and
>>> who
>>> >> all
>>> >> > wait on the exact same compilation / assembly steps to occur.
>>> Expanding
>>> >> on
>>> >> > the recent work to publish nightly snapshots [20], can we do a
>>> better job
>>> >> > caching/sharing compilation artifacts at a more granular level
>>> (pre-built
>>> >> > assembly JARs at each SHA? pre-built JARs per-maven-module, per-SHA?
>>> more
>>> >> > granular maven modules, plus the previous two?), or otherwise save
>>> some
>>> >> of
>>> >> > the considerable amount of redundant compilation work that I had to
>>> do
>>> >> over
>>> >> > the course of my odyssey this weekend?
>>> >> >
>>> >> > Ramping up on most projects involves some amount of supplementing the
>>> >> > documentation with trial and error to figure out what to run, which
>>> >> > "errors" are real errors and which can be ignored, etc., but
>>> navigating
>>> >> > that minefield on Spark has proved especially challenging and
>>> >> > time-consuming for me. Some of that comes directly from scala's
>>> >> relatively
>>> >> > slow compilation times and immature build-tooling ecosystem, but
>>> that is
>>> >> > the world we live in and it would be nice if Spark took the
>>> alleviation
>>> >> of
>>> >> > the resulting pain more seriously, as one of the more interesting and
>>> >> > well-known large scala projects around right now. The official
>>> >> > documentation around how to build different subsets of the codebase
>>> is
>>> >> > somewhat sparse [21], and there have been many mixed [22] accounts
>>> [23]
>>> >> on
>>> >> > this mailing list about preferred ways to build on mvn vs. sbt (none
>>> of
>>> >> > which has made it into official documentation, as far as I've seen).
>>> >> > Expecting new contributors to piece together all of this received
>>> >> > folk-wisdom about how to build/test in a sane way by trawling mailing
>>> >> list
>>> >> > archives seems suboptimal.
>>> >> >
>>> >> > Thanks for reading, looking forward to hearing your ideas!
>>> >> >
>>> >> > -Ryan
>>> >> >
>>> >> > P.S. Is "best practice" for emailing this list to not incorporate any
>>> >> HTML
>>> >> > in the body? It seems like all of the archives I've seen strip it
>>> out,
>>> >> but
>>> >> > other people have used it and gmail displays it.
>>> >> >
>>> >> >
>>> >> > [1]
>>> >> > https://gist.githubusercontent.com/ryan-williams/
>>> 8a162367c4dc157d2479/
>>> >> raw/484c2fb8bc0efa0e39d142087eefa9c3d5292ea3/dev%20run-tests:%20fail
>>> >> > (57 mins)
>>> >> > [2]
>>> >> > https://gist.githubusercontent.com/ryan-williams/
>>> 8a162367c4dc157d2479/
>>> >> raw/ce264e469be3641f061eabd10beb1d71ac243991/mvn%20test:%20fail
>>> >> > (6 mins)
>>> >> > [3]
>>> >> > https://gist.githubusercontent.com/ryan-williams/
>>> 8a162367c4dc157d2479/
>>> >> raw/6bc76c67aeef9c57ddd9fb2ba260fb4189dbb927/mvn%20test%20case:%
>>> >> 20pass%20test,%20fail%20subsequent%20compile
>>> >> > (4 mins)
>>> >> > [4]
>>> >> > https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&
>>> >> cd=2&ved=0CCUQFjAB&url=http%3A%2F%2Fapache-spark-user-
>>> >> list.1001560.n3.nabble.com%2Fscalac-crash-when-compiling-
>>> >> DataTypeConversions-scala-td17083.html&ei=aRF6VJrpNKr-
>>> >> iAKDgYGYBQ&usg=AFQjCNHjM9m__Hrumh-ecOsSE00-JkjKBQ&sig2=
>>> >> zDeSqOgs02AXJXj78w5I9g&bvm=bv.80642063,d.cGE&cad=rja
>>> >> > [5]
>>> >> > https://gist.githubusercontent.com/ryan-williams/
>>> 8a162367c4dc157d2479/
>>> >> raw/4ab0bd6e76d9fc5745eb4b45cdf13195d10efaa2/mvn%20test,%20post%
>>> >> 20clean,%20need%20dependencies%20built
>>> >> > [6]
>>> >> > https://gist.githubusercontent.com/ryan-williams/
>>> 8a162367c4dc157d2479/
>>> >> raw/f4c7e6fc8c301f869b00598c7b541dac243fb51e/dev%20run-tests,%
>>> >> 20post%20clean
>>> >> > (50 mins)
>>> >> > [7]
>>> >> > https://gist.github.com/ryan-williams/57f8bfc9328447fc5b97#
>>> >> file-dev-run-tests-failure-too-many-files-open-then-hang-L5260
>>> >> > (1hr)
>>> >> > [8] https://gist.github.com/ryan-williams/d0164194ad5de03f6e3f (1hr)
>>> >> > [9] https://issues.apache.org/jira/browse/SPARK-3867
>>> >> > [10] https://gist.github.com/ryan-williams/735adf543124c99647cc
>>> >> > [11] https://gist.github.com/ryan-williams/8d149bbcd0c6689ad564
>>> >> > [12]
>>> >> > https://gist.github.com/ryan-williams/07df5c583c9481fe1c14#
>>> >> file-gistfile1-txt-L853
>>> >> > (~90 mins)
>>> >> > [13]
>>> >> > https://gist.github.com/ryan-williams/718f6324af358819b496#
>>> >> file-gistfile1-txt-L852
>>> >> > (91 mins)
>>> >> > [14]
>>> >> > https://gist.github.com/ryan-williams/c06c1f4aa0b16f160965#
>>> >> file-gistfile1-txt-L854
>>> >> > [15] https://gist.github.com/ryan-williams/f8d410b5b9f082039c73
>>> >> > [16] https://gist.github.com/ryan-williams/2e94f55c9287938cf745
>>> >> > [17]
>>> >> > http://apache-spark-user-list.1001560.n3.nabble.com/quot-
>>> >> Too-many-open-files-quot-exception-on-reduceByKey-td2462.html
>>> >> > [18]
>>> >> > http://stackoverflow.com/questions/25707629/why-does-
>>> >> spark-job-fail-with-too-many-open-files
>>> >> > [19] https://issues.apache.org/jira/browse/SPARK-4002
>>> >> > [20] https://issues.apache.org/jira/browse/SPARK-4542
>>> >> > [21]
>>> >> > https://spark.apache.org/docs/latest/building-with-maven.
>>> >> html#spark-tests-in-maven
>>> >> > [22] https://www.mail-archive.com/dev@spark.apache.org/msg06443.html
>>> >> > [23]
>>> >> > http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%
>>> >> 3CCAOhmDzeUNhuCr41B7KRPTEwMn4cga_2TNpZrWqQB8REekokxzg@mail.gmail.com
>>> %3E
>>> >>
>>> >>
>>>
>>>
>>>



Re: Spurious test failures, testing best practices

Posted by Ryan Williams <ry...@gmail.com>.
Thanks Nicholas, glad to hear that some of this info will be pushed to the
main site soon, but this brings up yet another point of confusion that I've
struggled with, namely whether the documentation on github or that on
spark.apache.org should be considered the primary reference for people
seeking to learn about best practices for developing Spark.

Trying to read docs starting from
https://github.com/apache/spark/blob/master/docs/index.md right now, I find
that all of the links to other parts of the documentation are broken: they
point to relative paths that end in ".html", which will work when published
on the docs-site, but that would have to end in ".md" if a person was to be
able to navigate them on github.
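For illustration only: the .html-vs-.md mismatch is mechanical enough that a small rewrite pass could make the GitHub copies navigable. This is a sketch, not part of any actual docs build:

```python
import re

def md_links_for_github(markdown):
    """Rewrite relative .html links (as published on the docs site) to .md
    so they resolve when browsing the sources on GitHub. Absolute
    http(s) links are left untouched; anchor fragments are preserved."""
    pattern = re.compile(r'\]\((?!https?://)([^)#]+)\.html(#[^)]*)?\)')
    return pattern.sub(lambda m: '](%s.md%s)' % (m.group(1), m.group(2) or ''),
                       markdown)
```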

So expecting people to use the up-to-date docs on github (where all
internal URLs 404 and the main github README suggests that the "latest
Spark documentation" can be found on the actually-months-old docs-site
<https://github.com/apache/spark#online-documentation>) is not a good
solution. On the other hand, consulting months-old docs on the site is also
problematic, as this thread and your last email have borne out.  The result
is that there is no good place on the internet to learn about the most
up-to-date best practices for using/developing Spark.

Why not build http://spark.apache.org/docs/latest/ nightly (or every
commit) off of what's in github, rather than having that URL point to the
last release's docs (up to ~3 months old)? This way, casual users who want
the docs for the released version they happen to be using (which is already
frequently != "/latest" today, for many Spark users) can (still) find them
at http://spark.apache.org/docs/X.Y.Z, and the github README can safely
point people to a site (/latest) that actually has up-to-date docs that
reflect ToT and whose links work.

If there are concerns about existing semantics around "/latest" URLs being
broken, some new URL could be used, like
http://spark.apache.org/docs/snapshot/, but given that everything under
http://spark.apache.org/docs/latest/ is in a state of
planned-backwards-incompatible-changes every ~3mos, that doesn't sound like
that serious an issue to me; anyone sending around permanent links to
things under /latest is already going to have those links break / not make
sense in the near future.


On Sun Nov 30 2014 at 5:24:33 PM Nicholas Chammas <
nicholas.chammas@gmail.com> wrote:

>
>    - currently the docs only contain information about building with
>    maven,
>    and even then don’t cover many important cases
>
>  All other points aside, I just want to point out that the docs document
> both how to use Maven and SBT and clearly state
> <https://github.com/apache/spark/blob/master/docs/building-spark.md#building-with-sbt>
> that Maven is the “build of reference” while SBT may be preferable for
> day-to-day development.
>
> I believe the main reason most people miss this documentation is that,
> though it’s up-to-date on GitHub, it hasn’t been published yet to the docs
> site. It should go out with the 1.2 release.
>
> Improvements to the documentation on building Spark belong here:
> https://github.com/apache/spark/blob/master/docs/building-spark.md
>
> If there are clear recommendations that come out of this thread but are
> not in that doc, they should be added in there. Other, less important
> details may possibly be better suited for the Contributing to Spark
> <https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark>
> guide.
>
> Nick
>

>
> On Sun Nov 30 2014 at 6:50:55 PM Patrick Wendell <pw...@gmail.com>
> wrote:
>
>> Hey Ryan,
>>
>> A few more things here. You should feel free to send patches to
>> Jenkins to test them, since this is the reference environment in which
>> we regularly run tests. This is the normal workflow for most
>> developers and we spend a lot of effort provisioning/maintaining a
>> very large jenkins cluster to allow developers access to this resource. A
>> common development approach is to locally run tests that you've added
>> in a patch, then send it to jenkins for the full run, and then try to
>> debug locally if you see specific unanticipated test failures.
>>
>> One challenge we have is that given the proliferation of OS versions,
>> Java versions, Python versions, ulimits, etc. there is a combinatorial
>> number of environments in which tests could be run. It is very hard in
>> some cases to figure out post-hoc why a given test is not working in a
>> specific environment. I think a good solution here would be to use a
>> standardized docker container for running Spark tests and asking folks
>> to use that locally if they are trying to run all of the hundreds of
>> Spark tests.
>>
>> Another solution would be to mock out every system interaction in
>> Spark's tests including e.g. filesystem interactions to try and reduce
>> variance across environments. However, that seems difficult.
>>
>> As the number of developers of Spark increases, it's definitely a good
>> idea for us to invest in developer infrastructure including things
>> like snapshot releases, better documentation, etc. Thanks for bringing
>> this up as a pain point.
>>
>> - Patrick
>>
>>
>> On Sun, Nov 30, 2014 at 3:35 PM, Ryan Williams
>> <ry...@gmail.com> wrote:
>> > thanks for the info, Matei and Brennon. I will try to switch my
>> workflow to
>> > using sbt. Other potential action items:
>> >
>> > - currently the docs only contain information about building with maven,
>> > and even then don't cover many important cases, as I described in my
>> > previous email. If SBT is as much better as you've described then that
>> > should be made much more obvious. Wasn't it the case recently that there
>> > was only a page about building with SBT, and not one about building with
>> > maven? Clearer messaging around this needs to exist in the
>> documentation,
>> > not just on the mailing list, imho.
>> >
>> > - +1 to better distinguishing between unit and integration tests, having
>> > separate scripts for each, improving documentation around common
>> workflows,
>> > expectations of brittleness with each kind of test, advisability of just
>> > relying on Jenkins for certain kinds of tests to not waste too much
>> time,
>> > etc. Things like the compiler crash should be discussed in the
>> > documentation, not just in the mailing list archives, if new
>> contributors
>> > are likely to run into them through no fault of their own.
>> >
>> > - What is the algorithm you use to decide what tests you might have
>> broken?
>> > Can we codify it in some scripts that other people can use?
>> >
>> >
>> >
>> > On Sun Nov 30 2014 at 4:06:41 PM Matei Zaharia <matei.zaharia@gmail.com
>> >
>> > wrote:
>> >
>> >> Hi Ryan,
>> >>
>> >> As a tip (and maybe this isn't documented well), I normally use SBT for
>> >> development to avoid the slow build process, and use its interactive
>> >> console to run only specific tests. The nice advantage is that SBT can
>> keep
>> >> the Scala compiler loaded and JITed across builds, making it faster to
>> >> iterate. To use it, you can do the following:
>> >>
>> >> - Start the SBT interactive console with sbt/sbt
>> >> - Build your assembly by running the "assembly" target in the assembly
>> >> project: assembly/assembly
>> >> - Run all the tests in one module: core/test
>> >> - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite
>> (this
>> >> also supports tab completion)
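A condensed transcript of that loop, for reference (a sketch against the Spark 1.x layout; RDDSuite is only an example target):

```shell
# Matei's SBT loop, condensed. Lines starting with "$" run in the terminal,
# lines starting with ">" run at the interactive sbt prompt.
#
#   $ sbt/sbt                  # start the long-lived console once
#   > assembly/assembly        # build the assembly jar (integration tests need it)
#   > core/test                # every test in the core module
#   > core/test-only org.apache.spark.rdd.RDDSuite    # one suite (tab-completes)
#   > ~core/test-only org.apache.spark.rdd.RDDSuite   # rerun on each file save
```

The `~` prefix is SBT's standard triggered execution; it isn't mentioned above, but it pairs well with this workflow since the suite reruns automatically whenever a source file changes.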
>> >>
>> >> Running all the tests does take a while, and I usually just rely on
>> >> Jenkins for that once I've run the tests for the things I believed my
>> patch
>> >> could break. But this is because some of them are integration tests
>> (e.g.
>> >> DistributedSuite, which creates multi-process mini-clusters). Many of
>> the
>> >> individual suites run fast without requiring this, however, so you can
>> pick
>> >> the ones you want. Perhaps we should find a way to tag them so people
>> can
>> >> do a "quick-test" that skips the integration ones.
>> >>
>> >> The assembly builds are annoying but they only take about a minute for
>> me
>> >> on a MacBook Pro with SBT warmed up. The assembly is actually only
>> required
>> >> for some of the "integration" tests (which launch new processes), but
>> I'd
>> >> recommend doing it all the time anyway since it would be very
>> confusing to
>> >> run those with an old assembly. The Scala compiler crash issue can
>> also be
>> >> a problem, but I don't see it very often with SBT. If it happens, I
>> exit
>> >> SBT and do sbt clean.
>> >>
>> >> Anyway, this is useful feedback and I think we should try to improve
>> some
>> >> of these suites, but hopefully you can also try the faster SBT
>> process. At
>> >> the end of the day, if we want integration tests, the whole test
>> process
>> >> will take an hour, but most of the developers I know leave that to
>> Jenkins
>> >> and only run individual tests locally before submitting a patch.
>> >>
>> >> Matei
>> >>
>> >>
>> >> > On Nov 30, 2014, at 2:39 PM, Ryan Williams <
>> >> ryan.blake.williams@gmail.com> wrote:
>> >> >
>> >> > In the course of trying to make contributions to Spark, I have had a
>> lot
>> >> of
>> >> > trouble running Spark's tests successfully. The main pain points I've
>> >> > experienced are:
>> >> >
>> >> >    1) frequent, spurious test failures
>> >> >    2) high latency of running tests
>> >> >    3) difficulty running specific tests in an iterative fashion
>> >> >
>> >> > Here is an example series of failures that I encountered this weekend
>> >> > (along with footnote links to the console output from each and
>> >> > approximately how long each took):
>> >> >
>> >> > - `./dev/run-tests` [1]: failure in BroadcastSuite that I've not seen
>> >> > before.
>> >> > - `mvn '-Dsuites=*BroadcastSuite*' test` [2]: same failure.
>> >> > - `mvn '-Dsuites=*BroadcastSuite* Unpersisting' test` [3]:
>> BroadcastSuite
>> >> > passed, but scala compiler crashed on the "catalyst" project.
>> >> > - `mvn clean`: some attempts to run earlier commands (that previously
>> >> > didn't crash the compiler) all result in the same compiler crash.
>> >> Previous
>> >> > discussion on this list implies this can only be solved by a `mvn
>> clean`
>> >> > [4].
>> >> > - `mvn '-Dsuites=*BroadcastSuite*' test` [5]: immediately post-clean,
>> >> > BroadcastSuite can't run because assembly is not built.
>> >> > - `./dev/run-tests` again [6]: pyspark tests fail, some messages
>> about
>> >> > version mismatches and python 2.6. The machine this ran on has python
>> >> 2.7,
>> >> > so I don't know what that's about.
>> >> > - `./dev/run-tests` again [7]: "too many open files" errors in
>> several
>> >> > tests. `ulimit -a` shows a maximum of 4864 open files. Apparently
>> this is
>> >> > not enough, but only some of the time? I increased it to 8192 and
>> tried
>> >> > again.
>> >> > - `./dev/run-tests` again [8]: same pyspark errors as before. This
>> seems
>> >> to
>> >> > be the issue from SPARK-3867 [9], which was supposedly fixed on
>> October
>> >> 14;
>> >> > not sure how I'm seeing it now. In any case, switched to Python 2.6
>> and
>> >> > installed unittest2, and python/run-tests seems to be unblocked.
>> >> > - `./dev/run-tests` again [10]: finally passes!
>> >> >
>> >> > This was on a spark checkout at ceb6281 (ToT Friday), with a few
>> trivial
>> >> > changes added on (that I wanted to test before sending out a PR), on
>> a
>> >> > macbook running OSX Yosemite (10.10.1), java 1.8 and mvn 3.2.3 [11].
>> >> >
>> >> > Meanwhile, on a linux 2.6.32 / CentOS 6.4 machine, I tried similar
>> >> commands
>> >> > from the same repo state:
>> >> >
>> >> > - `./dev/run-tests` [12]: YarnClusterSuite failure.
>> >> > - `./dev/run-tests` [13]: same YarnClusterSuite failure. I know I've
>> seen
>> >> > this one before on this machine and am guessing it actually occurs
>> every
>> >> > time.
>> >> > - `./dev/run-tests` [14]: to be sure, I reverted my changes, ran one
>> more
>> >> > time from ceb6281, and saw the same failure.
>> >> >
>> >> > This was with java 1.7 and maven 3.2.3 [15]. In one final attempt to
>> >> narrow
>> >> > down the linux YarnClusterSuite failure, I ran `./dev/run-tests` on
>> my
>> >> mac,
>> >> > from ceb6281, with java 1.7 (instead of 1.8, which the previous runs
>> >> used),
>> >> > and it passed [16], so the failure seems specific to my linux
>> >> machine/arch.
>> >> >
>> >> > At this point I believe that my changes don't break any tests (the
>> >> > YarnClusterSuite failure on my linux presumably not being... "real"),
>> >> and I
>> >> > am ready to send out a PR. Whew!
>> >> >
>> >> > However, reflecting on the 5 or 6 distinct failure-modes represented
>> >> above:
>> >> >
>> >> > - One of them (too many files open) is something I can (and did,
>> >> > hopefully) fix once and for all. It cost me an ~hour this time
>> >> (approximate
>> >> > time of running ./dev/run-tests) and a few hours other times when I
>> >> didn't
>> >> > fully understand/fix it. It doesn't happen deterministically (why?),
>> but
>> >> > does happen somewhat frequently to people, having been discussed on
>> the
>> >> > user list multiple times [17] and on SO [18]. Maybe some note in the
>> >> > documentation advising people to check their ulimit makes sense?
>> >> > - One of them (unittest2 must be installed for python 2.6) was
>> supposedly
>> >> > fixed upstream of the commits I tested here; I don't know why I'm
>> still
>> >> > running into it. This cost me a few hours of running
>> `./dev/run-tests`
>> >> > multiple times to see if it was transient, plus some time
>> researching and
>> >> > working around it.
>> >> > - The original BroadcastSuite failure cost me a few hours and went
>> away
>> >> > before I'd even run `mvn clean`.
>> >> > - A new incarnation of the sbt-compiler-crash phenomenon cost me a
>> few
>> >> > hours of running `./dev/run-tests` in different ways before deciding
>> >> that,
>> >> > as usual, there was no way around it and that I'd need to run `mvn
>> clean`
>> >> > and start running tests from scratch.
>> >> > - The YarnClusterSuite failures on my linux box have cost me hours of
>> >> > trying to figure out whether they're my fault. I've seen them many
>> times
>> >> > over the past weeks/months, plus or minus other failures that have
>> come
>> >> and
>> >> > gone, and was especially befuddled by them when I was seeing a
>> disjoint
>> >> set
>> >> > of reproducible failures on my mac [19] (the triaging of which
>> involved
>> >> > dozens of runs of `./dev/run-tests`).
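The ulimit note suggested above could be scripted as a pre-flight check in the docs or in dev/run-tests itself. A sketch (the 8192 threshold is just the value that worked here, not an official requirement):

```shell
#!/bin/sh
# Warn about (or, where the hard limit allows, fix) a too-low open-files
# limit before kicking off the test suite.
want=8192
have=$(ulimit -n)
if [ "$have" != "unlimited" ] && [ "$have" -lt "$want" ]; then
  # Raising only works up to the hard limit (`ulimit -Hn`); beyond that,
  # /etc/security/limits.conf (Linux) or launchctl limits (OS X) must change.
  ulimit -n "$want" 2>/dev/null \
    || echo "warning: open-files limit is $have; tests may hit 'too many open files'" >&2
fi
echo "open-files limit: $(ulimit -n)"
```

Note that a limit raised inside a child script does not propagate back to the parent shell, so in practice a check like this belongs at the top of the test driver itself rather than in a separate helper.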
>> >> >
>> >> > While I'm interested in digging into each of these issues, I also
>> want to
>> >> > discuss the frequency with which I've run into issues like these.
>> This is
>> >> > unfortunately not the first time in recent months that I've spent
>> days
>> >> > playing spurious-test-failure whack-a-mole with a 60-90min
>> dev/run-tests
>> >> > iteration time, which is no fun! So I am wondering/thinking:
>> >> >
>> >> > - Do other people experience this level of flakiness from spark
>> tests?
>> >> > - Do other people bother running dev/run-tests locally, or just let
>> >> Jenkins
>> >> > do it during the CR process?
>> >> > - Needing to run a full assembly post-clean just to continue running
>> one
>> >> > specific test case feels especially wasteful, and the failure output
>> when
>> >> > naively attempting to run a specific test without having built an
>> >> assembly
>> >> > jar is not always clear about what the issue is or how to fix it;
>> even
>> >> the
>> >> > fact that certain tests require "building the world" is not
>> something I
>> >> > would have expected, and has cost me hours of confusion.
>> >> >    - Should a person running spark tests assume that they must build
>> an
>> >> > assembly JAR before running anything?
>> >> >    - Are there some proper "unit" tests that are actually
>> self-contained
>> >> /
>> >> > able to be run without building an assembly jar?
>> >> >    - Can we better document/demarcate which tests have which
>> >> dependencies?
>> >> >    - Is there something finer-grained than building an assembly JAR
>> that
>> >> > is sufficient in some cases?
>> >> >        - If so, can we document that?
>> >> >        - If not, can we move to a world of finer-grained
>> dependencies for
>> >> > some of these?
>> >> > - Leaving all of these spurious failures aside, the process of
>> assembling
>> >> > and testing a new JAR is not a quick one (40 and 60 mins for me
>> >> typically,
>> >> > respectively). I would guess that there are dozens (hundreds?) of
>> people
>> >> > who build a Spark assembly from various ToTs on any given day, and
>> who
>> >> all
>> >> > wait on the exact same compilation / assembly steps to occur.
>> Expanding
>> >> on
>> >> > the recent work to publish nightly snapshots [20], can we do a
>> better job
>> >> > caching/sharing compilation artifacts at a more granular level
>> (pre-built
>> >> > assembly JARs at each SHA? pre-built JARs per-maven-module, per-SHA?
>> more
>> >> > granular maven modules, plus the previous two?), or otherwise save
>> some
>> >> of
>> >> > the considerable amount of redundant compilation work that I had to
>> do
>> >> over
>> >> > the course of my odyssey this weekend?
>> >> >
>> >> > Ramping up on most projects involves some amount of supplementing the
>> >> > documentation with trial and error to figure out what to run, which
>> >> > "errors" are real errors and which can be ignored, etc., but
>> navigating
>> >> > that minefield on Spark has proved especially challenging and
>> >> > time-consuming for me. Some of that comes directly from scala's
>> >> relatively
>> >> > slow compilation times and immature build-tooling ecosystem, but
>> that is
>> >> > the world we live in and it would be nice if Spark took the
>> alleviation
>> >> of
>> >> > the resulting pain more seriously, as one of the more interesting and
>> >> > well-known large scala projects around right now. The official
>> >> > documentation around how to build different subsets of the codebase
>> is
>> >> > somewhat sparse [21], and there have been many mixed [22] accounts
>> [23]
>> >> on
>> >> > this mailing list about preferred ways to build on mvn vs. sbt (none
>> of
>> >> > which has made it into official documentation, as far as I've seen).
>> >> > Expecting new contributors to piece together all of this received
>> >> > folk-wisdom about how to build/test in a sane way by trawling mailing
>> >> list
>> >> > archives seems suboptimal.
>> >> >
>> >> > Thanks for reading, looking forward to hearing your ideas!
>> >> >
>> >> > -Ryan
>> >> >
>> >> > P.S. Is "best practice" for emailing this list to not incorporate any
>> >> HTML
>> >> > in the body? It seems like all of the archives I've seen strip it
>> out,
>> >> but
>> >> > other people have used it and gmail displays it.
>> >> >
>> >> >
>> >> > [1] https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/484c2fb8bc0efa0e39d142087eefa9c3d5292ea3/dev%20run-tests:%20fail (57 mins)
>> >> > [2] https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/ce264e469be3641f061eabd10beb1d71ac243991/mvn%20test:%20fail (6 mins)
>> >> > [3] https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/6bc76c67aeef9c57ddd9fb2ba260fb4189dbb927/mvn%20test%20case:%20pass%20test,%20fail%20subsequent%20compile (4 mins)
>> >> > [4] https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0CCUQFjAB&url=http%3A%2F%2Fapache-spark-user-list.1001560.n3.nabble.com%2Fscalac-crash-when-compiling-DataTypeConversions-scala-td17083.html&ei=aRF6VJrpNKr-iAKDgYGYBQ&usg=AFQjCNHjM9m__Hrumh-ecOsSE00-JkjKBQ&sig2=zDeSqOgs02AXJXj78w5I9g&bvm=bv.80642063,d.cGE&cad=rja
>> >> > [5] https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/4ab0bd6e76d9fc5745eb4b45cdf13195d10efaa2/mvn%20test,%20post%20clean,%20need%20dependencies%20built
>> >> > [6] https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/f4c7e6fc8c301f869b00598c7b541dac243fb51e/dev%20run-tests,%20post%20clean (50 mins)
>> >> > [7] https://gist.github.com/ryan-williams/57f8bfc9328447fc5b97#file-dev-run-tests-failure-too-many-files-open-then-hang-L5260 (1hr)
>> >> > [8] https://gist.github.com/ryan-williams/d0164194ad5de03f6e3f (1hr)
>> >> > [9] https://issues.apache.org/jira/browse/SPARK-3867
>> >> > [10] https://gist.github.com/ryan-williams/735adf543124c99647cc
>> >> > [11] https://gist.github.com/ryan-williams/8d149bbcd0c6689ad564
>> >> > [12] https://gist.github.com/ryan-williams/07df5c583c9481fe1c14#file-gistfile1-txt-L853 (~90 mins)
>> >> > [13] https://gist.github.com/ryan-williams/718f6324af358819b496#file-gistfile1-txt-L852 (91 mins)
>> >> > [14] https://gist.github.com/ryan-williams/c06c1f4aa0b16f160965#file-gistfile1-txt-L854
>> >> > [15] https://gist.github.com/ryan-williams/f8d410b5b9f082039c73
>> >> > [16] https://gist.github.com/ryan-williams/2e94f55c9287938cf745
>> >> > [17] http://apache-spark-user-list.1001560.n3.nabble.com/quot-Too-many-open-files-quot-exception-on-reduceByKey-td2462.html
>> >> > [18] http://stackoverflow.com/questions/25707629/why-does-spark-job-fail-with-too-many-open-files
>> >> > [19] https://issues.apache.org/jira/browse/SPARK-4002
>> >> > [20] https://issues.apache.org/jira/browse/SPARK-4542
>> >> > [21] https://spark.apache.org/docs/latest/building-with-maven.html#spark-tests-in-maven
>> >> > [22] https://www.mail-archive.com/dev@spark.apache.org/msg06443.html
>> >> > [23] http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3CCAOhmDzeUNhuCr41B7KRPTEwMn4cga_2TNpZrWqQB8REekokxzg@mail.gmail.com%3E
>> >>
>> >>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> For additional commands, e-mail: dev-help@spark.apache.org
>>
>>

Re: Spurious test failures, testing best practices

Posted by Nicholas Chammas <ni...@gmail.com>.
   - currently the docs only contain information about building with maven,
   and even then don’t cover many important cases

 All other points aside, I just want to point out that the docs document
both how to use Maven and SBT and clearly state
<https://github.com/apache/spark/blob/master/docs/building-spark.md#building-with-sbt>
that Maven is the “build of reference” while SBT may be preferable for
day-to-day development.

I believe the main reason most people miss this documentation is that,
though it's up-to-date on GitHub, it hasn't yet been published to the docs
site. It should go out with the 1.2 release.

Improvements to the documentation on building Spark belong here:
https://github.com/apache/spark/blob/master/docs/building-spark.md

If there are clear recommendations that come out of this thread but are not
in that doc, they should be added in there. Other, less important details
may possibly be better suited for the Contributing to Spark
<https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark>
guide.

Nick

> williams/8a162367c4dc157d2479/
> >> raw/4ab0bd6e76d9fc5745eb4b45cdf13195d10efaa2/mvn%20test,%20post%
> >> 20clean,%20need%20dependencies%20built
> >> > [6]
> >> > https://gist.githubusercontent.com/ryan-
> williams/8a162367c4dc157d2479/
> >> raw/f4c7e6fc8c301f869b00598c7b541dac243fb51e/dev%20run-tests,%
> >> 20post%20clean
> >> > (50 mins)
> >> > [7]
> >> > https://gist.github.com/ryan-williams/57f8bfc9328447fc5b97#
> >> file-dev-run-tests-failure-too-many-files-open-then-hang-L5260
> >> > (1hr)
> >> > [8] https://gist.github.com/ryan-williams/d0164194ad5de03f6e3f (1hr)
> >> > [9] https://issues.apache.org/jira/browse/SPARK-3867
> >> > [10] https://gist.github.com/ryan-williams/735adf543124c99647cc
> >> > [11] https://gist.github.com/ryan-williams/8d149bbcd0c6689ad564
> >> > [12]
> >> > https://gist.github.com/ryan-williams/07df5c583c9481fe1c14#
> >> file-gistfile1-txt-L853
> >> > (~90 mins)
> >> > [13]
> >> > https://gist.github.com/ryan-williams/718f6324af358819b496#
> >> file-gistfile1-txt-L852
> >> > (91 mins)
> >> > [14]
> >> > https://gist.github.com/ryan-williams/c06c1f4aa0b16f160965#
> >> file-gistfile1-txt-L854
> >> > [15] https://gist.github.com/ryan-williams/f8d410b5b9f082039c73
> >> > [16] https://gist.github.com/ryan-williams/2e94f55c9287938cf745
> >> > [17]
> >> > http://apache-spark-user-list.1001560.n3.nabble.com/quot-
> >> Too-many-open-files-quot-exception-on-reduceByKey-td2462.html
> >> > [18]
> >> > http://stackoverflow.com/questions/25707629/why-does-
> >> spark-job-fail-with-too-many-open-files
> >> > [19] https://issues.apache.org/jira/browse/SPARK-4002
> >> > [20] https://issues.apache.org/jira/browse/SPARK-4542
> >> > [21]
> >> > https://spark.apache.org/docs/latest/building-with-maven.
> >> html#spark-tests-in-maven
> >> > [22] https://www.mail-archive.com/dev@spark.apache.org/msg06443.html
> >> > [23]
> >> > http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%
> >> 3CCAOhmDzeUNhuCr41B7KRPTEwMn4cga_2TNpZrWqQB8REekokxzg@mail.gmail.com%3E
> >>
> >>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>
>

Re: Spurious test failures, testing best practices

Posted by Patrick Wendell <pw...@gmail.com>.
Hi Ilya - you can just submit a pull request; the way we test contributions
is to run them through Jenkins. You don't need to do anything special.
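For anyone new to that workflow, the mechanics are roughly the following (a sketch: the repo name, branch name, and commit message below are made-up placeholders, and the actual push/PR step is left in comments because it requires a real GitHub fork of apache/spark):

```shell
# Build a local demo repo with a topic branch -- the same shape of history
# you would push to your fork before opening a pull request.
set -e
git init -q demo-spark
git -C demo-spark config user.email dev@example.com
git -C demo-spark config user.name "Spark Dev"
echo "fix" > demo-spark/fix.txt
git -C demo-spark checkout -q -b broadcast-suite-fix
git -C demo-spark add fix.txt
git -C demo-spark commit -qm "Fix flaky BroadcastSuite test"
git -C demo-spark rev-parse --abbrev-ref HEAD
# git push origin broadcast-suite-fix
# ...then open a pull request against apache/spark on GitHub; Jenkins
# picks the PR up and comments on it with the test results.
```

Once the PR is open, each new push to the branch triggers a fresh Jenkins run.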

On Sun, Nov 30, 2014 at 8:57 PM, Ganelin, Ilya
<Il...@capitalone.com> wrote:
> Hi, Patrick - with regards to testing on Jenkins, is the process for this
> to submit a pull request for the branch or is there another interface we
> can use to submit a build to Jenkins for testing?
>
> On 11/30/14, 6:49 PM, "Patrick Wendell" <pw...@gmail.com> wrote:
>
>>Hey Ryan,
>>
>>A few more things here. You should feel free to send patches to
>>Jenkins to test them, since this is the reference environment in which
>>we regularly run tests. This is the normal workflow for most
>>developers and we spend a lot of effort provisioning/maintaining a
>>very large jenkins cluster to allow developers to access this resource. A
>>common development approach is to locally run tests that you've added
>>in a patch, then send it to jenkins for the full run, and then try to
>>debug locally if you see specific unanticipated test failures.
>>
>>One challenge we have is that given the proliferation of OS versions,
>>Java versions, Python versions, ulimits, etc. there is a combinatorial
>>number of environments in which tests could be run. It is very hard in
>>some cases to figure out post-hoc why a given test is not working in a
>>specific environment. I think a good solution here would be to use a
>>standardized docker container for running Spark tests and to ask folks
>>to use that locally if they are trying to run all of the hundreds of
>>Spark tests.
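One cheap guard against that environment variance, for the ulimit case raised earlier in this thread, is to check the file-descriptor limits before kicking off a full test run (a sketch; 8192 is simply the value that happened to work above, not an official recommendation):

```shell
# Print the per-process open-file limits; a low soft limit (e.g. the 4864
# default seen earlier in this thread) can surface as "too many open files"
# partway through the test suite.
echo "soft limit: $(ulimit -Sn)"
echo "hard limit: $(ulimit -Hn)"
# Raise the soft limit for the current shell only (persisting it is
# OS-specific, e.g. /etc/security/limits.conf on Linux):
# ulimit -n 8192
```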
>>
>>Another solution would be to mock out every system interaction in
>>Spark's tests including e.g. filesystem interactions to try and reduce
>>variance across environments. However, that seems difficult.
>>
>>As the number of developers of Spark increases, it's definitely a good
>>idea for us to invest in developer infrastructure including things
>>like snapshot releases, better documentation, etc. Thanks for bringing
>>this up as a pain point.
>>
>>- Patrick
>>
>>
>>On Sun, Nov 30, 2014 at 3:35 PM, Ryan Williams
>><ry...@gmail.com> wrote:
>>> thanks for the info, Matei and Brennon. I will try to switch my workflow
>>> to using sbt. Other potential action items:
>>>
>>> - currently the docs only contain information about building with maven,
>>> and even then don't cover many important cases, as I described in my
>>> previous email. If SBT is as much better as you've described then that
>>> should be made much more obvious. Wasn't it the case recently that there
>>> was only a page about building with SBT, and not one about building with
>>> maven? Clearer messaging around this needs to exist in the documentation,
>>> not just on the mailing list, imho.
>>>
>>> - +1 to better distinguishing between unit and integration tests, having
>>> separate scripts for each, improving documentation around common
>>> workflows, expectations of brittleness with each kind of test,
>>> advisability of just relying on Jenkins for certain kinds of tests to not
>>> waste too much time, etc. Things like the compiler crash should be
>>> discussed in the documentation, not just in the mailing list archives, if
>>> new contributors are likely to run into them through no fault of their
>>> own.
>>>
>>> - What is the algorithm you use to decide what tests you might have
>>> broken? Can we codify it in some scripts that other people can use?
>>>
>>>
>>>
>>> On Sun Nov 30 2014 at 4:06:41 PM Matei Zaharia <ma...@gmail.com>
>>> wrote:
>>>
>>>> Hi Ryan,
>>>>
>>>> As a tip (and maybe this isn't documented well), I normally use SBT for
>>>> development to avoid the slow build process, and use its interactive
>>>> console to run only specific tests. The nice advantage is that SBT can
>>>> keep the Scala compiler loaded and JITed across builds, making it faster
>>>> to iterate. To use it, you can do the following:
>>>>
>>>> - Start the SBT interactive console with sbt/sbt
>>>> - Build your assembly by running the "assembly" target in the assembly
>>>> project: assembly/assembly
>>>> - Run all the tests in one module: core/test
>>>> - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite
>>>> (this also supports tab completion)
>>>>
>>>> Running all the tests does take a while, and I usually just rely on
>>>> Jenkins for that once I've run the tests for the things I believed my
>>>> patch could break. But this is because some of them are integration tests
>>>> (e.g. DistributedSuite, which creates multi-process mini-clusters). Many
>>>> of the individual suites run fast without requiring this, however, so you
>>>> can pick the ones you want. Perhaps we should find a way to tag them so
>>>> people can do a "quick-test" that skips the integration ones.
>>>>
>>>> The assembly builds are annoying but they only take about a minute for me
>>>> on a MacBook Pro with SBT warmed up. The assembly is actually only
>>>> required for some of the "integration" tests (which launch new
>>>> processes), but I'd recommend doing it all the time anyway since it would
>>>> be very confusing to run those with an old assembly. The Scala compiler
>>>> crash issue can also be a problem, but I don't see it very often with
>>>> SBT. If it happens, I exit SBT and do sbt clean.
>>>>
>>>> Anyway, this is useful feedback and I think we should try to improve some
>>>> of these suites, but hopefully you can also try the faster SBT process.
>>>> At the end of the day, if we want integration tests, the whole test
>>>> process will take an hour, but most of the developers I know leave that
>>>> to Jenkins and only run individual tests locally before submitting a
>>>> patch.
>>>>
>>>> Matei
>>>>
>>>>
>>>> > On Nov 30, 2014, at 2:39 PM, Ryan Williams <ryan.blake.williams@gmail.com> wrote:
>>>> >
>>>> > In the course of trying to make contributions to Spark, I have had a lot
>>>> > of trouble running Spark's tests successfully. The main pain points I've
>>>> > experienced are:
>>>> >
>>>> >    1) frequent, spurious test failures
>>>> >    2) high latency of running tests
>>>> >    3) difficulty running specific tests in an iterative fashion
>>>> >
>>>> > Here is an example series of failures that I encountered this weekend
>>>> > (along with footnote links to the console output from each and
>>>> > approximately how long each took):
>>>> >
>>>> > - `./dev/run-tests` [1]: failure in BroadcastSuite that I've not seen
>>>> > before.
>>>> > - `mvn '-Dsuites=*BroadcastSuite*' test` [2]: same failure.
>>>> > - `mvn '-Dsuites=*BroadcastSuite* Unpersisting' test` [3]:
>>>> > BroadcastSuite passed, but scala compiler crashed on the "catalyst"
>>>> > project.
>>>> > - `mvn clean`: some attempts to run earlier commands (that previously
>>>> > didn't crash the compiler) all result in the same compiler crash.
>>>> > Previous discussion on this list implies this can only be solved by a
>>>> > `mvn clean` [4].
>>>> > - `mvn '-Dsuites=*BroadcastSuite*' test` [5]: immediately post-clean,
>>>> > BroadcastSuite can't run because assembly is not built.
>>>> > - `./dev/run-tests` again [6]: pyspark tests fail, some messages about
>>>> > version mismatches and python 2.6. The machine this ran on has python
>>>> > 2.7, so I don't know what that's about.
>>>> > - `./dev/run-tests` again [7]: "too many open files" errors in several
>>>> > tests. `ulimit -a` shows a maximum of 4864 open files. Apparently this
>>>> > is not enough, but only some of the time? I increased it to 8192 and
>>>> > tried again.
>>>> > - `./dev/run-tests` again [8]: same pyspark errors as before. This seems
>>>> > to be the issue from SPARK-3867 [9], which was supposedly fixed on
>>>> > October 14; not sure how I'm seeing it now. In any case, switched to
>>>> > Python 2.6 and installed unittest2, and python/run-tests seems to be
>>>> > unblocked.
>>>> > - `./dev/run-tests` again [10]: finally passes!
>>>> >
>>>> > This was on a spark checkout at ceb6281 (ToT Friday), with a few trivial
>>>> > changes added on (that I wanted to test before sending out a PR), on a
>>>> > macbook running OSX Yosemite (10.10.1), java 1.8 and mvn 3.2.3 [11].
>>>> >
>>>> > Meanwhile, on a linux 2.6.32 / CentOS 6.4 machine, I tried similar
>>>> > commands from the same repo state:
>>>> >
>>>> > - `./dev/run-tests` [12]: YarnClusterSuite failure.
>>>> > - `./dev/run-tests` [13]: same YarnClusterSuite failure. I know I've
>>>> > seen this one before on this machine and am guessing it actually occurs
>>>> > every time.
>>>> > - `./dev/run-tests` [14]: to be sure, I reverted my changes, ran one
>>>> > more time from ceb6281, and saw the same failure.
>>>> >
>>>> > This was with java 1.7 and maven 3.2.3 [15]. In one final attempt to
>>>> > narrow down the linux YarnClusterSuite failure, I ran `./dev/run-tests`
>>>> > on my mac, from ceb6281, with java 1.7 (instead of 1.8, which the
>>>> > previous runs used), and it passed [16], so the failure seems specific
>>>> > to my linux machine/arch.
>>>> >
>>>> > At this point I believe that my changes don't break any tests (the
>>>> > YarnClusterSuite failure on my linux presumably not being... "real"),
>>>> > and I am ready to send out a PR. Whew!
>>>> >
>>>> > However, reflecting on the 5 or 6 distinct failure-modes represented
>>>> > above:
>>>> >
>>>> > - One of them (too many files open), is something I can (and did,
>>>> > hopefully) fix once and for all. It cost me an ~hour this time
>>>> > (approximate time of running ./dev/run-tests) and a few hours other
>>>> > times when I didn't fully understand/fix it. It doesn't happen
>>>> > deterministically (why?), but does happen somewhat frequently to people,
>>>> > having been discussed on the user list multiple times [17] and on SO
>>>> > [18]. Maybe some note in the documentation advising people to check
>>>> > their ulimit makes sense?
>>>> > - One of them (unittest2 must be installed for python 2.6) was
>>>> > supposedly fixed upstream of the commits I tested here; I don't know why
>>>> > I'm still running into it. This cost me a few hours of running
>>>> > `./dev/run-tests` multiple times to see if it was transient, plus some
>>>> > time researching and working around it.
>>>> > - The original BroadcastSuite failure cost me a few hours and went away
>>>> > before I'd even run `mvn clean`.
>>>> > - A new incarnation of the sbt-compiler-crash phenomenon cost me a few
>>>> > hours of running `./dev/run-tests` in different ways before deciding
>>>> > that, as usual, there was no way around it and that I'd need to run
>>>> > `mvn clean` and start running tests from scratch.
>>>> > - The YarnClusterSuite failures on my linux box have cost me hours of
>>>> > trying to figure out whether they're my fault. I've seen them many times
>>>> > over the past weeks/months, plus or minus other failures that have come
>>>> > and gone, and was especially befuddled by them when I was seeing a
>>>> > disjoint set of reproducible failures on my mac [19] (the triaging of
>>>> > which involved dozens of runs of `./dev/run-tests`).
>>>> >
>>>> > While I'm interested in digging into each of these issues, I also want
>>>> > to discuss the frequency with which I've run into issues like these.
>>>> > This is unfortunately not the first time in recent months that I've
>>>> > spent days playing spurious-test-failure whack-a-mole with a 60-90min
>>>> > dev/run-tests iteration time, which is no fun! So I am
>>>> > wondering/thinking:
>>>> >
>>>> > - Do other people experience this level of flakiness from spark tests?
>>>> > - Do other people bother running dev/run-tests locally, or just let
>>>> > Jenkins do it during the CR process?
>>>> > - Needing to run a full assembly post-clean just to continue running one
>>>> > specific test case feels especially wasteful, and the failure output
>>>> > when naively attempting to run a specific test without having built an
>>>> > assembly jar is not always clear about what the issue is or how to fix
>>>> > it; even the fact that certain tests require "building the world" is not
>>>> > something I would have expected, and has cost me hours of confusion.
>>>> >    - Should a person running spark tests assume that they must build an
>>>> > assembly JAR before running anything?
>>>> >    - Are there some proper "unit" tests that are actually self-contained
>>>> > / able to be run without building an assembly jar?
>>>> >    - Can we better document/demarcate which tests have which
>>>> > dependencies?
>>>> >    - Is there something finer-grained than building an assembly JAR that
>>>> > is sufficient in some cases?
>>>> >        - If so, can we document that?
>>>> >        - If not, can we move to a world of finer-grained dependencies
>>>> > for some of these?
>>>> > - Leaving all of these spurious failures aside, the process of
>>>> > assembling and testing a new JAR is not a quick one (40 and 60 mins for
>>>> > me typically, respectively). I would guess that there are dozens
>>>> > (hundreds?) of people who build a Spark assembly from various ToTs on
>>>> > any given day, and who all wait on the exact same compilation / assembly
>>>> > steps to occur. Expanding on the recent work to publish nightly
>>>> > snapshots [20], can we do a better job caching/sharing compilation
>>>> > artifacts at a more granular level (pre-built assembly JARs at each SHA?
>>>> > pre-built JARs per-maven-module, per-SHA? more granular maven modules,
>>>> > plus the previous two?), or otherwise save some of the considerable
>>>> > amount of redundant compilation work that I had to do over the course of
>>>> > my odyssey this weekend?
>>>> >
>>>> > Ramping up on most projects involves some amount of supplementing the
>>>> > documentation with trial and error to figure out what to run, which
>>>> > "errors" are real errors and which can be ignored, etc., but navigating
>>>> > that minefield on Spark has proved especially challenging and
>>>> > time-consuming for me. Some of that comes directly from scala's
>>>> > relatively slow compilation times and immature build-tooling ecosystem,
>>>> > but that is the world we live in and it would be nice if Spark took the
>>>> > alleviation of the resulting pain more seriously, as one of the more
>>>> > interesting and well-known large scala projects around right now. The
>>>> > official documentation around how to build different subsets of the
>>>> > codebase is somewhat sparse [21], and there have been many mixed [22]
>>>> > accounts [23] on this mailing list about preferred ways to build on mvn
>>>> > vs. sbt (none of which has made it into official documentation, as far
>>>> > as I've seen). Expecting new contributors to piece together all of this
>>>> > received folk-wisdom about how to build/test in a sane way by trawling
>>>> > mailing list archives seems suboptimal.
>>>> >
>>>> > Thanks for reading, looking forward to hearing your ideas!
>>>> >
>>>> > -Ryan
>>>> >
>>>> > P.S. Is "best practice" for emailing this list to not incorporate any
>>>> > HTML in the body? It seems like all of the archives I've seen strip it
>>>> > out, but other people have used it and gmail displays it.
>>>> >
>>>> >
>>>> > [1]
>>>> >
>>>>https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/
>>>> raw/484c2fb8bc0efa0e39d142087eefa9c3d5292ea3/dev%20run-tests:%20fail
>>>> > (57 mins)
>>>> > [2]
>>>> >
>>>>https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/
>>>> raw/ce264e469be3641f061eabd10beb1d71ac243991/mvn%20test:%20fail
>>>> > (6 mins)
>>>> > [3]
>>>> >
>>>>https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/
>>>> raw/6bc76c67aeef9c57ddd9fb2ba260fb4189dbb927/mvn%20test%20case:%
>>>> 20pass%20test,%20fail%20subsequent%20compile
>>>> > (4 mins)
>>>> > [4]
>>>> > https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&
>>>> cd=2&ved=0CCUQFjAB&url=http%3A%2F%2Fapache-spark-user-
>>>> list.1001560.n3.nabble.com%2Fscalac-crash-when-compiling-
>>>> DataTypeConversions-scala-td17083.html&ei=aRF6VJrpNKr-
>>>> iAKDgYGYBQ&usg=AFQjCNHjM9m__Hrumh-ecOsSE00-JkjKBQ&sig2=
>>>> zDeSqOgs02AXJXj78w5I9g&bvm=bv.80642063,d.cGE&cad=rja
>>>> > [5]
>>>> >
>>>>https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/
>>>> raw/4ab0bd6e76d9fc5745eb4b45cdf13195d10efaa2/mvn%20test,%20post%
>>>> 20clean,%20need%20dependencies%20built
>>>> > [6]
>>>> >
>>>>https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/
>>>> raw/f4c7e6fc8c301f869b00598c7b541dac243fb51e/dev%20run-tests,%
>>>> 20post%20clean
>>>> > (50 mins)
>>>> > [7]
>>>> > https://gist.github.com/ryan-williams/57f8bfc9328447fc5b97#
>>>> file-dev-run-tests-failure-too-many-files-open-then-hang-L5260
>>>> > (1hr)
>>>> > [8] https://gist.github.com/ryan-williams/d0164194ad5de03f6e3f (1hr)
>>>> > [9] https://issues.apache.org/jira/browse/SPARK-3867
>>>> > [10] https://gist.github.com/ryan-williams/735adf543124c99647cc
>>>> > [11] https://gist.github.com/ryan-williams/8d149bbcd0c6689ad564
>>>> > [12]
>>>> > https://gist.github.com/ryan-williams/07df5c583c9481fe1c14#
>>>> file-gistfile1-txt-L853
>>>> > (~90 mins)
>>>> > [13]
>>>> > https://gist.github.com/ryan-williams/718f6324af358819b496#
>>>> file-gistfile1-txt-L852
>>>> > (91 mins)
>>>> > [14]
>>>> > https://gist.github.com/ryan-williams/c06c1f4aa0b16f160965#
>>>> file-gistfile1-txt-L854
>>>> > [15] https://gist.github.com/ryan-williams/f8d410b5b9f082039c73
>>>> > [16] https://gist.github.com/ryan-williams/2e94f55c9287938cf745
>>>> > [17]
>>>> > http://apache-spark-user-list.1001560.n3.nabble.com/quot-
>>>> Too-many-open-files-quot-exception-on-reduceByKey-td2462.html
>>>> > [18]
>>>> > http://stackoverflow.com/questions/25707629/why-does-
>>>> spark-job-fail-with-too-many-open-files
>>>> > [19] https://issues.apache.org/jira/browse/SPARK-4002
>>>> > [20] https://issues.apache.org/jira/browse/SPARK-4542
>>>> > [21]
>>>> > https://spark.apache.org/docs/latest/building-with-maven.
>>>> html#spark-tests-in-maven
>>>> > [22] https://www.mail-archive.com/dev@spark.apache.org/msg06443.html
>>>> > [23]
>>>> > http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%
>>>> 3CCAOhmDzeUNhuCr41B7KRPTEwMn4cga_2TNpZrWqQB8REekokxzg@mail.gmail.com%3E
>>>>
>>>>
>>
>>
>



Re: Spurious test failures, testing best practices

Posted by "Ganelin, Ilya" <Il...@capitalone.com>.
Hi, Patrick - with regards to testing on Jenkins, is the process for this
to submit a pull request for the branch or is there another interface we
can use to submit a build to Jenkins for testing?

On 11/30/14, 6:49 PM, "Patrick Wendell" <pw...@gmail.com> wrote:

>>>researching and
>>> > working around it.
>>> > - The original BroadcastSuite failure cost me a few hours and went
>>>away
>>> > before I'd even run `mvn clean`.
>>> > - A new incarnation of the sbt-compiler-crash phenomenon cost me a
>>>few
>>> > hours of running `./dev/run-tests` in different ways before deciding
>>> that,
>>> > as usual, there was no way around it and that I'd need to run `mvn
>>>clean`
>>> > and start running tests from scratch.
>>> > - The YarnClusterSuite failures on my linux box have cost me hours of
>>> > trying to figure out whether they're my fault. I've seen them many
>>>times
>>> > over the past weeks/months, plus or minus other failures that have
>>>come
>>> and
>>> > gone, and was especially befuddled by them when I was seeing a
>>>disjoint
>>> set
>>> > of reproducible failures on my mac [19] (the triaging of which
>>>involved
>>> > dozens of runs of `./dev/run-tests`).
>>> >
>>> > While I'm interested in digging into each of these issues, I also
>>>want to
>>> > discuss the frequency with which I've run into issues like these.
>>>This is
>>> > unfortunately not the first time in recent months that I've spent
>>>days
>>> > playing spurious-test-failure whack-a-mole with a 60-90min
>>>dev/run-tests
>>> > iteration time, which is no fun! So I am wondering/thinking:
>>> >
>>> > - Do other people experience this level of flakiness from spark
>>>tests?
>>> > - Do other people bother running dev/run-tests locally, or just let
>>> Jenkins
>>> > do it during the CR process?
>>> > - Needing to run a full assembly post-clean just to continue running
>>>one
>>> > specific test case feels especially wasteful, and the failure output
>>>when
>>> > naively attempting to run a specific test without having built an
>>> assembly
>>> > jar is not always clear about what the issue is or how to fix it;
>>>even
>>> the
>>> > fact that certain tests require "building the world" is not
>>>something I
>>> > would have expected, and has cost me hours of confusion.
>>> >    - Should a person running spark tests assume that they must build
>>>an
>>> > assembly JAR before running anything?
>>> >    - Are there some proper "unit" tests that are actually
>>>self-contained
>>> /
>>> > able to be run without building an assembly jar?
>>> >    - Can we better document/demarcate which tests have which
>>> dependencies?
>>> >    - Is there something finer-grained than building an assembly JAR
>>>that
>>> > is sufficient in some cases?
>>> >        - If so, can we document that?
>>> >        - If not, can we move to a world of finer-grained
>>>dependencies for
>>> > some of these?
>>> > - Leaving all of these spurious failures aside, the process of
>>>assembling
>>> > and testing a new JAR is not a quick one (40 and 60 mins for me
>>> typically,
>>> > respectively). I would guess that there are dozens (hundreds?) of
>>>people
>>> > who build a Spark assembly from various ToTs on any given day, and
>>>who
>>> all
>>> > wait on the exact same compilation / assembly steps to occur.
>>>Expanding
>>> on
>>> > the recent work to publish nightly snapshots [20], can we do a
>>>better job
>>> > caching/sharing compilation artifacts at a more granular level
>>>(pre-built
>>> > assembly JARs at each SHA? pre-built JARs per-maven-module, per-SHA?
>>>more
>>> > granular maven modules, plus the previous two?), or otherwise save
>>>some
>>> of
>>> > the considerable amount of redundant compilation work that I had to
>>>do
>>> over
>>> > the course of my odyssey this weekend?
>>> >
>>> > Ramping up on most projects involves some amount of supplementing the
>>> > documentation with trial and error to figure out what to run, which
>>> > "errors" are real errors and which can be ignored, etc., but
>>>navigating
>>> > that minefield on Spark has proved especially challenging and
>>> > time-consuming for me. Some of that comes directly from scala's
>>> relatively
>>> > slow compilation times and immature build-tooling ecosystem, but
>>>that is
>>> > the world we live in and it would be nice if Spark took the
>>>alleviation
>>> of
>>> > the resulting pain more seriously, as one of the more interesting and
>>> > well-known large scala projects around right now. The official
>>> > documentation around how to build different subsets of the codebase
>>>is
>>> > somewhat sparse [21], and there have been many mixed [22] accounts
>>>[23]
>>> on
>>> > this mailing list about preferred ways to build on mvn vs. sbt (none
>>>of
>>> > which has made it into official documentation, as far as I've seen).
>>> > Expecting new contributors to piece together all of this received
>>> > folk-wisdom about how to build/test in a sane way by trawling mailing
>>> list
>>> > archives seems suboptimal.
>>> >
>>> > Thanks for reading, looking forward to hearing your ideas!
>>> >
>>> > -Ryan
>>> >
>>> > P.S. Is "best practice" for emailing this list to not incorporate any
>>> HTML
>>> > in the body? It seems like all of the archives I've seen strip it
>>>out,
>>> but
>>> > other people have used it and gmail displays it.
>>> >
>>> >
>>> > [1]
>>> >
>>>https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/
>>> raw/484c2fb8bc0efa0e39d142087eefa9c3d5292ea3/dev%20run-tests:%20fail
>>> > (57 mins)
>>> > [2]
>>> >
>>>https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/
>>> raw/ce264e469be3641f061eabd10beb1d71ac243991/mvn%20test:%20fail
>>> > (6 mins)
>>> > [3]
>>> >
>>>https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/
>>> raw/6bc76c67aeef9c57ddd9fb2ba260fb4189dbb927/mvn%20test%20case:%
>>> 20pass%20test,%20fail%20subsequent%20compile
>>> > (4 mins)
>>> > [4]
>>> > https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&
>>> cd=2&ved=0CCUQFjAB&url=http%3A%2F%2Fapache-spark-user-
>>> list.1001560.n3.nabble.com%2Fscalac-crash-when-compiling-
>>> DataTypeConversions-scala-td17083.html&ei=aRF6VJrpNKr-
>>> iAKDgYGYBQ&usg=AFQjCNHjM9m__Hrumh-ecOsSE00-JkjKBQ&sig2=
>>> zDeSqOgs02AXJXj78w5I9g&bvm=bv.80642063,d.cGE&cad=rja
>>> > [5]
>>> >
>>>https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/
>>> raw/4ab0bd6e76d9fc5745eb4b45cdf13195d10efaa2/mvn%20test,%20post%
>>> 20clean,%20need%20dependencies%20built
>>> > [6]
>>> >
>>>https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/
>>> raw/f4c7e6fc8c301f869b00598c7b541dac243fb51e/dev%20run-tests,%
>>> 20post%20clean
>>> > (50 mins)
>>> > [7]
>>> > https://gist.github.com/ryan-williams/57f8bfc9328447fc5b97#
>>> file-dev-run-tests-failure-too-many-files-open-then-hang-L5260
>>> > (1hr)
>>> > [8] https://gist.github.com/ryan-williams/d0164194ad5de03f6e3f (1hr)
>>> > [9] https://issues.apache.org/jira/browse/SPARK-3867
>>> > [10] https://gist.github.com/ryan-williams/735adf543124c99647cc
>>> > [11] https://gist.github.com/ryan-williams/8d149bbcd0c6689ad564
>>> > [12]
>>> > https://gist.github.com/ryan-williams/07df5c583c9481fe1c14#
>>> file-gistfile1-txt-L853
>>> > (~90 mins)
>>> > [13]
>>> > https://gist.github.com/ryan-williams/718f6324af358819b496#
>>> file-gistfile1-txt-L852
>>> > (91 mins)
>>> > [14]
>>> > https://gist.github.com/ryan-williams/c06c1f4aa0b16f160965#
>>> file-gistfile1-txt-L854
>>> > [15] https://gist.github.com/ryan-williams/f8d410b5b9f082039c73
>>> > [16] https://gist.github.com/ryan-williams/2e94f55c9287938cf745
>>> > [17]
>>> > http://apache-spark-user-list.1001560.n3.nabble.com/quot-
>>> Too-many-open-files-quot-exception-on-reduceByKey-td2462.html
>>> > [18]
>>> > http://stackoverflow.com/questions/25707629/why-does-
>>> spark-job-fail-with-too-many-open-files
>>> > [19] https://issues.apache.org/jira/browse/SPARK-4002
>>> > [20] https://issues.apache.org/jira/browse/SPARK-4542
>>> > [21]
>>> > https://spark.apache.org/docs/latest/building-with-maven.
>>> html#spark-tests-in-maven
>>> > [22] https://www.mail-archive.com/dev@spark.apache.org/msg06443.html
>>> > [23]
>>> > http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%
>>> 3CCAOhmDzeUNhuCr41B7KRPTEwMn4cga_2TNpZrWqQB8REekokxzg@mail.gmail.com%3E
>>>
>>>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>For additional commands, e-mail: dev-help@spark.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: Spurious test failures, testing best practices

Posted by Patrick Wendell <pw...@gmail.com>.
Hey Ryan,

A few more things here. You should feel free to send patches to
Jenkins to test them, since this is the reference environment in which
we regularly run tests. This is the normal workflow for most
developers and we spend a lot of effort provisioning/maintaining a
very large jenkins cluster to allow developers to access this resource. A
common development approach is to locally run tests that you've added
in a patch, then send it to jenkins for the full run, and then try to
debug locally if you see specific unanticipated test failures.

One challenge we have is that given the proliferation of OS versions,
Java versions, Python versions, ulimits, etc. there is a combinatorial
number of environments in which tests could be run. It is very hard in
some cases to figure out post-hoc why a given test is not working in a
specific environment. I think a good solution here would be to use a
standardized docker container for running Spark tests and asking folks
to use that locally if they are trying to run all of the hundreds of
Spark tests.
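
To be concrete, nothing like this exists today, and the image name below
is made up, but the idea would be an invocation along these lines:

```
docker run --rm -v "$PWD":/spark -w /spark \
  --ulimit nofile=8192:8192 \
  spark-test-env:jdk7-py2.6 \
  ./dev/run-tests
```

Pinning the JDK, Python version, and ulimit inside one image would remove
most of the environment variance described earlier in this thread.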

Another solution would be to mock out every system interaction in
Spark's tests including e.g. filesystem interactions to try and reduce
variance across environments. However, that seems difficult.

As the number of developers of Spark increases, it's definitely a good
idea for us to invest in developer infrastructure including things
like snapshot releases, better documentation, etc. Thanks for bringing
this up as a pain point.

- Patrick


On Sun, Nov 30, 2014 at 3:35 PM, Ryan Williams
<ry...@gmail.com> wrote:
> thanks for the info, Matei and Brennon. I will try to switch my workflow to
> using sbt. Other potential action items:
>
> - currently the docs only contain information about building with maven,
> and even then don't cover many important cases, as I described in my
> previous email. If SBT is as much better as you've described then that
> should be made much more obvious. Wasn't it the case recently that there
> was only a page about building with SBT, and not one about building with
> maven? Clearer messaging around this needs to exist in the documentation,
> not just on the mailing list, imho.
>
> - +1 to better distinguishing between unit and integration tests, having
> separate scripts for each, improving documentation around common workflows,
> expectations of brittleness with each kind of test, advisability of just
> relying on Jenkins for certain kinds of tests to not waste too much time,
> etc. Things like the compiler crash should be discussed in the
> documentation, not just in the mailing list archives, if new contributors
> are likely to run into them through no fault of their own.
>
> - What is the algorithm you use to decide what tests you might have broken?
> Can we codify it in some scripts that other people can use?
>
>
>
> On Sun Nov 30 2014 at 4:06:41 PM Matei Zaharia <ma...@gmail.com>
> wrote:
>
>> Hi Ryan,
>>
>> As a tip (and maybe this isn't documented well), I normally use SBT for
>> development to avoid the slow build process, and use its interactive
>> console to run only specific tests. The nice advantage is that SBT can keep
>> the Scala compiler loaded and JITed across builds, making it faster to
>> iterate. To use it, you can do the following:
>>
>> - Start the SBT interactive console with sbt/sbt
>> - Build your assembly by running the "assembly" target in the assembly
>> project: assembly/assembly
>> - Run all the tests in one module: core/test
>> - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite (this
>> also supports tab completion)
>>
>> Running all the tests does take a while, and I usually just rely on
>> Jenkins for that once I've run the tests for the things I believed my patch
>> could break. But this is because some of them are integration tests (e.g.
>> DistributedSuite, which creates multi-process mini-clusters). Many of the
>> individual suites run fast without requiring this, however, so you can pick
>> the ones you want. Perhaps we should find a way to tag them so people can
>> do a "quick-test" that skips the integration ones.
>>
>> The assembly builds are annoying but they only take about a minute for me
>> on a MacBook Pro with SBT warmed up. The assembly is actually only required
>> for some of the "integration" tests (which launch new processes), but I'd
>> recommend doing it all the time anyway since it would be very confusing to
>> run those with an old assembly. The Scala compiler crash issue can also be
>> a problem, but I don't see it very often with SBT. If it happens, I exit
>> SBT and do sbt clean.
>>
>> Anyway, this is useful feedback and I think we should try to improve some
>> of these suites, but hopefully you can also try the faster SBT process. At
>> the end of the day, if we want integration tests, the whole test process
>> will take an hour, but most of the developers I know leave that to Jenkins
>> and only run individual tests locally before submitting a patch.
>>
>> Matei
>>
>>
>> > On Nov 30, 2014, at 2:39 PM, Ryan Williams <
>> ryan.blake.williams@gmail.com> wrote:
>> >
>> > In the course of trying to make contributions to Spark, I have had a lot
>> of
>> > trouble running Spark's tests successfully. The main pain points I've
>> > experienced are:
>> >
>> >    1) frequent, spurious test failures
>> >    2) high latency of running tests
>> >    3) difficulty running specific tests in an iterative fashion
>> >
>> > Here is an example series of failures that I encountered this weekend
>> > (along with footnote links to the console output from each and
>> > approximately how long each took):
>> >
>> > - `./dev/run-tests` [1]: failure in BroadcastSuite that I've not seen
>> > before.
>> > - `mvn '-Dsuites=*BroadcastSuite*' test` [2]: same failure.
>> > - `mvn '-Dsuites=*BroadcastSuite* Unpersisting' test` [3]: BroadcastSuite
>> > passed, but scala compiler crashed on the "catalyst" project.
>> > - `mvn clean`: some attempts to run earlier commands (that previously
>> > didn't crash the compiler) all result in the same compiler crash.
>> Previous
>> > discussion on this list implies this can only be solved by a `mvn clean`
>> > [4].
>> > - `mvn '-Dsuites=*BroadcastSuite*' test` [5]: immediately post-clean,
>> > BroadcastSuite can't run because assembly is not built.
>> > - `./dev/run-tests` again [6]: pyspark tests fail, some messages about
>> > version mismatches and python 2.6. The machine this ran on has python
>> 2.7,
>> > so I don't know what that's about.
>> > - `./dev/run-tests` again [7]: "too many open files" errors in several
>> > tests. `ulimit -a` shows a maximum of 4864 open files. Apparently this is
>> > not enough, but only some of the time? I increased it to 8192 and tried
>> > again.
>> > - `./dev/run-tests` again [8]: same pyspark errors as before. This seems
>> to
>> > be the issue from SPARK-3867 [9], which was supposedly fixed on October
>> 14;
>> > not sure how I'm seeing it now. In any case, switched to Python 2.6 and
>> > installed unittest2, and python/run-tests seems to be unblocked.
>> > - `./dev/run-tests` again [10]: finally passes!
>> >
>> > This was on a spark checkout at ceb6281 (ToT Friday), with a few trivial
>> > changes added on (that I wanted to test before sending out a PR), on a
>> > macbook running OSX Yosemite (10.10.1), java 1.8 and mvn 3.2.3 [11].
>> >
>> > Meanwhile, on a linux 2.6.32 / CentOS 6.4 machine, I tried similar
>> commands
>> > from the same repo state:
>> >
>> > - `./dev/run-tests` [12]: YarnClusterSuite failure.
>> > - `./dev/run-tests` [13]: same YarnClusterSuite failure. I know I've seen
>> > this one before on this machine and am guessing it actually occurs every
>> > time.
>> > - `./dev/run-tests` [14]: to be sure, I reverted my changes, ran one more
>> > time from ceb6281, and saw the same failure.
>> >
>> > This was with java 1.7 and maven 3.2.3 [15]. In one final attempt to
>> narrow
>> > down the linux YarnClusterSuite failure, I ran `./dev/run-tests` on my
>> mac,
>> > from ceb6281, with java 1.7 (instead of 1.8, which the previous runs
>> used),
>> > and it passed [16], so the failure seems specific to my linux
>> machine/arch.
>> >
>> > At this point I believe that my changes don't break any tests (the
>> > YarnClusterSuite failure on my linux presumably not being... "real"),
>> and I
>> > am ready to send out a PR. Whew!
>> >
>> > However, reflecting on the 5 or 6 distinct failure-modes represented
>> above:
>> >
>> > - One of them (too many open files) is something I can (and did,
>> > hopefully) fix once and for all. It cost me an ~hour this time
>> (approximate
>> > time of running ./dev/run-tests) and a few hours other times when I
>> didn't
>> > fully understand/fix it. It doesn't happen deterministically (why?), but
>> > does happen somewhat frequently to people, having been discussed on the
>> > user list multiple times [17] and on SO [18]. Maybe some note in the
>> > documentation advising people to check their ulimit makes sense?
>> > - One of them (unittest2 must be installed for python 2.6) was supposedly
>> > fixed upstream of the commits I tested here; I don't know why I'm still
>> > running into it. This cost me a few hours of running `./dev/run-tests`
>> > multiple times to see if it was transient, plus some time researching and
>> > working around it.
>> > - The original BroadcastSuite failure cost me a few hours and went away
>> > before I'd even run `mvn clean`.
>> > - A new incarnation of the sbt-compiler-crash phenomenon cost me a few
>> > hours of running `./dev/run-tests` in different ways before deciding
>> that,
>> > as usual, there was no way around it and that I'd need to run `mvn clean`
>> > and start running tests from scratch.
>> > - The YarnClusterSuite failures on my linux box have cost me hours of
>> > trying to figure out whether they're my fault. I've seen them many times
>> > over the past weeks/months, plus or minus other failures that have come
>> and
>> > gone, and was especially befuddled by them when I was seeing a disjoint
>> set
>> > of reproducible failures on my mac [19] (the triaging of which involved
>> > dozens of runs of `./dev/run-tests`).
>> >
>> > While I'm interested in digging into each of these issues, I also want to
>> > discuss the frequency with which I've run into issues like these. This is
>> > unfortunately not the first time in recent months that I've spent days
>> > playing spurious-test-failure whack-a-mole with a 60-90min dev/run-tests
>> > iteration time, which is no fun! So I am wondering/thinking:
>> >
>> > - Do other people experience this level of flakiness from spark tests?
>> > - Do other people bother running dev/run-tests locally, or just let
>> Jenkins
>> > do it during the CR process?
>> > - Needing to run a full assembly post-clean just to continue running one
>> > specific test case feels especially wasteful, and the failure output when
>> > naively attempting to run a specific test without having built an
>> assembly
>> > jar is not always clear about what the issue is or how to fix it; even
>> the
>> > fact that certain tests require "building the world" is not something I
>> > would have expected, and has cost me hours of confusion.
>> >    - Should a person running spark tests assume that they must build an
>> > assembly JAR before running anything?
>> >    - Are there some proper "unit" tests that are actually self-contained
>> /
>> > able to be run without building an assembly jar?
>> >    - Can we better document/demarcate which tests have which
>> dependencies?
>> >    - Is there something finer-grained than building an assembly JAR that
>> > is sufficient in some cases?
>> >        - If so, can we document that?
>> >        - If not, can we move to a world of finer-grained dependencies for
>> > some of these?
>> > - Leaving all of these spurious failures aside, the process of assembling
>> > and testing a new JAR is not a quick one (40 and 60 mins for me
>> typically,
>> > respectively). I would guess that there are dozens (hundreds?) of people
>> > who build a Spark assembly from various ToTs on any given day, and who
>> all
>> > wait on the exact same compilation / assembly steps to occur. Expanding
>> on
>> > the recent work to publish nightly snapshots [20], can we do a better job
>> > caching/sharing compilation artifacts at a more granular level (pre-built
>> > assembly JARs at each SHA? pre-built JARs per-maven-module, per-SHA? more
>> > granular maven modules, plus the previous two?), or otherwise save some
>> of
>> > the considerable amount of redundant compilation work that I had to do
>> over
>> > the course of my odyssey this weekend?
>> >
>> > Ramping up on most projects involves some amount of supplementing the
>> > documentation with trial and error to figure out what to run, which
>> > "errors" are real errors and which can be ignored, etc., but navigating
>> > that minefield on Spark has proved especially challenging and
>> > time-consuming for me. Some of that comes directly from scala's
>> relatively
>> > slow compilation times and immature build-tooling ecosystem, but that is
>> > the world we live in and it would be nice if Spark took the alleviation
>> of
>> > the resulting pain more seriously, as one of the more interesting and
>> > well-known large scala projects around right now. The official
>> > documentation around how to build different subsets of the codebase is
>> > somewhat sparse [21], and there have been many mixed [22] accounts [23]
>> on
>> > this mailing list about preferred ways to build on mvn vs. sbt (none of
>> > which has made it into official documentation, as far as I've seen).
>> > Expecting new contributors to piece together all of this received
>> > folk-wisdom about how to build/test in a sane way by trawling mailing
>> list
>> > archives seems suboptimal.
>> >
>> > Thanks for reading, looking forward to hearing your ideas!
>> >
>> > -Ryan
>> >
>> > P.S. Is "best practice" for emailing this list to not incorporate any
>> HTML
>> > in the body? It seems like all of the archives I've seen strip it out,
>> but
>> > other people have used it and gmail displays it.
>> >
>> >
>> > [1]
>> > https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/
>> raw/484c2fb8bc0efa0e39d142087eefa9c3d5292ea3/dev%20run-tests:%20fail
>> > (57 mins)
>> > [2]
>> > https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/
>> raw/ce264e469be3641f061eabd10beb1d71ac243991/mvn%20test:%20fail
>> > (6 mins)
>> > [3]
>> > https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/
>> raw/6bc76c67aeef9c57ddd9fb2ba260fb4189dbb927/mvn%20test%20case:%
>> 20pass%20test,%20fail%20subsequent%20compile
>> > (4 mins)
>> > [4]
>> > https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&
>> cd=2&ved=0CCUQFjAB&url=http%3A%2F%2Fapache-spark-user-
>> list.1001560.n3.nabble.com%2Fscalac-crash-when-compiling-
>> DataTypeConversions-scala-td17083.html&ei=aRF6VJrpNKr-
>> iAKDgYGYBQ&usg=AFQjCNHjM9m__Hrumh-ecOsSE00-JkjKBQ&sig2=
>> zDeSqOgs02AXJXj78w5I9g&bvm=bv.80642063,d.cGE&cad=rja
>> > [5]
>> > https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/
>> raw/4ab0bd6e76d9fc5745eb4b45cdf13195d10efaa2/mvn%20test,%20post%
>> 20clean,%20need%20dependencies%20built
>> > [6]
>> > https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/
>> raw/f4c7e6fc8c301f869b00598c7b541dac243fb51e/dev%20run-tests,%
>> 20post%20clean
>> > (50 mins)
>> > [7]
>> > https://gist.github.com/ryan-williams/57f8bfc9328447fc5b97#
>> file-dev-run-tests-failure-too-many-files-open-then-hang-L5260
>> > (1hr)
>> > [8] https://gist.github.com/ryan-williams/d0164194ad5de03f6e3f (1hr)
>> > [9] https://issues.apache.org/jira/browse/SPARK-3867
>> > [10] https://gist.github.com/ryan-williams/735adf543124c99647cc
>> > [11] https://gist.github.com/ryan-williams/8d149bbcd0c6689ad564
>> > [12]
>> > https://gist.github.com/ryan-williams/07df5c583c9481fe1c14#
>> file-gistfile1-txt-L853
>> > (~90 mins)
>> > [13]
>> > https://gist.github.com/ryan-williams/718f6324af358819b496#
>> file-gistfile1-txt-L852
>> > (91 mins)
>> > [14]
>> > https://gist.github.com/ryan-williams/c06c1f4aa0b16f160965#
>> file-gistfile1-txt-L854
>> > [15] https://gist.github.com/ryan-williams/f8d410b5b9f082039c73
>> > [16] https://gist.github.com/ryan-williams/2e94f55c9287938cf745
>> > [17]
>> > http://apache-spark-user-list.1001560.n3.nabble.com/quot-
>> Too-many-open-files-quot-exception-on-reduceByKey-td2462.html
>> > [18]
>> > http://stackoverflow.com/questions/25707629/why-does-
>> spark-job-fail-with-too-many-open-files
>> > [19] https://issues.apache.org/jira/browse/SPARK-4002
>> > [20] https://issues.apache.org/jira/browse/SPARK-4542
>> > [21]
>> > https://spark.apache.org/docs/latest/building-with-maven.
>> html#spark-tests-in-maven
>> > [22] https://www.mail-archive.com/dev@spark.apache.org/msg06443.html
>> > [23]
>> > http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%
>> 3CCAOhmDzeUNhuCr41B7KRPTEwMn4cga_2TNpZrWqQB8REekokxzg@mail.gmail.com%3E
>>
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: Spurious test failures, testing best practices

Posted by Ryan Williams <ry...@gmail.com>.
thanks for the info, Matei and Brennon. I will try to switch my workflow to
using sbt. Other potential action items:

- currently the docs only contain information about building with maven,
and even then don't cover many important cases, as I described in my
previous email. If SBT is as much better as you've described then that
should be made much more obvious. Wasn't it the case recently that there
was only a page about building with SBT, and not one about building with
maven? Clearer messaging around this needs to exist in the documentation,
not just on the mailing list, imho.

- +1 to better distinguishing between unit and integration tests, having
separate scripts for each, improving documentation around common workflows,
expectations of brittleness with each kind of test, advisability of just
relying on Jenkins for certain kinds of tests to not waste too much time,
etc. Things like the compiler crash should be discussed in the
documentation, not just in the mailing list archives, if new contributors
are likely to run into them through no fault of their own.

- What is the algorithm you use to decide what tests you might have broken?
Can we codify it in some scripts that other people can use?
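
For instance, a rough sketch I just made up (not an existing script): map
the file paths a patch touches to their top-level modules, then run only
those modules' tests:

```shell
# Hypothetical helper: given changed file paths on stdin, print the
# top-level modules they belong to, deduplicated.
changed_modules() {
  cut -d/ -f1 | sort -u
}

# Example: pretend these files were touched by a patch.
printf '%s\n' \
  core/src/main/scala/org/apache/spark/rdd/RDD.scala \
  core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala \
  sql/catalyst/src/main/scala/Foo.scala \
  | changed_modules
# prints "core" then "sql"
```

In a real script the input would come from something like `git diff
--name-only <base>...HEAD`, and each printed module would be fed to `mvn
-pl <module> test` (or the sbt equivalent).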



On Sun Nov 30 2014 at 4:06:41 PM Matei Zaharia <ma...@gmail.com>
wrote:

> Hi Ryan,
>
> As a tip (and maybe this isn't documented well), I normally use SBT for
> development to avoid the slow build process, and use its interactive
> console to run only specific tests. The nice advantage is that SBT can keep
> the Scala compiler loaded and JITed across builds, making it faster to
> iterate. To use it, you can do the following:
>
> - Start the SBT interactive console with sbt/sbt
> - Build your assembly by running the "assembly" target in the assembly
> project: assembly/assembly
> - Run all the tests in one module: core/test
> - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite (this
> also supports tab completion)
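
(Condensing that into the loop I'll try, with the suite name taken straight
from Matei's example:)

```
$ sbt/sbt              # start the interactive console once; keeps scalac warm
> assembly/assembly    # build the assembly (needed by the integration tests)
> core/test            # run every test in the core module
> core/test-only org.apache.spark.rdd.RDDSuite   # run a single suite
```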
>
> Running all the tests does take a while, and I usually just rely on
> Jenkins for that once I've run the tests for the things I believed my patch
> could break. But this is because some of them are integration tests (e.g.
> DistributedSuite, which creates multi-process mini-clusters). Many of the
> individual suites run fast without requiring this, however, so you can pick
> the ones you want. Perhaps we should find a way to tag them so people can
> do a "quick-test" that skips the integration ones.
>
> The assembly builds are annoying but they only take about a minute for me
> on a MacBook Pro with SBT warmed up. The assembly is actually only required
> for some of the "integration" tests (which launch new processes), but I'd
> recommend doing it all the time anyway since it would be very confusing to
> run those with an old assembly. The Scala compiler crash issue can also be
> a problem, but I don't see it very often with SBT. If it happens, I exit
> SBT and do sbt clean.
>
> Anyway, this is useful feedback and I think we should try to improve some
> of these suites, but hopefully you can also try the faster SBT process. At
> the end of the day, if we want integration tests, the whole test process
> will take an hour, but most of the developers I know leave that to Jenkins
> and only run individual tests locally before submitting a patch.
>
> Matei
>
>
> > changes added on (that I wanted to test before sending out a PR), on a
> > macbook running OSX Yosemite (10.10.1), java 1.8 and mvn 3.2.3 [11].
> >
> > Meanwhile, on a linux 2.6.32 / CentOS 6.4 machine, I tried similar
> commands
> > from the same repo state:
> >
> > - `./dev/run-tests` [12]: YarnClusterSuite failure.
> > - `./dev/run-tests` [13]: same YarnClusterSuite failure. I know I've seen
> > this one before on this machine and am guessing it actually occurs every
> > time.
> > - `./dev/run-tests` [14]: to be sure, I reverted my changes, ran one more
> > time from ceb6281, and saw the same failure.
> >
> > This was with java 1.7 and maven 3.2.3 [15]. In one final attempt to
> narrow
> > down the linux YarnClusterSuite failure, I ran `./dev/run-tests` on my
> mac,
> > from ceb6281, with java 1.7 (instead of 1.8, which the previous runs
> used),
> > and it passed [16], so the failure seems specific to my linux
> machine/arch.
> >
> > At this point I believe that my changes don't break any tests (the
> > YarnClusterSuite failure on my linux presumably not being... "real"),
> and I
> > am ready to send out a PR. Whew!
> >
> > However, reflecting on the 5 or 6 distinct failure-modes represented
> above:
> >
> > - One of them (too many files open), is something I can (and did,
> > hopefully) fix once and for all. It cost me an ~hour this time
> (approximate
> > time of running ./dev/run-tests) and a few hours other times when I
> didn't
> > fully understand/fix it. It doesn't happen deterministically (why?), but
> > does happen somewhat frequently to people, having been discussed on the
> > user list multiple times [17] and on SO [18]. Maybe some note in the
> > documentation advising people to check their ulimit makes sense?
> > - One of them (unittest2 must be installed for python 2.6) was supposedly
> > fixed upstream of the commits I tested here; I don't know why I'm still
> > running into it. This cost me a few hours of running `./dev/run-tests`
> > multiple times to see if it was transient, plus some time researching and
> > working around it.
> > - The original BroadcastSuite failure cost me a few hours and went away
> > before I'd even run `mvn clean`.
> > - A new incarnation of the sbt-compiler-crash phenomenon cost me a few
> > hours of running `./dev/run-tests` in different ways before deciding
> that,
> > as usual, there was no way around it and that I'd need to run `mvn clean`
> > and start running tests from scratch.
> > - The YarnClusterSuite failures on my linux box have cost me hours of
> > trying to figure out whether they're my fault. I've seen them many times
> > over the past weeks/months, plus or minus other failures that have come
> and
> > gone, and was especially befuddled by them when I was seeing a disjoint
> set
> > of reproducible failures on my mac [19] (the triaging of which involved
> > dozens of runs of `./dev/run-tests`).
> >
> > While I'm interested in digging into each of these issues, I also want to
> > discuss the frequency with which I've run into issues like these. This is
> > unfortunately not the first time in recent months that I've spent days
> > playing spurious-test-failure whack-a-mole with a 60-90min dev/run-tests
> > iteration time, which is no fun! So I am wondering/thinking:
> >
> > - Do other people experience this level of flakiness from spark tests?
> > - Do other people bother running dev/run-tests locally, or just let
> Jenkins
> > do it during the CR process?
> > - Needing to run a full assembly post-clean just to continue running one
> > specific test case feels especially wasteful, and the failure output when
> > naively attempting to run a specific test without having built an
> assembly
> > jar is not always clear about what the issue is or how to fix it; even
> the
> > fact that certain tests require "building the world" is not something I
> > would have expected, and has cost me hours of confusion.
> >    - Should a person running spark tests assume that they must build an
> > assembly JAR before running anything?
> >    - Are there some proper "unit" tests that are actually self-contained
> /
> > able to be run without building an assembly jar?
> >    - Can we better document/demarcate which tests have which
> dependencies?
> >    - Is there something finer-grained than building an assembly JAR that
> > is sufficient in some cases?
> >        - If so, can we document that?
> >        - If not, can we move to a world of finer-grained dependencies for
> > some of these?
> > - Leaving all of these spurious failures aside, the process of assembling
> > and testing a new JAR is not a quick one (40 and 60 mins for me
> typically,
> > respectively). I would guess that there are dozens (hundreds?) of people
> > who build a Spark assembly from various ToTs on any given day, and who
> all
> > wait on the exact same compilation / assembly steps to occur. Expanding
> on
> > the recent work to publish nightly snapshots [20], can we do a better job
> > caching/sharing compilation artifacts at a more granular level (pre-built
> > assembly JARs at each SHA? pre-built JARs per-maven-module, per-SHA? more
> > granular maven modules, plus the previous two?), or otherwise save some
> of
> > the considerable amount of redundant compilation work that I had to do
> over
> > the course of my odyssey this weekend?
> >
> > Ramping up on most projects involves some amount of supplementing the
> > documentation with trial and error to figure out what to run, which
> > "errors" are real errors and which can be ignored, etc., but navigating
> > that minefield on Spark has proved especially challenging and
> > time-consuming for me. Some of that comes directly from scala's
> relatively
> > slow compilation times and immature build-tooling ecosystem, but that is
> > the world we live in and it would be nice if Spark took the alleviation
> of
> > the resulting pain more seriously, as one of the more interesting and
> > well-known large scala projects around right now. The official
> > documentation around how to build different subsets of the codebase is
> > somewhat sparse [21], and there have been many mixed [22] accounts [23]
> on
> > this mailing list about preferred ways to build on mvn vs. sbt (none of
> > which has made it into official documentation, as far as I've seen).
> > Expecting new contributors to piece together all of this received
> > folk-wisdom about how to build/test in a sane way by trawling mailing
> list
> > archives seems suboptimal.
> >
> > Thanks for reading, looking forward to hearing your ideas!
> >
> > -Ryan
> >
> > P.S. Is "best practice" for emailing this list to not incorporate any
> HTML
> > in the body? It seems like all of the archives I've seen strip it out,
> but
> > other people have used it and gmail displays it.
> >
> >
> > [1]
> > https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/
> raw/484c2fb8bc0efa0e39d142087eefa9c3d5292ea3/dev%20run-tests:%20fail
> > (57 mins)
> > [2]
> > https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/
> raw/ce264e469be3641f061eabd10beb1d71ac243991/mvn%20test:%20fail
> > (6 mins)
> > [3]
> > https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/
> raw/6bc76c67aeef9c57ddd9fb2ba260fb4189dbb927/mvn%20test%20case:%
> 20pass%20test,%20fail%20subsequent%20compile
> > (4 mins)
> > [4]
> > https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&
> cd=2&ved=0CCUQFjAB&url=http%3A%2F%2Fapache-spark-user-
> list.1001560.n3.nabble.com%2Fscalac-crash-when-compiling-
> DataTypeConversions-scala-td17083.html&ei=aRF6VJrpNKr-
> iAKDgYGYBQ&usg=AFQjCNHjM9m__Hrumh-ecOsSE00-JkjKBQ&sig2=
> zDeSqOgs02AXJXj78w5I9g&bvm=bv.80642063,d.cGE&cad=rja
> > [5]
> > https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/
> raw/4ab0bd6e76d9fc5745eb4b45cdf13195d10efaa2/mvn%20test,%20post%
> 20clean,%20need%20dependencies%20built
> > [6]
> > https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/
> raw/f4c7e6fc8c301f869b00598c7b541dac243fb51e/dev%20run-tests,%
> 20post%20clean
> > (50 mins)
> > [7]
> > https://gist.github.com/ryan-williams/57f8bfc9328447fc5b97#
> file-dev-run-tests-failure-too-many-files-open-then-hang-L5260
> > (1hr)
> > [8] https://gist.github.com/ryan-williams/d0164194ad5de03f6e3f (1hr)
> > [9] https://issues.apache.org/jira/browse/SPARK-3867
> > [10] https://gist.github.com/ryan-williams/735adf543124c99647cc
> > [11] https://gist.github.com/ryan-williams/8d149bbcd0c6689ad564
> > [12]
> > https://gist.github.com/ryan-williams/07df5c583c9481fe1c14#
> file-gistfile1-txt-L853
> > (~90 mins)
> > [13]
> > https://gist.github.com/ryan-williams/718f6324af358819b496#
> file-gistfile1-txt-L852
> > (91 mins)
> > [14]
> > https://gist.github.com/ryan-williams/c06c1f4aa0b16f160965#
> file-gistfile1-txt-L854
> > [15] https://gist.github.com/ryan-williams/f8d410b5b9f082039c73
> > [16] https://gist.github.com/ryan-williams/2e94f55c9287938cf745
> > [17]
> > http://apache-spark-user-list.1001560.n3.nabble.com/quot-
> Too-many-open-files-quot-exception-on-reduceByKey-td2462.html
> > [18]
> > http://stackoverflow.com/questions/25707629/why-does-
> spark-job-fail-with-too-many-open-files
> > [19] https://issues.apache.org/jira/browse/SPARK-4002
> > [20] https://issues.apache.org/jira/browse/SPARK-4542
> > [21]
> > https://spark.apache.org/docs/latest/building-with-maven.
> html#spark-tests-in-maven
> > [22] https://www.mail-archive.com/dev@spark.apache.org/msg06443.html
> > [23]
> > http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%
> 3CCAOhmDzeUNhuCr41B7KRPTEwMn4cga_2TNpZrWqQB8REekokxzg@mail.gmail.com%3E
>
>

Re: Spurious test failures, testing best practices

Posted by Matei Zaharia <ma...@gmail.com>.
Hi Ryan,

As a tip (and maybe this isn't documented well), I normally use SBT for development to avoid the slow build process, and use its interactive console to run only specific tests. The nice advantage is that SBT can keep the Scala compiler loaded and JITed across builds, making it faster to iterate. To use it, you can do the following:

- Start the SBT interactive console with sbt/sbt
- Build your assembly by running the "assembly" target in the assembly project: assembly/assembly
- Run all the tests in one module: core/test
- Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite (this also supports tab completion)

Running all the tests does take a while, and I usually just rely on Jenkins for that once I've run the tests for the things I believed my patch could break. But this is because some of them are integration tests (e.g. DistributedSuite, which creates multi-process mini-clusters). Many of the individual suites run fast without requiring this, however, so you can pick the ones you want. Perhaps we should find a way to tag them so people can do a "quick-test" that skips the integration ones.

The assembly builds are annoying but they only take about a minute for me on a MacBook Pro with SBT warmed up. The assembly is actually only required for some of the "integration" tests (which launch new processes), but I'd recommend doing it all the time anyway since it would be very confusing to run those with an old assembly. The Scala compiler crash issue can also be a problem, but I don't see it very often with SBT. If it happens, I exit SBT and do sbt clean.

Anyway, this is useful feedback and I think we should try to improve some of these suites, but hopefully you can also try the faster SBT process. At the end of the day, if we want integration tests, the whole test process will take an hour, but most of the developers I know leave that to Jenkins and only run individual tests locally before submitting a patch.

Matei


> On Nov 30, 2014, at 2:39 PM, Ryan Williams <ry...@gmail.com> wrote:
> 
> In the course of trying to make contributions to Spark, I have had a lot of
> trouble running Spark's tests successfully. The main pain points I've
> experienced are:
> 
>    1) frequent, spurious test failures
>    2) high latency of running tests
>    3) difficulty running specific tests in an iterative fashion
> 
> Here is an example series of failures that I encountered this weekend
> (along with footnote links to the console output from each and
> approximately how long each took):
> 
> - `./dev/run-tests` [1]: failure in BroadcastSuite that I've not seen
> before.
> - `mvn '-Dsuites=*BroadcastSuite*' test` [2]: same failure.
> - `mvn '-Dsuites=*BroadcastSuite* Unpersisting' test` [3]: BroadcastSuite
> passed, but scala compiler crashed on the "catalyst" project.
> - `mvn clean`: some attempts to run earlier commands (that previously
> didn't crash the compiler) all result in the same compiler crash. Previous
> discussion on this list implies this can only be solved by a `mvn clean`
> [4].
> - `mvn '-Dsuites=*BroadcastSuite*' test` [5]: immediately post-clean,
> BroadcastSuite can't run because assembly is not built.
> - `./dev/run-tests` again [6]: pyspark tests fail, some messages about
> version mismatches and python 2.6. The machine this ran on has python 2.7,
> so I don't know what that's about.
> - `./dev/run-tests` again [7]: "too many open files" errors in several
> tests. `ulimit -a` shows a maximum of 4864 open files. Apparently this is
> not enough, but only some of the time? I increased it to 8192 and tried
> again.
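Checking and raising the limit for the current shell looks roughly like this (a sketch: 8192 is just the value tried above, and a higher default usually has to be persisted in /etc/security/limits.conf or the shell profile to survive new sessions):

```shell
# Show the current soft limit on open file descriptors for this shell
ulimit -Sn

# The hard limit is the ceiling the soft limit may be raised to
ulimit -Hn

# Raise the soft limit for this shell session only; fall back to the
# hard limit if 8192 exceeds it
ulimit -Sn 8192 2>/dev/null || ulimit -Sn "$(ulimit -Hn)"

# Confirm the new value before re-running the tests
ulimit -Sn
```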
> - `./dev/run-tests` again [8]: same pyspark errors as before. This seems to
> be the issue from SPARK-3867 [9], which was supposedly fixed on October 14;
> not sure how I'm seeing it now. In any case, switched to Python 2.6 and
> installed unittest2, and python/run-tests seems to be unblocked.
> - `./dev/run-tests` again [10]: finally passes!
> 
> This was on a spark checkout at ceb6281 (ToT Friday), with a few trivial
> changes added on (that I wanted to test before sending out a PR), on a
> macbook running OSX Yosemite (10.10.1), java 1.8 and mvn 3.2.3 [11].
> 
> Meanwhile, on a linux 2.6.32 / CentOS 6.4 machine, I tried similar commands
> from the same repo state:
> 
> - `./dev/run-tests` [12]: YarnClusterSuite failure.
> - `./dev/run-tests` [13]: same YarnClusterSuite failure. I know I've seen
> this one before on this machine and am guessing it actually occurs every
> time.
> - `./dev/run-tests` [14]: to be sure, I reverted my changes, ran one more
> time from ceb6281, and saw the same failure.
> 
> This was with java 1.7 and maven 3.2.3 [15]. In one final attempt to narrow
> down the linux YarnClusterSuite failure, I ran `./dev/run-tests` on my mac,
> from ceb6281, with java 1.7 (instead of 1.8, which the previous runs used),
> and it passed [16], so the failure seems specific to my linux machine/arch.
> 
> At this point I believe that my changes don't break any tests (the
> YarnClusterSuite failure on my linux presumably not being... "real"), and I
> am ready to send out a PR. Whew!
> 
> However, reflecting on the 5 or 6 distinct failure-modes represented above:
> 
> - One of them (too many open files) is something I can (and did,
> hopefully) fix once and for all. It cost me an ~hour this time (approximate
> time of running ./dev/run-tests) and a few hours other times when I didn't
> fully understand/fix it. It doesn't happen deterministically (why?), but
> does happen somewhat frequently to people, having been discussed on the
> user list multiple times [17] and on SO [18]. Maybe some note in the
> documentation advising people to check their ulimit makes sense?
> - One of them (unittest2 must be installed for python 2.6) was supposedly
> fixed upstream of the commits I tested here; I don't know why I'm still
> running into it. This cost me a few hours of running `./dev/run-tests`
> multiple times to see if it was transient, plus some time researching and
> working around it.
> - The original BroadcastSuite failure cost me a few hours and went away
> before I'd even run `mvn clean`.
> - A new incarnation of the sbt-compiler-crash phenomenon cost me a few
> hours of running `./dev/run-tests` in different ways before deciding that,
> as usual, there was no way around it and that I'd need to run `mvn clean`
> and start running tests from scratch.
> - The YarnClusterSuite failures on my linux box have cost me hours of
> trying to figure out whether they're my fault. I've seen them many times
> over the past weeks/months, plus or minus other failures that have come and
> gone, and was especially befuddled by them when I was seeing a disjoint set
> of reproducible failures on my mac [19] (the triaging of which involved
> dozens of runs of `./dev/run-tests`).
> 
> While I'm interested in digging into each of these issues, I also want to
> discuss the frequency with which I've run into issues like these. This is
> unfortunately not the first time in recent months that I've spent days
> playing spurious-test-failure whack-a-mole with a 60-90min dev/run-tests
> iteration time, which is no fun! So I am wondering/thinking:
> 
> - Do other people experience this level of flakiness from spark tests?
> - Do other people bother running dev/run-tests locally, or just let Jenkins
> do it during the CR process?
> - Needing to run a full assembly post-clean just to continue running one
> specific test case feels especially wasteful, and the failure output when
> naively attempting to run a specific test without having built an assembly
> jar is not always clear about what the issue is or how to fix it; even the
> fact that certain tests require "building the world" is not something I
> would have expected, and has cost me hours of confusion.
>    - Should a person running spark tests assume that they must build an
> assembly JAR before running anything?
>    - Are there some proper "unit" tests that are actually self-contained /
> able to be run without building an assembly jar?
>    - Can we better document/demarcate which tests have which dependencies?
>    - Is there something finer-grained than building an assembly JAR that
> is sufficient in some cases?
>        - If so, can we document that?
>        - If not, can we move to a world of finer-grained dependencies for
> some of these?
> - Leaving all of these spurious failures aside, the process of assembling
> and testing a new JAR is not a quick one (40 and 60 mins for me typically,
> respectively). I would guess that there are dozens (hundreds?) of people
> who build a Spark assembly from various ToTs on any given day, and who all
> wait on the exact same compilation / assembly steps to occur. Expanding on
> the recent work to publish nightly snapshots [20], can we do a better job
> caching/sharing compilation artifacts at a more granular level (pre-built
> assembly JARs at each SHA? pre-built JARs per-maven-module, per-SHA? more
> granular maven modules, plus the previous two?), or otherwise save some of
> the considerable amount of redundant compilation work that I had to do over
> the course of my odyssey this weekend?
> 
> Ramping up on most projects involves some amount of supplementing the
> documentation with trial and error to figure out what to run, which
> "errors" are real errors and which can be ignored, etc., but navigating
> that minefield on Spark has proved especially challenging and
> time-consuming for me. Some of that comes directly from scala's relatively
> slow compilation times and immature build-tooling ecosystem, but that is
> the world we live in and it would be nice if Spark took the alleviation of
> the resulting pain more seriously, as one of the more interesting and
> well-known large scala projects around right now. The official
> documentation around how to build different subsets of the codebase is
> somewhat sparse [21], and there have been many mixed [22] accounts [23] on
> this mailing list about preferred ways to build on mvn vs. sbt (none of
> which has made it into official documentation, as far as I've seen).
> Expecting new contributors to piece together all of this received
> folk-wisdom about how to build/test in a sane way by trawling mailing list
> archives seems suboptimal.
> 
> Thanks for reading, looking forward to hearing your ideas!
> 
> -Ryan
> 
> P.S. Is "best practice" for emailing this list to not incorporate any HTML
> in the body? It seems like all of the archives I've seen strip it out, but
> other people have used it and gmail displays it.
> 
> 
> [1]
> https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/484c2fb8bc0efa0e39d142087eefa9c3d5292ea3/dev%20run-tests:%20fail
> (57 mins)
> [2]
> https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/ce264e469be3641f061eabd10beb1d71ac243991/mvn%20test:%20fail
> (6 mins)
> [3]
> https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/6bc76c67aeef9c57ddd9fb2ba260fb4189dbb927/mvn%20test%20case:%20pass%20test,%20fail%20subsequent%20compile
> (4 mins)
> [4]
> https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0CCUQFjAB&url=http%3A%2F%2Fapache-spark-user-list.1001560.n3.nabble.com%2Fscalac-crash-when-compiling-DataTypeConversions-scala-td17083.html&ei=aRF6VJrpNKr-iAKDgYGYBQ&usg=AFQjCNHjM9m__Hrumh-ecOsSE00-JkjKBQ&sig2=zDeSqOgs02AXJXj78w5I9g&bvm=bv.80642063,d.cGE&cad=rja
> [5]
> https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/4ab0bd6e76d9fc5745eb4b45cdf13195d10efaa2/mvn%20test,%20post%20clean,%20need%20dependencies%20built
> [6]
> https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/f4c7e6fc8c301f869b00598c7b541dac243fb51e/dev%20run-tests,%20post%20clean
> (50 mins)
> [7]
> https://gist.github.com/ryan-williams/57f8bfc9328447fc5b97#file-dev-run-tests-failure-too-many-files-open-then-hang-L5260
> (1hr)
> [8] https://gist.github.com/ryan-williams/d0164194ad5de03f6e3f (1hr)
> [9] https://issues.apache.org/jira/browse/SPARK-3867
> [10] https://gist.github.com/ryan-williams/735adf543124c99647cc
> [11] https://gist.github.com/ryan-williams/8d149bbcd0c6689ad564
> [12]
> https://gist.github.com/ryan-williams/07df5c583c9481fe1c14#file-gistfile1-txt-L853
> (~90 mins)
> [13]
> https://gist.github.com/ryan-williams/718f6324af358819b496#file-gistfile1-txt-L852
> (91 mins)
> [14]
> https://gist.github.com/ryan-williams/c06c1f4aa0b16f160965#file-gistfile1-txt-L854
> [15] https://gist.github.com/ryan-williams/f8d410b5b9f082039c73
> [16] https://gist.github.com/ryan-williams/2e94f55c9287938cf745
> [17]
> http://apache-spark-user-list.1001560.n3.nabble.com/quot-Too-many-open-files-quot-exception-on-reduceByKey-td2462.html
> [18]
> http://stackoverflow.com/questions/25707629/why-does-spark-job-fail-with-too-many-open-files
> [19] https://issues.apache.org/jira/browse/SPARK-4002
> [20] https://issues.apache.org/jira/browse/SPARK-4542
> [21]
> https://spark.apache.org/docs/latest/building-with-maven.html#spark-tests-in-maven
> [22] https://www.mail-archive.com/dev@spark.apache.org/msg06443.html
> [23]
> http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3CCAOhmDzeUNhuCr41B7KRPTEwMn4cga_2TNpZrWqQB8REekokxzg@mail.gmail.com%3E


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: Spurious test failures, testing best practices

Posted by "York, Brennon" <Br...@capitalone.com>.
+1, you aren't alone in this. I certainly would like some clarity on these
things as well, but, as has been said on this listserv a few times (and you
noted), most developers use `sbt` for their day-to-day compilations to
greatly speed up the iterative testing process. I personally use `sbt` for
all builds until I'm ready to submit a PR and *then* run ./dev/run-tests
to ensure all the tests / code I've written still pass (i.e. nothing
breaks in the code I've changed or downstream). Sometimes, like you've
said, you still get errors from the ./dev/run-tests script, but, for me,
it comes down to where the errors originate and whether I'm confident
that the code I wrote caused them; that's what decides whether I submit
the PR.

Again, not a great answer, and hopefully others can shed more light, but
that's my 2c on the problem.

On 11/30/14, 5:39 PM, "Ryan Williams" <ry...@gmail.com>
wrote:

>In the course of trying to make contributions to Spark, I have had a lot
>of
>trouble running Spark's tests successfully. The main pain points I've
>experienced are:
>
>    1) frequent, spurious test failures
>    2) high latency of running tests
>    3) difficulty running specific tests in an iterative fashion
>
>Here is an example series of failures that I encountered this weekend
>(along with footnote links to the console output from each and
>approximately how long each took):
>
>- `./dev/run-tests` [1]: failure in BroadcastSuite that I've not seen
>before.
>- `mvn '-Dsuites=*BroadcastSuite*' test` [2]: same failure.
>- `mvn '-Dsuites=*BroadcastSuite* Unpersisting' test` [3]: BroadcastSuite
>passed, but scala compiler crashed on the "catalyst" project.
>- `mvn clean`: some attempts to run earlier commands (that previously
>didn't crash the compiler) all result in the same compiler crash. Previous
>discussion on this list implies this can only be solved by a `mvn clean`
>[4].
>- `mvn '-Dsuites=*BroadcastSuite*' test` [5]: immediately post-clean,
>BroadcastSuite can't run because assembly is not built.
>- `./dev/run-tests` again [6]: pyspark tests fail, some messages about
>version mismatches and python 2.6. The machine this ran on has python 2.7,
>so I don't know what that's about.
>- `./dev/run-tests` again [7]: "too many open files" errors in several
>tests. `ulimit -a` shows a maximum of 4864 open files. Apparently this is
>not enough, but only some of the time? I increased it to 8192 and tried
>again.
>- `./dev/run-tests` again [8]: same pyspark errors as before. This seems
>to
>be the issue from SPARK-3867 [9], which was supposedly fixed on October
>14;
>not sure how I'm seeing it now. In any case, switched to Python 2.6 and
>installed unittest2, and python/run-tests seems to be unblocked.
>- `./dev/run-tests` again [10]: finally passes!
>
>This was on a spark checkout at ceb6281 (ToT Friday), with a few trivial
>changes added on (that I wanted to test before sending out a PR), on a
>macbook running OSX Yosemite (10.10.1), java 1.8 and mvn 3.2.3 [11].
>
>Meanwhile, on a linux 2.6.32 / CentOS 6.4 machine, I tried similar
>commands
>from the same repo state:
>
>- `./dev/run-tests` [12]: YarnClusterSuite failure.
>- `./dev/run-tests` [13]: same YarnClusterSuite failure. I know I've seen
>this one before on this machine and am guessing it actually occurs every
>time.
>- `./dev/run-tests` [14]: to be sure, I reverted my changes, ran one more
>time from ceb6281, and saw the same failure.
>
>This was with java 1.7 and maven 3.2.3 [15]. In one final attempt to
>narrow
>down the linux YarnClusterSuite failure, I ran `./dev/run-tests` on my
>mac,
>from ceb6281, with java 1.7 (instead of 1.8, which the previous runs
>used),
>and it passed [16], so the failure seems specific to my linux
>machine/arch.
>
>At this point I believe that my changes don't break any tests (the
>YarnClusterSuite failure on my linux presumably not being... "real"), and
>I
>am ready to send out a PR. Whew!
>
>However, reflecting on the 5 or 6 distinct failure-modes represented
>above:
>
>- One of them (too many files open), is something I can (and did,
>hopefully) fix once and for all. It cost me an ~hour this time
>(approximate
>time of running ./dev/run-tests) and a few hours other times when I didn't
>fully understand/fix it. It doesn't happen deterministically (why?), but
>does happen somewhat frequently to people, having been discussed on the
>user list multiple times [17] and on SO [18]. Maybe some note in the
>documentation advising people to check their ulimit makes sense?
>- One of them (unittest2 must be installed for python 2.6) was supposedly
>fixed upstream of the commits I tested here; I don't know why I'm still
>running into it. This cost me a few hours of running `./dev/run-tests`
>multiple times to see if it was transient, plus some time researching and
>working around it.
>- The original BroadcastSuite failure cost me a few hours and went away
>before I'd even run `mvn clean`.
>- A new incarnation of the sbt-compiler-crash phenomenon cost me a few
>hours of running `./dev/run-tests` in different ways before deciding that,
>as usual, there was no way around it and that I'd need to run `mvn clean`
>and start running tests from scratch.
>- The YarnClusterSuite failures on my linux box have cost me hours of
>trying to figure out whether they're my fault. I've seen them many times
>over the past weeks/months, plus or minus other failures that have come
>and
>gone, and was especially befuddled by them when I was seeing a disjoint
>set
>of reproducible failures on my mac [19] (the triaging of which involved
>dozens of runs of `./dev/run-tests`).
>
>While I'm interested in digging into each of these issues, I also want to
>discuss the frequency with which I've run into issues like these. This is
>unfortunately not the first time in recent months that I've spent days
>playing spurious-test-failure whack-a-mole with a 60-90min dev/run-tests
>iteration time, which is no fun! So I am wondering/thinking:
>
>- Do other people experience this level of flakiness from spark tests?
>- Do other people bother running dev/run-tests locally, or just let
>Jenkins
>do it during the CR process?
>- Needing to run a full assembly build post-clean just to continue running
>one specific test case feels especially wasteful, and the failure output
>when naively attempting to run a specific test without having built an
>assembly jar is not always clear about what the issue is or how to fix it;
>even the fact that certain tests require "building the world" is not
>something I would have expected, and it has cost me hours of confusion.
>    - Should a person running Spark tests assume that they must build an
>assembly JAR before running anything?
>    - Are there some proper "unit" tests that are actually self-contained /
>able to be run without building an assembly jar?
>    - Can we better document/demarcate which tests have which
>dependencies?
>    - Is there something finer-grained than building an assembly JAR that
>is sufficient in some cases?
>        - If so, can we document that?
>        - If not, can we move to a world of finer-grained dependencies for
>some of these?
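(To make the workflow concrete: below is roughly the loop I've converged
on by trial and error, not documented practice; the module name and flag
spellings are my best guesses and may not be the optimal incantation.)

```shell
# Build the assembly once, skipping tests, so that suites which launch
# executors against the assembly jar (e.g. BroadcastSuite) can find it.
mvn -DskipTests clean package

# Then iterate on a single suite: -pl restricts maven to one module, and
# -Dsuites is a ScalaTest name filter. Whether this actually avoids
# rebuilding the world depends on the module's dependency graph.
mvn -pl core -Dsuites='*BroadcastSuite*' test
```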
>- Leaving all of these spurious failures aside, the process of assembling
>and testing a new JAR is not a quick one (typically ~40 and ~60 mins for
>me, respectively). I would guess that there are dozens (hundreds?) of
>people who build a Spark assembly from various ToTs on any given day, and
>who all wait on the exact same compilation / assembly steps to occur.
>Expanding on the recent work to publish nightly snapshots [20], can we do
>a better job of caching/sharing compilation artifacts at a more granular
>level (pre-built assembly JARs at each SHA? pre-built JARs per maven
>module, per SHA? more granular maven modules, plus the previous two?), or
>otherwise save some of the considerable amount of redundant compilation
>work that I had to do over the course of my odyssey this weekend?
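(To make the per-module/per-SHA idea concrete, here's a straw-man sketch
of the cache lookup it implies. Every name in it — the cache directory,
the key scheme, `get_or_build` — is made up for illustration; nothing like
this exists in Spark today.)

```python
import hashlib
import os
import shutil
import tempfile

# Hypothetical per-module, per-SHA artifact cache: a shared directory maps
# (module, git SHA) -> jar, so everyone building the same tree reuses one
# compilation instead of each redoing it locally.
CACHE_DIR = os.path.join(tempfile.gettempdir(), "spark-artifact-cache")


def cache_key(module: str, sha: str) -> str:
    """Content-address an artifact by its (module, git SHA) pair."""
    return hashlib.sha256(f"{module}@{sha}".encode()).hexdigest()


def get_or_build(module: str, sha: str, build_fn) -> str:
    """Return the cached jar path for (module, sha); build and cache on miss.

    build_fn(module) stands in for shelling out to the real build (e.g.
    `mvn -pl <module> package`) and returning the path of the jar produced.
    """
    path = os.path.join(CACHE_DIR, cache_key(module, sha) + ".jar")
    if os.path.exists(path):
        return path  # cache hit: the redundant compile is skipped entirely
    os.makedirs(CACHE_DIR, exist_ok=True)
    shutil.copy(build_fn(module), path)
    return path
```

Backed by shared object storage instead of a local directory, this is the
sort of thing that could let everyone building the same SHA wait on one
compile instead of dozens.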
>
>Ramping up on most projects involves some amount of supplementing the
>documentation with trial and error to figure out what to run, which
>"errors" are real errors and which can be ignored, etc., but navigating
>that minefield on Spark has proved especially challenging and
>time-consuming for me. Some of that comes directly from Scala's relatively
>slow compilation times and immature build-tooling ecosystem, but that is
>the world we live in, and as one of the more interesting and well-known
>large Scala projects around right now, it would be nice if Spark took
>alleviating the resulting pain more seriously. The official documentation
>on building subsets of the codebase is somewhat sparse [21], and there
>have been many mixed [22] accounts [23] on this mailing list about
>preferred ways to build with mvn vs. sbt (none of which has made it into
>official documentation, as far as I've seen). Expecting new contributors
>to piece together all of this received folk-wisdom about how to build/test
>in a sane way by trawling mailing-list archives seems suboptimal.
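(For reference, the mvn/sbt correspondences I've pieced together from
those threads — the sbt spellings are from memory and may well be stale,
which is rather the point about undocumented folk-wisdom.)

```shell
# Full assembly build:
mvn -DskipTests package           # maven
sbt/sbt assembly                  # sbt

# Run a single suite:
mvn -Dsuites='*FooSuite*' test    # maven (ScalaTest name filter)
sbt/sbt "test-only *FooSuite"     # sbt
```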
>
>Thanks for reading, looking forward to hearing your ideas!
>
>-Ryan
>
>P.S. Is "best practice" for emailing this list to not incorporate any HTML
>in the body? It seems like all of the archives I've seen strip it out, but
>other people have used it and gmail displays it.
>
>
>[1]
>https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/484c2fb8bc0efa0e39d142087eefa9c3d5292ea3/dev%20run-tests:%20fail
>(57 mins)
>[2]
>https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/ce264e469be3641f061eabd10beb1d71ac243991/mvn%20test:%20fail
>(6 mins)
>[3]
>https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/6bc76c67aeef9c57ddd9fb2ba260fb4189dbb927/mvn%20test%20case:%20pass%20test,%20fail%20subsequent%20compile
>(4 mins)
>[4]
>http://apache-spark-user-list.1001560.n3.nabble.com/scalac-crash-when-compiling-DataTypeConversions-scala-td17083.html
>[5]
>https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/4ab0bd6e76d9fc5745eb4b45cdf13195d10efaa2/mvn%20test,%20post%20clean,%20need%20dependencies%20built
>[6]
>https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/f4c7e6fc8c301f869b00598c7b541dac243fb51e/dev%20run-tests,%20post%20clean
>(50 mins)
>[7]
>https://gist.github.com/ryan-williams/57f8bfc9328447fc5b97#file-dev-run-tests-failure-too-many-files-open-then-hang-L5260
>(1hr)
>[8] https://gist.github.com/ryan-williams/d0164194ad5de03f6e3f (1hr)
>[9] https://issues.apache.org/jira/browse/SPARK-3867
>[10] https://gist.github.com/ryan-williams/735adf543124c99647cc
>[11] https://gist.github.com/ryan-williams/8d149bbcd0c6689ad564
>[12]
>https://gist.github.com/ryan-williams/07df5c583c9481fe1c14#file-gistfile1-txt-L853
>(~90 mins)
>[13]
>https://gist.github.com/ryan-williams/718f6324af358819b496#file-gistfile1-txt-L852
>(91 mins)
>[14]
>https://gist.github.com/ryan-williams/c06c1f4aa0b16f160965#file-gistfile1-txt-L854
>[15] https://gist.github.com/ryan-williams/f8d410b5b9f082039c73
>[16] https://gist.github.com/ryan-williams/2e94f55c9287938cf745
>[17]
>http://apache-spark-user-list.1001560.n3.nabble.com/quot-Too-many-open-files-quot-exception-on-reduceByKey-td2462.html
>[18]
>http://stackoverflow.com/questions/25707629/why-does-spark-job-fail-with-too-many-open-files
>[19] https://issues.apache.org/jira/browse/SPARK-4002
>[20] https://issues.apache.org/jira/browse/SPARK-4542
>[21]
>https://spark.apache.org/docs/latest/building-with-maven.html#spark-tests-in-maven
>[22] https://www.mail-archive.com/dev@spark.apache.org/msg06443.html
>[23]
>http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3CCAOhmDzeUNhuCr41B7KRPTEwMn4cga_2TNpZrWqQB8REekokxzg@mail.gmail.com%3E


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org