Posted to dev@spark.apache.org by "York, Brennon" <Br...@capitalone.com> on 2014/12/01 00:03:05 UTC
Re: Spurious test failures, testing best practices

+1, you aren't alone in this. I certainly would like some clarity on these
things as well, but, as it's been said on this listserv a few times (and you
noted), most developers use `sbt` for their day-to-day compilations to
greatly speed up the iterative testing process. I personally use `sbt` for
all builds until I'm ready to submit a PR and *then* run ./dev/run-tests
to ensure all the tests / code I've written still pass (i.e. nothing
breaks in the code I've changed or downstream). Sometimes, like you've
said, you still get errors with the ./dev/run-tests script, but, for me,
whether I submit the PR comes down to where the errors originate and
whether I'm confident that the code I wrote caused them.

Again, not a great answer, and I'm hoping others can shed more light, but
that's my 2c on the problem.

On 11/30/14, 5:39 PM, "Ryan Williams" <ry...@gmail.com> wrote:

>In the course of trying to make contributions to Spark, I have had a lot of
>trouble running Spark's tests successfully. The main pain points I've
>experienced are:
>
>    1) frequent, spurious test failures
>    2) high latency of running tests
>    3) difficulty running specific tests in an iterative fashion
>
>Here is an example series of failures that I encountered this weekend
>(along with footnote links to the console output from each and
>approximately how long each took):
>
>- `./dev/run-tests` [1]: failure in BroadcastSuite that I've not seen
>before.
>- `mvn '-Dsuites=*BroadcastSuite*' test` [2]: same failure.
>- `mvn '-Dsuites=*BroadcastSuite* Unpersisting' test` [3]: BroadcastSuite
>passed, but scala compiler crashed on the "catalyst" project.
>- `mvn clean`: some attempts to run earlier commands (that previously
>didn't crash the compiler) all result in the same compiler crash. Previous
>discussion on this list implies this can only be solved by a `mvn clean`
>[4].
>- `mvn '-Dsuites=*BroadcastSuite*' test` [5]: immediately post-clean,
>BroadcastSuite can't run because assembly is not built.
>- `./dev/run-tests` again [6]: pyspark tests fail, some messages about
>version mismatches and python 2.6. The machine this ran on has python 2.7,
>so I don't know what that's about.
>- `./dev/run-tests` again [7]: "too many open files" errors in several
>tests. `ulimit -a` shows a maximum of 4864 open files. Apparently this is
>not enough, but only some of the time? I increased it to 8192 and tried
>again.
>- `./dev/run-tests` again [8]: same pyspark errors as before. This seems to
>be the issue from SPARK-3867 [9], which was supposedly fixed on October 14;
>not sure how I'm seeing it now. In any case, switched to Python 2.6 and
>installed unittest2, and python/run-tests seems to be unblocked.
>- `./dev/run-tests` again [10]: finally passes!
>
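
On the "too many open files" item above: a minimal Python sketch for checking the open-file limit before kicking off a long test run. It only uses the standard `resource` module (Unix only). The 8192 threshold is just the value mentioned above, not a documented Spark requirement, and `check_open_file_limit` is a hypothetical helper name.

```python
import resource

# Assumption: 8192 is only the value tried in the message above,
# not an official minimum for running the Spark tests.
SUGGESTED_MIN_OPEN_FILES = 8192


def check_open_file_limit(minimum=SUGGESTED_MIN_OPEN_FILES):
    """Return (soft, hard, ok) for this process's open-file limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # RLIM_INFINITY means "no limit", which always satisfies the minimum.
    ok = soft == resource.RLIM_INFINITY or soft >= minimum
    return soft, hard, ok


if __name__ == "__main__":
    soft, hard, ok = check_open_file_limit()
    print(f"open files: soft={soft}, hard={hard}")
    if not ok:
        print(f"soft limit is below {SUGGESTED_MIN_OPEN_FILES}; "
              f"consider `ulimit -n {SUGGESTED_MIN_OPEN_FILES}` in this shell first")
```

This only reports the limit inherited from the shell; it does not explain why the failure shows up on some runs and not others.
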
>This was on a spark checkout at ceb6281 (ToT Friday), with a few trivial
>changes added on (that I wanted to test before sending out a PR), on a
>macbook running OSX Yosemite (10.10.1), java 1.8 and mvn 3.2.3 [11].
>
>Meanwhile, on a linux 2.6.32 / CentOS 6.4 machine, I tried similar commands
>from the same repo state:
>
>- `./dev/run-tests` [12]: YarnClusterSuite failure.
>- `./dev/run-tests` [13]: same YarnClusterSuite failure. I know I've seen
>this one before on this machine and am guessing it actually occurs every
>time.
>- `./dev/run-tests` [14]: to be sure, I reverted my changes, ran one more
>time from ceb6281, and saw the same failure.
>
>This was with java 1.7 and maven 3.2.3 [15]. In one final attempt to narrow
>down the linux YarnClusterSuite failure, I ran `./dev/run-tests` on my mac,
>from ceb6281, with java 1.7 (instead of 1.8, which the previous runs used),
>and it passed [16], so the failure seems specific to my linux machine/arch.
>
>At this point I believe that my changes don't break any tests (the
>YarnClusterSuite failure on my linux presumably not being... "real"), and I
>am ready to send out a PR. Whew!
>
>However, reflecting on the 5 or 6 distinct failure-modes represented above:
>
>- One of them (too many files open) is something I can (and did,
>hopefully) fix once and for all. It cost me an ~hour this time (approximate
>time of running ./dev/run-tests) and a few hours other times when I didn't
>fully understand/fix it. It doesn't happen deterministically (why?), but
>does happen somewhat frequently to people, having been discussed on the
>user list multiple times [17] and on SO [18]. Maybe some note in the
>documentation advising people to check their ulimit makes sense?
>- One of them (unittest2 must be installed for python 2.6) was supposedly
>fixed upstream of the commits I tested here; I don't know why I'm still
>running into it. This cost me a few hours of running `./dev/run-tests`
>multiple times to see if it was transient, plus some time researching and
>working around it.
>- The original BroadcastSuite failure cost me a few hours and went away
>before I'd even run `mvn clean`.
>- A new incarnation of the sbt-compiler-crash phenomenon cost me a few
>hours of running `./dev/run-tests` in different ways before deciding that,
>as usual, there was no way around it and that I'd need to run `mvn clean`
>and start running tests from scratch.
>- The YarnClusterSuite failures on my linux box have cost me hours of
>trying to figure out whether they're my fault. I've seen them many times
>over the past weeks/months, plus or minus other failures that have come and
>gone, and was especially befuddled by them when I was seeing a disjoint set
>of reproducible failures on my mac [19] (the triaging of which involved
>dozens of runs of `./dev/run-tests`).
>
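
For the "is it transient?" question that comes up in several of the items above, here is a rough Python sketch that reruns a command a few times and labels the outcome. `rerun` and `classify` are hypothetical helper names, and the default command is simply the `./dev/run-tests` script mentioned above. Given how long a full run takes, this is realistically only practical with a narrower command passed on the command line.

```python
import subprocess
import sys


def rerun(command, attempts=3):
    """Run `command` (a list of arguments) several times; return the exit codes."""
    exit_codes = []
    for _ in range(attempts):
        # Output is discarded to keep the sketch short; a real triage run
        # would want to save the logs of each attempt.
        result = subprocess.run(
            command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        exit_codes.append(result.returncode)
    return exit_codes


def classify(exit_codes):
    """Label a series of exit codes, treating any non-zero code as a failure."""
    failures = sum(1 for code in exit_codes if code != 0)
    if failures == 0:
        return "passes"
    if failures == len(exit_codes):
        return "fails consistently"
    return "flaky"


if __name__ == "__main__":
    # Default to the script discussed above; pass another command to override.
    command = sys.argv[1:] or ["./dev/run-tests"]
    print(classify(rerun(command)))
```

A "fails consistently" result across a handful of runs still doesn't say whose fault the failure is; it only separates transient failures from reproducible ones.
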
>While I'm interested in digging into each of these issues, I also want to
>discuss the frequency with which I've run into issues like these. This is
>unfortunately not the first time in recent months that I've spent days
>playing spurious-test-failure whack-a-mole with a 60-90min dev/run-tests
>iteration time, which is no fun! So I am wondering/thinking:
>
>- Do other people experience this level of flakiness from spark tests?
>- Do other people bother running dev/run-tests locally, or just let Jenkins
>do it during the CR process?
>- Needing to run a full assembly post-clean just to continue running one
>specific test case feels especially wasteful, and the failure output when
>naively attempting to run a specific test without having built an assembly
>jar is not always clear about what the issue is or how to fix it; even the
>fact that certain tests require "building the world" is not something I
>would have expected, and has cost me hours of confusion.
>    - Should a person running spark tests assume that they must build an
>assembly JAR before running anything?
>    - Are there some proper "unit" tests that are actually self-contained /
>able to be run without building an assembly jar?
>    - Can we better document/demarcate which tests have which dependencies?
>    - Is there something finer-grained than building an assembly JAR that
>is sufficient in some cases?
>        - If so, can we document that?
>        - If not, can we move to a world of finer-grained dependencies for
>some of these?
>- Leaving all of these spurious failures aside, the process of assembling
>and testing a new JAR is not a quick one (40 and 60 mins for me typically,
>respectively). I would guess that there are dozens (hundreds?) of people
>who build a Spark assembly from various ToTs on any given day, and who all
>wait on the exact same compilation / assembly steps to occur. Expanding on
>the recent work to publish nightly snapshots [20], can we do a better job
>caching/sharing compilation artifacts at a more granular level (pre-built
>assembly JARs at each SHA? pre-built JARs per-maven-module, per-SHA? more
>granular maven modules, plus the previous two?), or otherwise save some of
>the considerable amount of redundant compilation work that I had to do over
>the course of my odyssey this weekend?
>
>Ramping up on most projects involves some amount of supplementing the
>documentation with trial and error to figure out what to run, which
>"errors" are real errors and which can be ignored, etc., but navigating
>that minefield on Spark has proved especially challenging and
>time-consuming for me. Some of that comes directly from scala's relatively
>slow compilation times and immature build-tooling ecosystem, but that is
>the world we live in and it would be nice if Spark took the alleviation of
>the resulting pain more seriously, as one of the more interesting and
>well-known large scala projects around right now. The official
>documentation around how to build different subsets of the codebase is
>somewhat sparse [21], and there have been many mixed [22] accounts [23] on
>this mailing list about preferred ways to build on mvn vs. sbt (none of
>which has made it into official documentation, as far as I've seen).
>Expecting new contributors to piece together all of this received
>folk-wisdom about how to build/test in a sane way by trawling mailing list
>archives seems suboptimal.
>
>Thanks for reading, looking forward to hearing your ideas!
>
>-Ryan
>
>P.S. Is "best practice" for emailing this list to not incorporate any HTML
>in the body? It seems like all of the archives I've seen strip it out, but
>other people have used it and gmail displays it.
>
>
>[1] https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/484c2fb8bc0efa0e39d142087eefa9c3d5292ea3/dev%20run-tests:%20fail (57 mins)
>[2] https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/ce264e469be3641f061eabd10beb1d71ac243991/mvn%20test:%20fail (6 mins)
>[3] https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/6bc76c67aeef9c57ddd9fb2ba260fb4189dbb927/mvn%20test%20case:%20pass%20test,%20fail%20subsequent%20compile (4 mins)
>[4] https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0CCUQFjAB&url=http%3A%2F%2Fapache-spark-user-list.1001560.n3.nabble.com%2Fscalac-crash-when-compiling-DataTypeConversions-scala-td17083.html&ei=aRF6VJrpNKr-iAKDgYGYBQ&usg=AFQjCNHjM9m__Hrumh-ecOsSE00-JkjKBQ&sig2=zDeSqOgs02AXJXj78w5I9g&bvm=bv.80642063,d.cGE&cad=rja
>[5] https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/4ab0bd6e76d9fc5745eb4b45cdf13195d10efaa2/mvn%20test,%20post%20clean,%20need%20dependencies%20built
>[6] https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/f4c7e6fc8c301f869b00598c7b541dac243fb51e/dev%20run-tests,%20post%20clean (50 mins)
>[7] https://gist.github.com/ryan-williams/57f8bfc9328447fc5b97#file-dev-run-tests-failure-too-many-files-open-then-hang-L5260 (1hr)
>[8] https://gist.github.com/ryan-williams/d0164194ad5de03f6e3f (1hr)
>[9] https://issues.apache.org/jira/browse/SPARK-3867
>[10] https://gist.github.com/ryan-williams/735adf543124c99647cc
>[11] https://gist.github.com/ryan-williams/8d149bbcd0c6689ad564
>[12] https://gist.github.com/ryan-williams/07df5c583c9481fe1c14#file-gistfile1-txt-L853 (~90 mins)
>[13] https://gist.github.com/ryan-williams/718f6324af358819b496#file-gistfile1-txt-L852 (91 mins)
>[14] https://gist.github.com/ryan-williams/c06c1f4aa0b16f160965#file-gistfile1-txt-L854
>[15] https://gist.github.com/ryan-williams/f8d410b5b9f082039c73
>[16] https://gist.github.com/ryan-williams/2e94f55c9287938cf745
>[17] http://apache-spark-user-list.1001560.n3.nabble.com/quot-Too-many-open-files-quot-exception-on-reduceByKey-td2462.html
>[18] http://stackoverflow.com/questions/25707629/why-does-spark-job-fail-with-too-many-open-files
>[19] https://issues.apache.org/jira/browse/SPARK-4002
>[20] https://issues.apache.org/jira/browse/SPARK-4542
>[21] https://spark.apache.org/docs/latest/building-with-maven.html#spark-tests-in-maven
>[22] https://www.mail-archive.com/dev@spark.apache.org/msg06443.html
>[23] http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3CCAOhmDzeUNhuCr41B7KRPTEwMn4cga_2TNpZrWqQB8REekokxzg@mail.gmail.com%3E
