
Posted to dev@tinkerpop.apache.org by Marko Rodriguez <ok...@gmail.com> on 2016/04/05 16:32:18 UTC

Blade testing 3.2.0-SNAPSHOT (master/)

Hi,

Yesterday and this morning I manually tested TinkerPop 3.2.0-SNAPSHOT, ahead of Friday's release VOTE, on 4 Blades using Friendster (2.5 billion edges). I noticed that Spark 1.6.1 is fickle and Netty-based network errors occur "easily." When I dropped back down to 1.5.2, there were no errors. I think one of the problems is GC in Spark 1.6.1 when using MEMORY_XXX storage levels. With DISK_ONLY, the issues went away on the simple query g.V().count() (which only repartitions -- no message passing). In 1.5.2 you get GC stalls with MEMORY_XXX storage levels, but no [ERROR]s (and no stack traces w/ failed tasks). Next, I ran a more complex query -- g.V().out().out().count() -- and Spark 1.6.1 had failed tasks even with DISK_ONLY. Bummer. As a last check, I changed the proportion of SPARK_WORKER_INSTANCES to SPARK_WORKER_CORES from 4/6 to 6/4 and everything started to work again with Spark 1.6.1.
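For anyone wanting to try the same change, here is a rough sketch of the relevant conf/spark-env.sh lines for a Spark standalone cluster. It assumes that "6/4" reads as instances/cores, i.e. SPARK_WORKER_INSTANCES=6 and SPARK_WORKER_CORES=4 per machine; that reading is an interpretation of the message, not something it spells out, so adjust to your own hardware.

```shell
# conf/spark-env.sh (Spark standalone mode) -- illustrative values only.
# Assumption: "6/4" means 6 worker instances per machine, 4 cores each.
# The earlier "4/6" setup would then be INSTANCES=4 and CORES=6.
export SPARK_WORKER_INSTANCES=6   # worker JVMs started per machine
export SPARK_WORKER_CORES=4       # cores each worker may hand to executors
```

The workers have to be restarted for the new values to take effect.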

In short, the memory management and workers/core-ratio behavior in Spark 1.6.1 are "different" from Spark 1.5.2. I was able to get the same speeds on 1.6.1 as with 1.5.2; I just had to do things a little differently. In fact, 1.6.1 seems a bit faster -- a 55 minute job on 1.5.2 took 50 minutes on 1.6.1.

I think it is safe to release TinkerPop 3.2.0 with Spark 1.6.1, but we will just have to be ready to tell people to reduce the number of workers and to use DISK_ONLY if they are GC stalling a lot. Finally, with this testing, I ensured that our bump to Hadoop 2.7.2 didn't cause any problems. Moreover, I was able to confirm that a few knick-knack bugs around FileSystemStorage no longer exist.
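As a hedged illustration of the DISK_ONLY advice, the snippet below writes a one-line properties fragment that could be merged into the properties file used for SparkGraphComputer. The property name gremlin.spark.graphStorageLevel is taken from the TinkerPop reference docs as best I recall, so please verify it against the docs for your version. The output file name is hypothetical.

```shell
# Hypothetical scratch file; merge its contents into your own
# SparkGraphComputer properties file (e.g. a copy of hadoop-gryo.properties).
PROPS="${PROPS:-./spark-disk-only.properties}"

# Assumption: gremlin.spark.graphStorageLevel selects the Spark StorageLevel
# used for the loaded graph RDD. DISK_ONLY trades speed for less GC pressure.
cat > "$PROPS" <<'EOF'
gremlin.spark.graphStorageLevel=DISK_ONLY
EOF
```

DISK_ONLY is one of Spark's standard StorageLevel names. Dropping the line again goes back to whatever default the deployed version uses.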

Thanks,
Marko.

http://markorodriguez.com

Re: Blade testing 3.2.0-SNAPSHOT (master/)

Posted by Dylan Millikin <dy...@gmail.com>.

Cheers Marko, good work.
