Posted to user@spark.apache.org by Bharath Ravi Kumar <re...@gmail.com> on 2014/07/02 18:34:19 UTC

Re: Execution stalls in LogisticRegressionWithSGD

Hi Xiangrui,

The issue with aggregating/counting over large feature vectors (as part of
LogisticRegressionWithSGD) continues to exist, but now in another form:
while the execution doesn't freeze (due to SPARK-1112), it now fails at the
second or third gradient descent iteration consistently with an error level
log message, but no stacktrace. I'm running against 1.0.1-rc1, and have
tried setting spark.akka.frameSize as high as 500. When the execution
fails, each of the two executors logs the following message (corresponding
to aggregate at GradientDescent.scala:178):

14/07/02 14:09:09 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/07/02 14:09:09 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
Getting 2 non-empty blocks out of 2 blocks
14/07/02 14:09:09 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
Started 1 remote fetches in 0 ms
14/07/02 14:09:11 INFO Executor: Serialized size of result for 737 is
5959086
14/07/02 14:09:11 INFO Executor: Sending result for 737 directly to driver
14/07/02 14:09:11 INFO Executor: Finished task ID 737
14/07/02 14:09:18 ERROR CoarseGrainedExecutorBackend: Driver Disassociated
[akka.tcp://sparkExecutor@(slave1,slave2):51941] ->
[akka.tcp://spark@master:59487]
disassociated! Shutting down.


There is no separate stacktrace on the driver side.
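
For reference, the frame size was set programmatically along these lines (a
sketch; only the relevant settings are shown, and the master URL is
hypothetical):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    SparkConf conf = new SparkConf()
        .setMaster("spark://master:7077")     // hypothetical master URL
        .setAppName("LogRegExp")
        .set("spark.akka.frameSize", "500");  // in MB; tried values up to 500
    JavaSparkContext sc = new JavaSparkContext(conf);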

Each input record is of the form p1, p2, (p1,p2), X where p1, p2 & (p1,p2)
are categorical features with large cardinality, and X is the double label
with a continuous value. The categorical variables are converted to binary
variables, which results in a feature vector of size 741092 (composed of all
unique categories across p1, p2 and (p1,p2)). Thus, the labeled point for an
input record is a sparse vector of size 741092 with only 3 variables set in
the record. The total number of records is 683233 after aggregating the
input data on (p1, p2). When attempting to train on the unaggregated
records (1337907 in number, spread across 455 files), the execution fails at
count, GradientDescent.scala:161, with the following log:


(Snipped lines corresponding to other input files)
14/07/02 16:02:03 INFO HadoopRDD: Input split:
file:~/part-r-00012:2834590+2834590
14/07/02 16:02:03 INFO HadoopRDD: Input split: file:~/part-r-00005:0+2845559
14/07/02 16:02:03 INFO HadoopRDD: Input split:
file:~/part-r-00005:2845559+2845560
14/07/02 16:02:03 INFO Executor: Serialized size of result for 726 is 615
14/07/02 16:02:03 INFO Executor: Sending result for 726 directly to driver
14/07/02 16:02:03 INFO Executor: Finished task ID 726
14/07/02 16:02:12 ERROR CoarseGrainedExecutorBackend: Driver Disassociated
[akka.tcp://sparkExecutor@slave1:48423] -> [akka.tcp://spark@master:55792]
disassociated! Shutting down.
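
To make the encoding above concrete, each record becomes a LabeledPoint
roughly as follows (a sketch; the index lookups are hypothetical helpers
over the global dictionary of 741092 distinct categories):

    import org.apache.spark.mllib.linalg.Vectors;
    import org.apache.spark.mllib.regression.LabeledPoint;

    // p1Index, p2Index and pairIndex are the positions assigned to this
    // record's p1, p2 and (p1,p2) values in the global dictionary of
    // 741092 distinct categories (hypothetical lookups, not shown).
    int[] indices = {p1Index, p2Index, pairIndex}; // must be sorted ascending
    double[] values = {1.0, 1.0, 1.0};             // binary presence features
    LabeledPoint point =
        new LabeledPoint(label, Vectors.sparse(741092, indices, values));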

A count() attempted on the input RDD before beginning training has the
following metrics:

Metric                             Min      25th     Median   75th     Max
Result serialization time          0 ms     0 ms     0 ms     0 ms     0 ms
Duration                           33 s     33 s     35 s     35 s     35 s
Time spent fetching task results   0 ms     0 ms     0 ms     0 ms     0 ms
Scheduler delay                    0.1 s    0.1 s    0.3 s    0.3 s    0.3 s

Aggregated Metrics by Executor

ID   Address              Task Time   Total   Failed   Succeeded   Shuffle Read   Shuffle Write   Shuf Spill (Mem)   Shuf Spill (Disk)
0    CANNOT FIND ADDRESS  34 s        1       0        1           0.0 B          0.0 B           0.0 B              0.0 B
1    CANNOT FIND ADDRESS  36 s        1       0        1           0.0 B          0.0 B           0.0 B              0.0 B

Tasks

Task Index   Task ID   Status    Locality Level   Executor   Launch Time           Duration   GC Time   Result Ser Time   Errors
0            726       SUCCESS   PROCESS_LOCAL    slave1     2014/07/02 16:01:28   35 s       0.1 s
1            727       SUCCESS   PROCESS_LOCAL    slave2     2014/07/02 16:01:28   33 s       99 ms

Any pointers / diagnosis please?



On Thu, Jun 19, 2014 at 10:03 AM, Bharath Ravi Kumar <re...@gmail.com>
wrote:

> Thanks. I'll await the fix to re-run my test.
>
>
> On Thu, Jun 19, 2014 at 8:28 AM, Xiangrui Meng <me...@gmail.com> wrote:
>
>> Hi Bharath,
>>
>> This is related to SPARK-1112, for which we already found the root cause.
>> I will let you know when this is fixed.
>>
>> Best,
>> Xiangrui
>>
>> On Tue, Jun 17, 2014 at 7:37 PM, Bharath Ravi Kumar <re...@gmail.com>
>> wrote:
>> > Couple more points:
>> > 1) The inexplicable stalling of execution with large feature sets appears
>> > similar to that reported with the news-20 dataset:
>> >
>> http://mail-archives.apache.org/mod_mbox/spark-user/201406.mbox/%3C53A03542.1010605@gmail.com%3E
>> >
>> > 2) The NPE when calling mapToPair to convert an RDD<Long, Long, Integer,
>> > Integer> into a JavaPairRDD<Tuple2<Long,Long>, Tuple2<Integer,Integer>>
>> > is unrelated to mllib.
>> >
>> > Thanks,
>> > Bharath
>> >
>> >
>> >
>> > On Wed, Jun 18, 2014 at 7:14 AM, Bharath Ravi Kumar <
>> reachbach@gmail.com>
>> > wrote:
>> >>
>> >> Hi Xiangrui,
>> >>
>> >> I'm using 1.0.0.
>> >>
>> >> Thanks,
>> >> Bharath
>> >>
>> >> On 18-Jun-2014 1:43 am, "Xiangrui Meng" <me...@gmail.com> wrote:
>> >>>
>> >>> Hi Bharath,
>> >>>
>> >>> Thanks for posting the details! Which Spark version are you using?
>> >>>
>> >>> Best,
>> >>> Xiangrui
>> >>>
>> >>> On Tue, Jun 17, 2014 at 6:48 AM, Bharath Ravi Kumar <
>> reachbach@gmail.com>
>> >>> wrote:
>> >>> > Hi,
>> >>> >
>> >>> > (Apologies for the long mail, but it's necessary to provide
>> sufficient
>> >>> > details considering the number of issues faced.)
>> >>> >
>> >>> > I'm running into issues testing LogisticRegressionWithSGD on a two
>> >>> > node cluster (each node with 24 cores and 16G available to slaves out
>> >>> > of 24G on the system). Here's a description of the application:
>> >>> >
>> >>> > The model is being trained based on categorical features x, y, and
>> >>> > (x,y). The categorical features are mapped to binary features by
>> >>> > converting each distinct value in the category enum into a binary
>> >>> > feature by itself (i.e. presence of that value in a record implies the
>> >>> > corresponding feature = 1, else feature = 0. So, there'd be as many
>> >>> > distinct features as enum values). The training vector is laid out as
>> >>> > [x1,x2...xn,y1,y2....yn,(x1,y1),(x2,y2)...(xn,yn)]. Each record in the
>> >>> > training data has only one combination (Xk,Yk) and a label appearing
>> >>> > in the record. Thus, the corresponding labeledpoint sparse vector
>> >>> > would only have 3 values Xk, Yk, (Xk,Yk) set for a record. The total
>> >>> > length of the vector (though sparse) would be nearly 614000. The
>> >>> > number of records is about 1.33 million. The records have been
>> >>> > coalesced into 20 partitions across two nodes. The input data has not
>> >>> > been cached.
>> >>> > (NOTE: I do realize the records & features may seem large for a two
>> >>> > node setup, but given the memory & cpu, and the fact that I'm willing
>> >>> > to give up some turnaround time, I don't see why tasks should
>> >>> > inexplicably fail.)
>> >>> >
>> >>> > Additional parameters include:
>> >>> >
>> >>> > spark.executor.memory = 14G
>> >>> > spark.default.parallelism = 1
>> >>> > spark.cores.max=20
>> >>> > spark.storage.memoryFraction=0.8 //No cache space required
>> >>> > (Trying to set spark.akka.frameSize to a larger number, say, 20
>> >>> > didn't help either)
>> >>> >
>> >>> > The model training was initialized as: new LogisticRegressionWithSGD(1,
>> >>> > maxIterations, 0.0, 0.05)
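>> >>> >
>> >>> > (For reference, that constructor call corresponds to the static
>> >>> > helper below; a sketch, where trainingData is our JavaRDD<LabeledPoint>:)
>> >>> >
>> >>> > import org.apache.spark.mllib.classification.LogisticRegressionModel;
>> >>> > import org.apache.spark.mllib.classification.LogisticRegressionWithSGD;
>> >>> >
>> >>> > LogisticRegressionModel model = LogisticRegressionWithSGD.train(
>> >>> >     trainingData.rdd(),  // RDD<LabeledPoint>
>> >>> >     maxIterations,       // numIterations
>> >>> >     1.0,                 // stepSize
>> >>> >     0.05);               // miniBatchFraction (regParam stays 0.0)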
>> >>> >
>> >>> > However, after 4 iterations of gradient descent, the entire
>> execution
>> >>> > appeared to stall inexplicably. The corresponding executor details
>> and
>> >>> > details of the stalled stage (number 14) are as follows:
>> >>> >
>> >>> > Metric                             Min      25th     Median   75th     Max
>> >>> > Result serialization time          12 ms    13 ms    14 ms    16 ms    18 ms
>> >>> > Duration                           4 s      4 s      5 s      5 s      5 s
>> >>> > Time spent fetching task results   0 ms     0 ms     0 ms     0 ms     0 ms
>> >>> > Scheduler delay                    6 s      6 s      6 s      6 s      12 s
>> >>> >
>> >>> >
>> >>> > Stage Id
>> >>> > 14 aggregate at GradientDescent.scala:178
>> >>> >
>> >>> > Task Index  Task ID  Status   Locality Level  Executor                     Launch Time          Duration  GC Time  Result Ser Time  Errors
>> >>> > 0           600      RUNNING  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  1.1 h
>> >>> > 1           601      RUNNING  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  1.1 h
>> >>> > 2           602      RUNNING  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  1.1 h
>> >>> > 3           603      RUNNING  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  1.1 h
>> >>> > 4           604      RUNNING  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  1.1 h
>> >>> > 5           605      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  4 s       2 s      12 ms
>> >>> > 6           606      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  4 s       1 s      14 ms
>> >>> > 7           607      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  4 s       2 s      12 ms
>> >>> > 8           608      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  5 s       1 s      15 ms
>> >>> > 9           609      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  5 s       1 s      14 ms
>> >>> > 10          610      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  5 s       1 s      15 ms
>> >>> > 11          611      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  4 s       1 s      13 ms
>> >>> > 12          612      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  5 s       1 s      18 ms
>> >>> > 13          613      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  5 s       1 s      13 ms
>> >>> > 14          614      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  4 s       1 s      14 ms
>> >>> > 15          615      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  4 s       1 s      12 ms
>> >>> > 16          616      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  5 s       1 s      15 ms
>> >>> > 17          617      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  5 s       1 s      18 ms
>> >>> > 18          618      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  5 s       1 s      16 ms
>> >>> > 19          619      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  4 s       1 s      18 ms
>> >>> >
>> >>> > Executor stats:
>> >>> >
>> >>> > RDD Blocks  Memory Used     Disk Used  Active Tasks  Failed Tasks  Complete Tasks  Total Tasks  Task Time  Shuffle Read  Shuffle Write
>> >>> > 0           0.0 B / 6.7 GB  0.0 B      2             0             307             309          23.2 m     0.0 B         0.0 B
>> >>> > 0           0.0 B / 6.7 GB  0.0 B      3             0             308             311          22.4 m     0.0 B         0.0 B
>> >>> >
>> >>> >
>> >>> > Executor jmap output:
>> >>> >
>> >>> > Server compiler detected.
>> >>> > JVM version is 24.55-b03
>> >>> >
>> >>> > using thread-local object allocation.
>> >>> > Parallel GC with 18 thread(s)
>> >>> >
>> >>> > Heap Configuration:
>> >>> >    MinHeapFreeRatio = 40
>> >>> >    MaxHeapFreeRatio = 70
>> >>> >    MaxHeapSize      = 10737418240 (10240.0MB)
>> >>> >    NewSize          = 1310720 (1.25MB)
>> >>> >    MaxNewSize       = 17592186044415 MB
>> >>> >    OldSize          = 5439488 (5.1875MB)
>> >>> >    NewRatio         = 2
>> >>> >    SurvivorRatio    = 8
>> >>> >    PermSize         = 21757952 (20.75MB)
>> >>> >    MaxPermSize      = 134217728 (128.0MB)
>> >>> >    G1HeapRegionSize = 0 (0.0MB)
>> >>> >
>> >>> > Heap Usage:
>> >>> > PS Young Generation
>> >>> > Eden Space:
>> >>> >    capacity = 2783969280 (2655.0MB)
>> >>> >    used     = 192583816 (183.66223907470703MB)
>> >>> >    free     = 2591385464 (2471.337760925293MB)
>> >>> >    6.917598458557704% used
>> >>> > From Space:
>> >>> >    capacity = 409993216 (391.0MB)
>> >>> >    used     = 1179808 (1.125152587890625MB)
>> >>> >    free     = 408813408 (389.8748474121094MB)
>> >>> >    0.2877628102022059% used
>> >>> > To Space:
>> >>> >    capacity = 385351680 (367.5MB)
>> >>> >    used     = 0 (0.0MB)
>> >>> >    free     = 385351680 (367.5MB)
>> >>> >    0.0% used
>> >>> > PS Old Generation
>> >>> >    capacity = 7158628352 (6827.0MB)
>> >>> >    used     = 4455093024 (4248.707794189453MB)
>> >>> >    free     = 2703535328 (2578.292205810547MB)
>> >>> >    62.2338918146983% used
>> >>> > PS Perm Generation
>> >>> >    capacity = 90701824 (86.5MB)
>> >>> >    used     = 45348832 (43.248016357421875MB)
>> >>> >    free     = 45352992 (43.251983642578125MB)
>> >>> >    49.99770677158598% used
>> >>> >
>> >>> > 8432 interned Strings occupying 714672 bytes.
>> >>> >
>> >>> >
>> >>> > Executor GC log snippet:
>> >>> >
>> >>> > 168.778: [GC [PSYoungGen: 2702831K->578545K(2916864K)]
>> >>> > 9302453K->7460857K(9907712K), 0.3193550 secs] [Times: user=5.13
>> >>> > sys=0.39,
>> >>> > real=0.32 secs]
>> >>> > 169.097: [Full GC [PSYoungGen: 578545K->0K(2916864K)] [ParOldGen:
>> >>> > 6882312K->1073297K(6990848K)] 7460857K->1073297K(9907712K)
>> [PSPermGen:
>> >>> > 44248K->44201K(88576K)], 4.5521090 secs] [Times: user=24.22
>> sys=0.18,
>> >>> > real=4.55 secs]
>> >>> > 174.207: [GC [PSYoungGen: 2338304K->81315K(2544128K)]
>> >>> > 3411653K->1154665K(9534976K), 0.0966280 secs] [Times: user=1.66
>> >>> > sys=0.00,
>> >>> > real=0.09 secs]
>> >>> >
>> >>> > I tried to map partitions to cores on the nodes. Increasing the
>> >>> > number of partitions (say to 80 or 100) would result in progress till
>> >>> > the 6th iteration or so, but the next stage would stall as before,
>> >>> > with no apparent root cause in the logs. With increased partitions,
>> >>> > the last stage that completed had the following task times:
>> >>> >
>> >>> > Metric                             Min      25th     Median   75th     Max
>> >>> > Result serialization time          11 ms    12 ms    13 ms    15 ms    0.4 s
>> >>> > Duration                           0.5 s    0.9 s    1 s      3 s      7 s
>> >>> > Time spent fetching task results   0 ms     0 ms     0 ms     0 ms     0 ms
>> >>> > Scheduler delay                    5 s      6 s      6 s      7 s      12 s
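>> >>> >
>> >>> > (The partition count was varied roughly as below; a sketch, where
>> >>> > points is the input JavaRDD<LabeledPoint>:)
>> >>> >
>> >>> > // shuffle = true is needed to *increase* the partition count;
>> >>> > // without it coalesce can only merge partitions downwards
>> >>> > JavaRDD<LabeledPoint> repartitioned = points.coalesce(100, true);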
>> >>> >
>> >>> > My hypothesis is that as the coefficient array becomes less sparse
>> >>> > (with
>> >>> > successive iterations), the cost of the aggregate goes up to the
>> point
>> >>> > that
>> >>> > it stalls (which I failed to explain). Reducing the batch fraction
>> to a
>> >>> > very
>> >>> > low number like 0.01 saw the iterations progress further, but the
>> model
>> >>> > failed to converge in that case after a small number of iterations.
>> >>> >
>> >>> >
>> >>> > I also tried reducing the number of records by aggregating on (x,y)
>> >>> > as the key (i.e. using aggregations instead of training on every raw
>> >>> > record), but encountered the following exception:
>> >>> >
>> >>> > Loss was due to java.lang.NullPointerException
>> >>> > java.lang.NullPointerException
>> >>> >   at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
>> >>> >   at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
>> >>> >   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> >>> >   at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:59)
>> >>> >   at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
>> >>> >   at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
>> >>> >   at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
>> >>> >   at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
>> >>> >   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> >>> >   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>> >>> >   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>> >>> >   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
>> >>> >   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>> >>> >   at org.apache.spark.scheduler.Task.run(Task.scala:51)
>> >>> >   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>> >>> >   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >>> >   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >>> >   at java.lang.Thread.run(Thread.java:745)
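>> >>> >
>> >>> > (For context, the aggregation was along these lines; a sketch with a
>> >>> > hypothetical Record type and accessors. A null returned from call(),
>> >>> > e.g. for a malformed record, would be consistent with the NPE above
>> >>> > surfacing when the combiner consumes the iterator:)
>> >>> >
>> >>> > import org.apache.spark.api.java.JavaPairRDD;
>> >>> > import org.apache.spark.api.java.function.PairFunction;
>> >>> > import scala.Tuple2;
>> >>> >
>> >>> > JavaPairRDD<Tuple2<Long, Long>, Tuple2<Integer, Integer>> keyed =
>> >>> >     records.mapToPair(
>> >>> >         new PairFunction<Record, Tuple2<Long, Long>, Tuple2<Integer, Integer>>() {
>> >>> >           public Tuple2<Tuple2<Long, Long>, Tuple2<Integer, Integer>> call(Record r) {
>> >>> >             // Always return a non-null tuple; filter malformed records upstream.
>> >>> >             return new Tuple2<Tuple2<Long, Long>, Tuple2<Integer, Integer>>(
>> >>> >                 new Tuple2<Long, Long>(r.x(), r.y()),
>> >>> >                 new Tuple2<Integer, Integer>(r.count(), r.value()));
>> >>> >           }
>> >>> >         });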
>> >>> >
>> >>> >
>> >>> > I'd appreciate any insights/comments about what may be causing the
>> >>> > execution
>> >>> > to stall.
>> >>> >
>> >>> > If logs/tables appear poorly indented in the email, here's a gist
>> with
>> >>> > relevant details:
>> >>> > https://gist.github.com/reachbach/a418ab2f01b639b624c1
>> >>> >
>> >>> > Thanks,
>> >>> > Bharath
>> >
>> >
>>
>
>

Re: Execution stalls in LogisticRegressionWithSGD

Posted by Bharath Ravi Kumar <re...@gmail.com>.
Xiangrui,

The build did throw an error with maven2. Clearing all existing spark
binaries and deploying binaries explicitly built with maven3 is what
fixed the issue. I'm trying to trace back the origin environment of the
earlier binaries, and will update this thread if I find the ultimate root
cause.

Thanks,
Bharath



On Wed, Jul 9, 2014 at 11:13 PM, Xiangrui Meng <me...@gmail.com> wrote:

> We have maven-enforcer-plugin defined in the pom. I don't know why it
> didn't work for you. Could you try rebuild with maven2 and confirm
> that there is no error message? If that is the case, please create a
> JIRA for it. Thanks! -Xiangrui
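>
> (For reference, such a rule in the pom looks roughly like this; a sketch,
> and the exact minimum version bound may differ:)
>
> <plugin>
>   <groupId>org.apache.maven.plugins</groupId>
>   <artifactId>maven-enforcer-plugin</artifactId>
>   <executions>
>     <execution>
>       <id>enforce-versions</id>
>       <goals>
>         <goal>enforce</goal>
>       </goals>
>       <configuration>
>         <rules>
>           <requireMavenVersion>
>             <version>3.0.0</version>  <!-- treated as the minimum version -->
>           </requireMavenVersion>
>         </rules>
>       </configuration>
>     </execution>
>   </executions>
> </plugin>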
>
> On Wed, Jul 9, 2014 at 3:53 AM, Bharath Ravi Kumar <re...@gmail.com>
> wrote:
> > Xiangrui,
> >
> > Thanks for all the help in resolving this issue. The cause turned out to
> > be the build environment rather than runtime configuration. The build
> > process had picked up maven2 while building spark. Using binaries that
> > were rebuilt using m3, the entire processing went through fine. While I'm
> > aware that the build instruction page specifies m3 as the min requirement,
> > declaratively preventing accidental m2 usage (e.g. through something like
> > the maven enforcer plugin?) might help other developers avoid such issues.
> >
> > -Bharath
> >
> >
> >
> > On Mon, Jul 7, 2014 at 9:43 PM, Xiangrui Meng <me...@gmail.com> wrote:
> >>
> >> It seems to me a setup issue. I just tested news20.binary (1355191
> >> features) on a 2-node EC2 cluster and it worked well. I added one line
> >> to conf/spark-env.sh:
> >>
> >> export SPARK_JAVA_OPTS=" -Dspark.akka.frameSize=20 "
> >>
> >> and launched spark-shell with "--driver-memory 20g". Could you re-try
> >> with an EC2 setup? If it still doesn't work, please attach all your
> >> code and logs.
> >>
> >> Best,
> >> Xiangrui
> >>
> >> On Sun, Jul 6, 2014 at 1:35 AM, Bharath Ravi Kumar <reachbach@gmail.com
> >
> >> wrote:
> >> > Hi Xiangrui,
> >> >
> >> > 1) Yes, I used the same build (compiled locally from source) to the
> host
> >> > that has (master, slave1) and the second host with slave2.
> >> >
> >> > 2) The execution was successful when run in local mode with a reduced
> >> > number of partitions. Does this imply issues communicating/coordinating
> >> > across processes (i.e. driver, master and workers)?
> >> >
> >> > Thanks,
> >> > Bharath
> >> >
> >> >
> >> >
> >> > On Sun, Jul 6, 2014 at 11:37 AM, Xiangrui Meng <me...@gmail.com>
> wrote:
> >> >>
> >> >> Hi Bharath,
> >> >>
> >> >> 1) Did you sync the spark jar and conf to the worker nodes after build?
> >> >> 2) Since the dataset is not large, could you try local mode first
> >> >> using `spark-submit --driver-memory 12g --master local[*]`?
> >> >> 3) Try using a smaller number of partitions, say 5.
> >> >>
> >> >> If the problem is still there, please attach the full master/worker
> >> >> log files.
> >> >>
> >> >> Best,
> >> >> Xiangrui
> >> >>
> >> >> On Fri, Jul 4, 2014 at 12:16 AM, Bharath Ravi Kumar
> >> >> <re...@gmail.com>
> >> >> wrote:
> >> >> > Xiangrui,
> >> >> >
> >> >> > Leaving the frameSize unspecified led to an error message (and
> >> >> > failure) stating that the task size (~11M) was larger than the frame
> >> >> > size. I hence set it to an arbitrarily large value (I realize 500 was
> >> >> > unrealistic & unnecessary in this case). I've now set the size to 20M
> >> >> > and repeated the runs. The earlier runs were on an uncached RDD.
> >> >> > Caching the RDD (and setting spark.storage.memoryFraction=0.5)
> >> >> > resulted in a marginal speed up of execution, but the end result
> >> >> > remained the same. The cached RDD size is as follows:
> >> >> >
> >> >> > RDD Name  Storage Level                      Cached Partitions  Fraction Cached  Size in Memory  Size in Tachyon  Size on Disk
> >> >> > 1084      Memory Deserialized 1x Replicated  80                 100%             165.9 MB        0.0 B            0.0 B
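> >> >> >
> >> >> > (The caching was along these lines; a sketch, where points is the
> >> >> > input JavaRDD<LabeledPoint> and conf is set before the context is
> >> >> > created:)
> >> >> >
> >> >> > conf.set("spark.storage.memoryFraction", "0.5");
> >> >> > JavaRDD<LabeledPoint> cached = points.cache(); // default: MEMORY_ONLY,
> >> >> >                                                // deserialized, 1x replicated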
> >> >> >
> >> >> >
> >> >> >
> >> >> > The corresponding master logs were:
> >> >> >
> >> >> > 14/07/04 06:29:34 INFO Master: Removing executor
> >> >> > app-20140704062238-0033/1
> >> >> > because it is EXITED
> >> >> > 14/07/04 06:29:34 INFO Master: Launching executor
> >> >> > app-20140704062238-0033/2
> >> >> > on worker worker-20140630124441-slave1-40182
> >> >> > 14/07/04 06:29:34 INFO Master: Removing executor
> >> >> > app-20140704062238-0033/0
> >> >> > because it is EXITED
> >> >> > 14/07/04 06:29:34 INFO Master: Launching executor
> >> >> > app-20140704062238-0033/3
> >> >> > on worker worker-20140630102913-slave2-44735
> >> >> > 14/07/04 06:29:37 INFO Master: Removing executor
> >> >> > app-20140704062238-0033/2
> >> >> > because it is EXITED
> >> >> > 14/07/04 06:29:37 INFO Master: Launching executor
> >> >> > app-20140704062238-0033/4
> >> >> > on worker worker-20140630124441-slave1-40182
> >> >> > 14/07/04 06:29:37 INFO Master: Removing executor
> >> >> > app-20140704062238-0033/3
> >> >> > because it is EXITED
> >> >> > 14/07/04 06:29:37 INFO Master: Launching executor
> >> >> > app-20140704062238-0033/5
> >> >> > on worker worker-20140630102913-slave2-44735
> >> >> > 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
> >> >> > disassociated, removing it.
> >> >> > 14/07/04 06:29:39 INFO Master: Removing app app-20140704062238-0033
> >> >> > 14/07/04 06:29:39 INFO LocalActorRef: Message
> >> >> >
> [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying]
> >> >> > from
> >> >> > Actor[akka://sparkMaster/deadLetters] to
> >> >> >
> >> >> >
> >> >> >
> Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.1.135%3A33061-123#1986674260]
> >> >> > was not delivered. [39] dead letters encountered. This logging can
> be
> >> >> > turned
> >> >> > off or adjusted with configuration settings 'akka.log-dead-letters'
> >> >> > and
> >> >> > 'akka.log-dead-letters-during-shutdown'.
> >> >> > 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
> >> >> > disassociated, removing it.
> >> >> > 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
> >> >> > disassociated, removing it.
> >> >> > 14/07/04 06:29:39 ERROR EndpointWriter: AssociationError
> >> >> > [akka.tcp://sparkMaster@master:7077] ->
> >> >> > [akka.tcp://spark@slave2:45172]:
> >> >> > Error [Association failed with [akka.tcp://spark@slave2:45172]] [
> >> >> > akka.remote.EndpointAssociationException: Association failed with
> >> >> > [akka.tcp://spark@slave2:45172]
> >> >> > Caused by:
> >> >> >
> >> >> >
> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> >> >> > Connection refused: slave2/10.3.1.135:45172
> >> >> > ]
> >> >> > 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
> >> >> > disassociated, removing it.
> >> >> > 14/07/04 06:29:39 ERROR EndpointWriter: AssociationError
> >> >> > [akka.tcp://sparkMaster@master:7077] ->
> >> >> > [akka.tcp://spark@slave2:45172]:
> >> >> > Error [Association failed with [akka.tcp://spark@slave2:45172]] [
> >> >> > akka.remote.EndpointAssociationException: Association failed with
> >> >> > [akka.tcp://spark@slave2:45172]
> >> >> > Caused by:
> >> >> >
> >> >> >
> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> >> >> > Connection refused: slave2/10.3.1.135:45172
> >> >> > ]
> >> >> > 14/07/04 06:29:39 ERROR EndpointWriter: AssociationError
> >> >> > [akka.tcp://sparkMaster@master:7077] ->
> >> >> > [akka.tcp://spark@slave2:45172]:
> >> >> > Error [Association failed with [akka.tcp://spark@slave2:45172]] [
> >> >> > akka.remote.EndpointAssociationException: Association failed with
> >> >> > [akka.tcp://spark@slave2:45172]
> >> >> > Caused by:
> >> >> >
> >> >> >
> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> >> >> > Connection refused: slave2/10.3.1.135:45172
> >> >> > ]
> >> >> > 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
> >> >> > disassociated, removing it.
> >> >> > 14/07/04 06:29:40 WARN Master: Got status update for unknown
> executor
> >> >> > app-20140704062238-0033/5
> >> >> > 14/07/04 06:29:40 WARN Master: Got status update for unknown
> executor
> >> >> > app-20140704062238-0033/4
> >> >> >
> >> >> >
> >> >> > Coincidentally, after the initial executor failed, each following
> >> >> > executor that was re-spawned failed with the following logs (e.g. the
> >> >> > following was from
> >> >> > slave1:~/spark-1.0.1-rc1/work/app-20140704062238-0033/2/stderr):
> >> >> >
> >> >> > log4j:WARN No appenders could be found for logger
> >> >> > (org.apache.hadoop.conf.Configuration).
> >> >> > log4j:WARN Please initialize the log4j system properly.
> >> >> > log4j:WARN See
> http://logging.apache.org/log4j/1.2/faq.html#noconfig
> >> >> > for
> >> >> > more info.
> >> >> > 14/07/04 06:29:35 INFO SparkHadoopUtil: Using Spark's default log4j
> >> >> > profile:
> >> >> > org/apache/spark/log4j-defaults.properties
> >> >> > 14/07/04 06:29:35 INFO SecurityManager: Changing view acls to:
> user1
> >> >> > 14/07/04 06:29:35 INFO SecurityManager: SecurityManager:
> >> >> > authentication
> >> >> > disabled; ui acls disabled; users with view permissions: Set(user1)
> >> >> > 14/07/04 06:29:35 INFO Slf4jLogger: Slf4jLogger started
> >> >> > 14/07/04 06:29:35 INFO Remoting: Starting remoting
> >> >> > 14/07/04 06:29:36 INFO Remoting: Remoting started; listening on
> >> >> > addresses
> >> >> > :[akka.tcp://sparkExecutor@slave1:54782]
> >> >> > 14/07/04 06:29:36 INFO Remoting: Remoting now listens on addresses:
> >> >> > [akka.tcp://sparkExecutor@slave1:54782]
> >> >> > 14/07/04 06:29:36 INFO CoarseGrainedExecutorBackend: Connecting to
> >> >> > driver:
> >> >> > akka.tcp://spark@master:45172/user/CoarseGrainedScheduler
> >> >> > 14/07/04 06:29:36 INFO WorkerWatcher: Connecting to worker
> >> >> > akka.tcp://sparkWorker@slave1:40182/user/Worker
> >> >> > 14/07/04 06:29:36 INFO WorkerWatcher: Successfully connected to
> >> >> > akka.tcp://sparkWorker@slave1:40182/user/Worker
> >> >> > 14/07/04 06:29:36 INFO CoarseGrainedExecutorBackend: Successfully
> >> >> > registered
> >> >> > with driver
> >> >> > 14/07/04 06:29:36 INFO SecurityManager: Changing view acls to:
> user1
> >> >> > 14/07/04 06:29:36 INFO SecurityManager: SecurityManager:
> >> >> > authentication
> >> >> > disabled; ui acls disabled; users with view permissions: Set(user1)
> >> >> > 14/07/04 06:29:36 INFO Slf4jLogger: Slf4jLogger started
> >> >> > 14/07/04 06:29:36 INFO Remoting: Starting remoting
> >> >> > 14/07/04 06:29:36 INFO Remoting: Remoting started; listening on
> >> >> > addresses
> >> >> > :[akka.tcp://spark@slave1:39753]
> >> >> > 14/07/04 06:29:36 INFO SparkEnv: Connecting to MapOutputTracker:
> >> >> > akka.tcp://spark@master:45172/user/MapOutputTracker
> >> >> > 14/07/04 06:29:36 INFO SparkEnv: Connecting to BlockManagerMaster:
> >> >> > akka.tcp://spark@master:45172/user/BlockManagerMaster
> >> >> > 14/07/04 06:29:36 INFO DiskBlockManager: Created local directory at
> >> >> > /tmp/spark-local-20140704062936-6123
> >> >> > 14/07/04 06:29:36 INFO MemoryStore: MemoryStore started with
> capacity
> >> >> > 6.7
> >> >> > GB.
> >> >> > 14/07/04 06:29:36 INFO ConnectionManager: Bound socket to port
> 50960
> >> >> > with id
> >> >> > = ConnectionManagerId(slave1,50960)
> >> >> > 14/07/04 06:29:36 INFO BlockManagerMaster: Trying to register
> >> >> > BlockManager
> >> >> > 14/07/04 06:29:36 INFO BlockManagerMaster: Registered BlockManager
> >> >> > 14/07/04 06:29:36 INFO HttpFileServer: HTTP File server directory
> is
> >> >> > /tmp/spark-42c2782f-60f8-45a7-9e11-c789fc87fe2e
> >> >> > 14/07/04 06:29:36 INFO HttpServer: Starting HTTP Server
> >> >> > 14/07/04 06:29:36 ERROR CoarseGrainedExecutorBackend: Driver
> >> >> > Disassociated
> >> >> > [akka.tcp://sparkExecutor@slave1:54782] ->
> >> >> > [akka.tcp://spark@master:45172]
> >> >> > disassociated! Shutting down.
> >> >> >
> >> >> > In case of the initial executor that successfully started, the
> >> >> > corresponding
> >> >> > log messages (from
> >> >> > spark-1.0.1-rc1/work/app-20140704062238-0033/1/stderr) on
> >> >> > the executor were:
> >> >> > log4j:WARN No appenders could be found for logger
> >> >> > (org.apache.hadoop.conf.Configuration).
> >> >> > log4j:WARN Please initialize the log4j system properly.
> >> >> > log4j:WARN See
> http://logging.apache.org/log4j/1.2/faq.html#noconfig
> >> >> > for
> >> >> > more info.
> >> >> > 14/07/04 06:22:39 INFO SparkHadoopUtil: Using Spark's default log4j
> >> >> > profile:
> >> >> > org/apache/spark/log4j-defaults.properties
> >> >> > 14/07/04 06:22:39 INFO SecurityManager: Changing view acls to:
> user1
> >> >> > 14/07/04 06:22:39 INFO SecurityManager: SecurityManager:
> >> >> > authentication
> >> >> > disabled; ui acls disabled; users with view permissions: Set(user1)
> >> >> > 14/07/04 06:22:39 INFO Slf4jLogger: Slf4jLogger started
> >> >> > 14/07/04 06:22:39 INFO Remoting: Starting remoting
> >> >> > 14/07/04 06:22:39 INFO Remoting: Remoting started; listening on
> >> >> > addresses
> >> >> > :[akka.tcp://sparkExecutor@slave1:50806]
> >> >> > 14/07/04 06:22:39 INFO Remoting: Remoting now listens on addresses:
> >> >> > [akka.tcp://sparkExecutor@slave1:50806]
> >> >> > 14/07/04 06:22:39 INFO CoarseGrainedExecutorBackend: Connecting to
> >> >> > driver:
> >> >> > akka.tcp://spark@master:45172/user/CoarseGrainedScheduler
> >> >> > 14/07/04 06:22:39 INFO WorkerWatcher: Connecting to worker
> >> >> > akka.tcp://sparkWorker@slave1:40182/user/Worker
> >> >> > 14/07/04 06:22:39 INFO WorkerWatcher: Successfully connected to
> >> >> > akka.tcp://sparkWorker@slave1:40182/user/Worker
> >> >> > 14/07/04 06:22:40 INFO CoarseGrainedExecutorBackend: Successfully
> >> >> > registered
> >> >> > with driver
> >> >> > 14/07/04 06:22:40 INFO SecurityManager: Changing view acls to:
> user1
> >> >> > 14/07/04 06:22:40 INFO SecurityManager: SecurityManager:
> >> >> > authentication
> >> >> > disabled; ui acls disabled; users with view permissions: Set(user1)
> >> >> > 14/07/04 06:22:40 INFO Slf4jLogger: Slf4jLogger started
> >> >> > 14/07/04 06:22:40 INFO Remoting: Starting remoting
> >> >> > 14/07/04 06:22:40 INFO Remoting: Remoting started; listening on
> >> >> > addresses
> >> >> > :[akka.tcp://spark@slave1:38558]
> >> >> > 14/07/04 06:22:40 INFO SparkEnv: Connecting to MapOutputTracker:
> >> >> > akka.tcp://spark@master:45172/user/MapOutputTracker
> >> >> > 14/07/04 06:22:40 INFO SparkEnv: Connecting to BlockManagerMaster:
> >> >> > akka.tcp://spark@master:45172/user/BlockManagerMaster
> >> >> > 14/07/04 06:22:40 INFO DiskBlockManager: Created local directory at
> >> >> > /tmp/spark-local-20140704062240-6a65
> >> >> > 14/07/04 06:22:40 INFO MemoryStore: MemoryStore started with
> capacity
> >> >> > 6.7
> >> >> > GB.
> >> >> > 14/07/04 06:22:40 INFO ConnectionManager: Bound socket to port
> 46901
> >> >> > with id
> >> >> > = ConnectionManagerId(slave1,46901)
> >> >> > 14/07/04 06:22:40 INFO BlockManagerMaster: Trying to register
> >> >> > BlockManager
> >> >> > 14/07/04 06:22:40 INFO BlockManagerMaster: Registered BlockManager
> >> >> > 14/07/04 06:22:40 INFO HttpFileServer: HTTP File server directory
> is
> >> >> > /tmp/spark-9eba78f9-8ae9-477c-9338-7222ae6fe306
> >> >> > 14/07/04 06:22:40 INFO HttpServer: Starting HTTP Server
> >> >> > 14/07/04 06:22:42 INFO CoarseGrainedExecutorBackend: Got assigned
> >> >> > task 0
> >> >> > 14/07/04 06:22:42 INFO Executor: Running task ID 0
> >> >> > 14/07/04 06:22:42 INFO CoarseGrainedExecutorBackend: Got assigned
> >> >> > task 2
> >> >> > 14/07/04 06:22:42 INFO Executor: Running task ID 2
> >> >> > ...
> >> >> >
> >> >> >
> >> >> >
> >> >> > On Fri, Jul 4, 2014 at 5:52 AM, Xiangrui Meng <me...@gmail.com>
> >> >> > wrote:
> >> >> >>
> >> >> >> The feature dimension is small. You don't need a big
> akka.frameSize.
> >> >> >> The default one (10M) should be sufficient. Did you cache the data
> >> >> >> before calling LRWithSGD? -Xiangrui
> >> >> >>
> >> >> >> On Thu, Jul 3, 2014 at 10:02 AM, Bharath Ravi Kumar
> >> >> >> <re...@gmail.com>
> >> >> >> wrote:
> >> >> >> > I tried another run after setting the driver memory to 8G (and
> >> >> >> > spark.akka.frameSize = 500 on the executors and the driver). In
> >> >> >> > addition, I
> >> >> >> > also tried to reduce the amount of data that a single task
> >> >> >> > processes,
> >> >> >> > by
> >> >> >> > increasing the number of partitions (of the labeled points) to
> 120
> >> >> >> > (instead
> >> >> >> > of 2 used earlier), and then setting max cores to 2. That made
> no
> >> >> >> > difference
> >> >> >> > since, at the end of 120 tasks, the familiar error message
> >> >> >> > appeared
> >> >> >> > on a
> >> >> >> > slave:
> >> >> >> >
> >> >> >> > <snipped earlier logs>
> >> >> >> > 14/07/03 16:18:48 INFO CoarseGrainedExecutorBackend: Got
> assigned
> >> >> >> > task
> >> >> >> > 1436
> >> >> >> > 14/07/03 16:18:48 INFO Executor: Running task ID 1436
> >> >> >> > 14/07/03 16:18:53 INFO HadoopRDD: Input split:
> >> >> >> > file:~//2014-05-24-02/part-r-00014:0+2215337
> >> >> >> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
> >> >> >> > file:~//2014-05-24-02/part-r-00014:2215337+2215338
> >> >> >> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
> >> >> >> > file:~//2014-05-24-02/part-r-00003:0+2196429
> >> >> >> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
> >> >> >> > file:~//2014-05-24-02/part-r-00003:2196429+2196430
> >> >> >> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
> >> >> >> > file:~//2014-05-24-02/part-r-00010:0+2186751
> >> >> >> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
> >> >> >> > file:~//2014-05-24-02/part-r-00010:2186751+2186751
> >> >> >> > 14/07/03 16:18:54 INFO Executor: Serialized size of result for
> >> >> >> > 1436
> >> >> >> > is
> >> >> >> > 5958822
> >> >> >> > 14/07/03 16:18:54 INFO Executor: Sending result for 1436
> directly
> >> >> >> > to
> >> >> >> > driver
> >> >> >> > 14/07/03 16:18:54 INFO Executor: Finished task ID 1436
> >> >> >> > 14/07/03 16:18:54 INFO CoarseGrainedExecutorBackend: Got
> assigned
> >> >> >> > task
> >> >> >> > 1438
> >> >> >> > 14/07/03 16:18:54 INFO Executor: Running task ID 1438
> >> >> >> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
> >> >> >> > file:~//2014-05-24-02/part-r-00004:0+2209615
> >> >> >> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
> >> >> >> > file:~//2014-05-24-02/part-r-00004:2209615+2209616
> >> >> >> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
> >> >> >> > file:~//2014-05-24-02/part-r-00011:0+2202240
> >> >> >> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
> >> >> >> > file:~//2014-05-24-02/part-r-00011:2202240+2202240
> >> >> >> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
> >> >> >> > file:~//2014-05-24-02/part-r-00009:0+2194423
> >> >> >> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
> >> >> >> > file:~//2014-05-24-02/part-r-00009:2194423+2194424
> >> >> >> > 14/07/03 16:19:00 INFO Executor: Serialized size of result for
> >> >> >> > 1438
> >> >> >> > is
> >> >> >> > 5958822
> >> >> >> > 14/07/03 16:19:00 INFO Executor: Sending result for 1438
> directly
> >> >> >> > to
> >> >> >> > driver
> >> >> >> > 14/07/03 16:19:00 INFO Executor: Finished task ID 1438
> >> >> >> > 14/07/03 16:19:14 ERROR CoarseGrainedExecutorBackend: Driver
> >> >> >> > Disassociated
> >> >> >> > [akka.tcp://sparkExecutor@slave1:51099] ->
> >> >> >> > [akka.tcp://spark@master:58272]
> >> >> >> > disassociated! Shutting down.
> >> >> >> >
> >> >> >> >
> >> >> >> > The corresponding master logs were:
> >> >> >> >
> >> >> >> > 14/07/03 16:02:14 INFO Master: Registering app LogRegExp
> >> >> >> > 14/07/03 16:02:14 INFO Master: Registered app LogRegExp with ID
> >> >> >> > app-20140703160214-0028
> >> >> >> > 14/07/03 16:02:14 INFO Master: Launching executor
> >> >> >> > app-20140703160214-0028/1
> >> >> >> > on worker worker-20140630124441-slave1-40182
> >> >> >> > 14/07/03 16:19:15 INFO Master: Removing executor
> >> >> >> > app-20140703160214-0028/1
> >> >> >> > because it is EXITED
> >> >> >> > 14/07/03 16:19:15 INFO Master: Launching executor
> >> >> >> > app-20140703160214-0028/2
> >> >> >> > on worker worker-20140630124441-slave1-40182
> >> >> >> > 14/07/03 16:19:15 INFO Master: Removing executor
> >> >> >> > app-20140703160214-0028/0
> >> >> >> > because it is EXITED
> >> >> >> > 14/07/03 16:19:15 INFO Master: Launching executor
> >> >> >> > app-20140703160214-0028/3
> >> >> >> > on worker worker-20140630102913-slave2-44735
> >> >> >> > 14/07/03 16:19:18 INFO Master: Removing executor
> >> >> >> > app-20140703160214-0028/2
> >> >> >> > because it is EXITED
> >> >> >> > 14/07/03 16:19:18 INFO Master: Launching executor
> >> >> >> > app-20140703160214-0028/4
> >> >> >> > on worker worker-20140630124441-slave1-40182
> >> >> >> > 14/07/03 16:19:18 INFO Master: Removing executor
> >> >> >> > app-20140703160214-0028/3
> >> >> >> > because it is EXITED
> >> >> >> > 14/07/03 16:19:18 INFO Master: Launching executor
> >> >> >> > app-20140703160214-0028/5
> >> >> >> > on worker worker-20140630102913-slave2-44735
> >> >> >> > 14/07/03 16:19:20 INFO Master: akka.tcp://spark@master:58272
> got
> >> >> >> > disassociated, removing it.
> >> >> >> > 14/07/03 16:19:20 INFO Master: Removing app
> >> >> >> > app-20140703160214-0028
> >> >> >> > 14/07/03 16:19:20 INFO Master: akka.tcp://spark@master:58272
> got
> >> >> >> > disassociated, removing it.
> >> >> >> >
> >> >> >> >
> >> >> >> > Throughout the execution, I confirmed in the UI that driver
> >> >> >> > memory used was 0.0 B / 6.9 GB and each executor's memory showed
> >> >> >> > 0.0 B / 12.1 GB even when aggregate was being executed. On a
> >> >> >> > related note, I noticed in the executors tab that just before the
> >> >> >> > entire job terminated, executors on slave1, slave2 and the driver
> >> >> >> > "disappeared" momentarily from the active executors list. The
> >> >> >> > replacement executors on slave1 and slave2 were re-spawned a
> >> >> >> > couple of times and appeared on the executors list again before
> >> >> >> > they too died and the job failed.
> >> >> >> > So it appears that no matter what the task input-result size, the
> >> >> >> > execution fails at the end of the stage corresponding to
> >> >> >> > GradientDescent.aggregate (and the preceding count() in
> >> >> >> > GradientDescent goes through fine). Let me know if you need any
> >> >> >> > additional information.
> >> >> >> >
> >> >> >> >
> >> >> >> > On Thu, Jul 3, 2014 at 12:27 PM, Xiangrui Meng <
> mengxr@gmail.com>
> >> >> >> > wrote:
> >> >> >> >>
> >> >> >> >> Could you check the driver memory in the executor tab of the
> >> >> >> >> Spark
> >> >> >> >> UI
> >> >> >> >> when the job is running? If it is too small, please set
> >> >> >> >> --driver-memory with spark-submit, e.g. 10g. Could you also
> >> >> >> >> attach
> >> >> >> >> the
> >> >> >> >> master log under spark/logs as well? -Xiangrui
> >> >> >> >>
> >> >> >> >> On Wed, Jul 2, 2014 at 9:34 AM, Bharath Ravi Kumar
> >> >> >> >> <re...@gmail.com>
> >> >> >> >> wrote:
> >> >> >> >> > Hi Xiangrui,
> >> >> >> >> >
> >> >> >> >> > The issue with aggergating/counting over large feature
> vectors
> >> >> >> >> > (as
> >> >> >> >> > part
> >> >> >> >> > of
> >> >> >> >> > LogisticRegressionWithSGD) continues to exist, but now in
> >> >> >> >> > another
> >> >> >> >> > form:
> >> >> >> >> > while the execution doesn't freeze (due to SPARK-1112), it
> now
> >> >> >> >> > fails
> >> >> >> >> > at
> >> >> >> >> > the
> >> >> >> >> > second or third gradient descent iteration consistently with
> an
> >> >> >> >> > error
> >> >> >> >> > level
> >> >> >> >> > log message, but no stacktrace. I'm running against
> 1.0.1-rc1,
> >> >> >> >> > and
> >> >> >> >> > have
> >> >> >> >> > tried setting spark.akka.frameSize as high as 500. When the
> >> >> >> >> > execution
> >> >> >> >> > fails,
> >> >> >> >> > each of the two executors log the following message
> >> >> >> >> > (corresponding
> >> >> >> >> > to
> >> >> >> >> > aggregate at GradientDescent.scala:178) :
> >> >> >> >> >
> >> >> >> >> > 14/07/02 14:09:09 INFO
> >> >> >> >> > BlockFetcherIterator$BasicBlockFetcherIterator:
> >> >> >> >> > maxBytesInFlight: 50331648, targetRequestSize: 10066329
> >> >> >> >> > 14/07/02 14:09:09 INFO
> >> >> >> >> > BlockFetcherIterator$BasicBlockFetcherIterator:
> >> >> >> >> > Getting 2 non-empty blocks out of 2 blocks
> >> >> >> >> > 14/07/02 14:09:09 INFO
> >> >> >> >> > BlockFetcherIterator$BasicBlockFetcherIterator:
> >> >> >> >> > Started 1 remote fetches in 0 ms
> >> >> >> >> > 14/07/02 14:09:11 INFO Executor: Serialized size of result
> for
> >> >> >> >> > 737
> >> >> >> >> > is
> >> >> >> >> > 5959086
> >> >> >> >> > 14/07/02 14:09:11 INFO Executor: Sending result for 737
> >> >> >> >> > directly
> >> >> >> >> > to
> >> >> >> >> > driver
> >> >> >> >> > 14/07/02 14:09:11 INFO Executor: Finished task ID 737
> >> >> >> >> > 14/07/02 14:09:18 ERROR CoarseGrainedExecutorBackend: Driver
> >> >> >> >> > Disassociated
> >> >> >> >> > [akka.tcp://sparkExecutor@(slave1,slave2):51941] ->
> >> >> >> >> > [akka.tcp://spark@master:59487] disassociated! Shutting
> down.
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > There is no separate stacktrace on the driver side.
> >> >> >> >> >
> >> >> >> >> > Each input record is of the form p1, p2, (p1,p2) where p1,
> p2 &
> >> >> >> >> > (p1,p2)
> >> >> >> >> > are
> >> >> >> >> > categorical features with large cardinality, and X is the
> >> >> >> >> > double
> >> >> >> >> > label
> >> >> >> >> > with
> >> >> >> >> > a continuous value. The categorical variables are converted
> to
> >> >> >> >> > binary
> >> >> >> >> > variables which results in a feature vector of size 741092
> >> >> >> >> > (composed
> >> >> >> >> > of
> >> >> >> >> > all
> >> >> >> >> > unique categories across p1, p2 and (p1,p2)). Thus, the
> labeled
> >> >> >> >> > point
> >> >> >> >> > for
> >> >> >> >> > input record is a sparse vector of size 741092 with only 3
> >> >> >> >> > variables
> >> >> >> >> > set
> >> >> >> >> > in
> >> >> >> >> > the record. The total number of records is 683233 after
> >> >> >> >> > aggregating
> >> >> >> >> > the
> >> >> >> >> > input data on (p1, p2). When attempting to train on the
> >> >> >> >> > unaggregated
> >> >> >> >> > records
> >> >> >> >> > (1337907 in number spread across 455 files), the execution
> >> >> >> >> > fails
> >> >> >> >> > at
> >> >> >> >> > count,
> >> >> >> >> > GradientDescent.scala:161 with the following log
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > (Snipped lines corresponding to other input files)
> >> >> >> >> > 14/07/02 16:02:03 INFO HadoopRDD: Input split:
> >> >> >> >> > file:~/part-r-00012:2834590+2834590
> >> >> >> >> > 14/07/02 16:02:03 INFO HadoopRDD: Input split:
> >> >> >> >> > file:~/part-r-00005:0+2845559
> >> >> >> >> > 14/07/02 16:02:03 INFO HadoopRDD: Input split:
> >> >> >> >> > file:~/part-r-00005:2845559+2845560
> >> >> >> >> > 14/07/02 16:02:03 INFO Executor: Serialized size of result
> for
> >> >> >> >> > 726
> >> >> >> >> > is
> >> >> >> >> > 615
> >> >> >> >> > 14/07/02 16:02:03 INFO Executor: Sending result for 726
> >> >> >> >> > directly
> >> >> >> >> > to
> >> >> >> >> > driver
> >> >> >> >> > 14/07/02 16:02:03 INFO Executor: Finished task ID 726
> >> >> >> >> > 14/07/02 16:02:12 ERROR CoarseGrainedExecutorBackend: Driver
> >> >> >> >> > Disassociated
> >> >> >> >> > [akka.tcp://sparkExecutor@slave1:48423] ->
> >> >> >> >> > [akka.tcp://spark@master:55792]
> >> >> >> >> > disassociated! Shutting down.
> >> >> >> >> >
> >> >> >> >> > A count() attempted on the input RDD before beginning
> training
> >> >> >> >> > has
> >> >> >> >> > the
> >> >> >> >> > following metrics:
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > Metric            Min        25th    Median    75th     Max
> >> >> >> >> >
> >> >> >> >> > Result
> >> >> >> >> > serialization
> >> >> >> >> > time            0 ms    0 ms    0 ms    0 ms    0 ms
> >> >> >> >> >
> >> >> >> >> > Duration        33 s    33 s    35 s    35 s    35 s
> >> >> >> >> >
> >> >> >> >> > Time spent
> >> >> >> >> > fetching task
> >> >> >> >> > results            0 ms    0 ms    0 ms    0 ms    0 ms
> >> >> >> >> >
> >> >> >> >> > Scheduler
> >> >> >> >> > delay            0.1 s    0.1 s    0.3 s    0.3 s    0.3 s
> >> >> >> >> >
> >> >> >> >> > Aggregated Metrics by Executor
> >> >> >> >> >
> >> >> >> >> > ID     Address Task             Time Total Failed Succeeded
> >> >> >> >> > Shuffle
> >> >> >> >> > Read
> >> >> >> >> > Shuffle Write     Shuf Spill (Mem)     Shuf Spill (Disk)
> >> >> >> >> > 0     CANNOT FIND ADDRESS     34 s     1     0         1
> >> >> >> >> > 0.0
> >> >> >> >> > B
> >> >> >> >> > 0.0 B             0.0 B                 0.0 B
> >> >> >> >> > 1     CANNOT FIND ADDRESS     36 s     1     0         1
> >> >> >> >> > 0.0
> >> >> >> >> > B
> >> >> >> >> > 0.0 B             0.0 B                 0.0 B
> >> >> >> >> >
> >> >> >> >> > Tasks
> >> >> >> >> >
> >> >> >> >> > Task Index    Task ID    Status    Locality Level    Executor
> >> >> >> >> > Launch
> >> >> >> >> > Time
> >> >> >> >> > Duration    GC Time    Result Ser Time    Errors
> >> >> >> >> > 0     726     SUCCESS         PROCESS_LOCAL     slave1
> >> >> >> >> > 2014/07/02
> >> >> >> >> > 16:01:28 35 s         0.1 s
> >> >> >> >> > 1     727     SUCCESS         PROCESS_LOCAL     slave2
> >> >> >> >> > 2014/07/02
> >> >> >> >> > 16:01:28 33 s         99 ms
> >> >> >> >> >
> >> >> >> >> > Any pointers / diagnosis please?
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > On Thu, Jun 19, 2014 at 10:03 AM, Bharath Ravi Kumar
> >> >> >> >> > <re...@gmail.com>
> >> >> >> >> > wrote:
> >> >> >> >> >>
> >> >> >> >> >> Thanks. I'll await the fix to re-run my test.
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> On Thu, Jun 19, 2014 at 8:28 AM, Xiangrui Meng
> >> >> >> >> >> <me...@gmail.com>
> >> >> >> >> >> wrote:
> >> >> >> >> >>>
> >> >> >> >> >>> Hi Bharath,
> >> >> >> >> >>>
> >> >> >> >> >>> This is related to SPARK-1112, which we already found the
> >> >> >> >> >>> root
> >> >> >> >> >>> cause.
> >> >> >> >> >>> I will let you know when this is fixed.
> >> >> >> >> >>>
> >> >> >> >> >>> Best,
> >> >> >> >> >>> Xiangrui
> >> >> >> >> >>>
> >> >> >> >> >>> On Tue, Jun 17, 2014 at 7:37 PM, Bharath Ravi Kumar
> >> >> >> >> >>> <re...@gmail.com>
> >> >> >> >> >>> wrote:
> >> >> >> >> >>> > Couple more points:
> >> >> >> >> >>> > 1)The inexplicable stalling of execution with large
> feature
> >> >> >> >> >>> > sets
> >> >> >> >> >>> > appears
> >> >> >> >> >>> > similar to that reported with the news-20 dataset:
> >> >> >> >> >>> >
> >> >> >> >> >>> >
> >> >> >> >> >>> >
> >> >> >> >> >>> >
> >> >> >> >> >>> >
> >> >> >> >> >>> >
> http://mail-archives.apache.org/mod_mbox/spark-user/201406.mbox/%3C53A03542.1010605@gmail.com%3E
> >> >> >> >> >>> >
> >> >> >> >> >>> > 2) The NPE trying to call mapToPair convert an RDD<Long,
> >> >> >> >> >>> > Long,
> >> >> >> >> >>> > Integer,
> >> >> >> >> >>> > Integer> into a JavaPairRDD<Tuple2<Long,Long>,
> >> >> >> >> >>> > Tuple2<Integer,Integer>>
> >> >> >> >> >>> > is
> >> >> >> >> >>> > unrelated to mllib.
> >> >> >> >> >>> >
> >> >> >> >> >>> > Thanks,
> >> >> >> >> >>> > Bharath
> >> >> >> >> >>> >
> >> >> >> >> >>> >
> >> >> >> >> >>> >
> >> >> >> >> >>> > On Wed, Jun 18, 2014 at 7:14 AM, Bharath Ravi Kumar
> >> >> >> >> >>> > <re...@gmail.com>
> >> >> >> >> >>> > wrote:
> >> >> >> >> >>> >>
> >> >> >> >> >>> >> Hi  Xiangrui ,
> >> >> >> >> >>> >>
> >> >> >> >> >>> >> I'm using 1.0.0.
> >> >> >> >> >>> >>
> >> >> >> >> >>> >> Thanks,
> >> >> >> >> >>> >> Bharath
> >> >> >> >> >>> >>
> >> >> >> >> >>> >> On 18-Jun-2014 1:43 am, "Xiangrui Meng" <
> mengxr@gmail.com>
> >> >> >> >> >>> >> wrote:
> >> >> >> >> >>> >>>
> >> >> >> >> >>> >>> Hi Bharath,
> >> >> >> >> >>> >>>
> >> >> >> >> >>> >>> Thanks for posting the details! Which Spark version are
> >> >> >> >> >>> >>> you
> >> >> >> >> >>> >>> using?
> >> >> >> >> >>> >>>
> >> >> >> >> >>> >>> Best,
> >> >> >> >> >>> >>> Xiangrui
> >> >> >> >> >>> >>>
> >> >> >> >> >>> >>> On Tue, Jun 17, 2014 at 6:48 AM, Bharath Ravi Kumar
> >> >> >> >> >>> >>> <re...@gmail.com>
> >> >> >> >> >>> >>> wrote:
> >> >> >> >> >>> >>> > Hi,
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> > (Apologies for the long mail, but it's necessary to
> >> >> >> >> >>> >>> > provide
> >> >> >> >> >>> >>> > sufficient
> >> >> >> >> >>> >>> > details considering the number of issues faced.)
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> > I'm running into issues testing
> >> >> >> >> >>> >>> > LogisticRegressionWithSGD
> >> >> >> >> >>> >>> > a
> >> >> >> >> >>> >>> > two
> >> >> >> >> >>> >>> > node
> >> >> >> >> >>> >>> > cluster
> >> >> >> >> >>> >>> > (each node with 24 cores and 16G available to slaves
> >> >> >> >> >>> >>> > out
> >> >> >> >> >>> >>> > of
> >> >> >> >> >>> >>> > 24G
> >> >> >> >> >>> >>> > on
> >> >> >> >> >>> >>> > the
> >> >> >> >> >>> >>> > system). Here's a description of the application:
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> > The model is being trained based on categorical
> >> >> >> >> >>> >>> > features
> >> >> >> >> >>> >>> > x,
> >> >> >> >> >>> >>> > y,
> >> >> >> >> >>> >>> > and
> >> >> >> >> >>> >>> > (x,y).
> >> >> >> >> >>> >>> > The categorical features are mapped to binary
> features
> >> >> >> >> >>> >>> > by
> >> >> >> >> >>> >>> > converting
> >> >> >> >> >>> >>> > each
> >> >> >> >> >>> >>> > distinct value in the category enum into a binary
> >> >> >> >> >>> >>> > feature
> >> >> >> >> >>> >>> > by
> >> >> >> >> >>> >>> > itself
> >> >> >> >> >>> >>> > (i.e
> >> >> >> >> >>> >>> > presence of that value in a record implies
> >> >> >> >> >>> >>> > corresponding
> >> >> >> >> >>> >>> > feature
> >> >> >> >> >>> >>> > =
> >> >> >> >> >>> >>> > 1,
> >> >> >> >> >>> >>> > else
> >> >> >> >> >>> >>> > feature = 0. So, there'd be as many distinct features
> >> >> >> >> >>> >>> > as
> >> >> >> >> >>> >>> > enum
> >> >> >> >> >>> >>> > values) .
> >> >> >> >> >>> >>> > The
> >> >> >> >> >>> >>> > training vector is laid out as
> >> >> >> >> >>> >>> > [x1,x2...xn,y1,y2....yn,(x1,y1),(x2,y2)...(xn,yn)].
> >> >> >> >> >>> >>> > Each
> >> >> >> >> >>> >>> > record
> >> >> >> >> >>> >>> > in
> >> >> >> >> >>> >>> > the
> >> >> >> >> >>> >>> > training data has only one combination (Xk,Yk) and a
> >> >> >> >> >>> >>> > label
> >> >> >> >> >>> >>> > appearing in
> >> >> >> >> >>> >>> > the
> >> >> >> >> >>> >>> > record. Thus, the corresponding labeledpoint sparse
> >> >> >> >> >>> >>> > vector
> >> >> >> >> >>> >>> > would
> >> >> >> >> >>> >>> > only
> >> >> >> >> >>> >>> > have 3
> >> >> >> >> >>> >>> > values Xk, Yk, (Xk,Yk) set for a record. The total
> >> >> >> >> >>> >>> > length
> >> >> >> >> >>> >>> > of
> >> >> >> >> >>> >>> > the
> >> >> >> >> >>> >>> > vector
> >> >> >> >> >>> >>> > (though parse) would be nearly 614000.  The number of
> >> >> >> >> >>> >>> > records
> >> >> >> >> >>> >>> > is
> >> >> >> >> >>> >>> > about
> >> >> >> >> >>> >>> > 1.33
> >> >> >> >> >>> >>> > million. The records have been coalesced into 20
> >> >> >> >> >>> >>> > partitions
> >> >> >> >> >>> >>> > across
> >> >> >> >> >>> >>> > two
> >> >> >> >> >>> >>> > nodes. The input data has not been cached.
> >> >> >> >> >>> >>> > (NOTE: I do realize the records & features may seem
> >> >> >> >> >>> >>> > large
> >> >> >> >> >>> >>> > for
> >> >> >> >> >>> >>> > a
> >> >> >> >> >>> >>> > two
> >> >> >> >> >>> >>> > node
> >> >> >> >> >>> >>> > setup, but given the memory & cpu, and the fact that
> >> >> >> >> >>> >>> > I'm
> >> >> >> >> >>> >>> > willing
> >> >> >> >> >>> >>> > to
> >> >> >> >> >>> >>> > give up
> >> >> >> >> >>> >>> > some turnaround time, I don't see why tasks should
> >> >> >> >> >>> >>> > inexplicably
> >> >> >> >> >>> >>> > fail)
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> > Additional parameters include:
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> > spark.executor.memory = 14G
> >> >> >> >> >>> >>> > spark.default.parallelism = 1
> >> >> >> >> >>> >>> > spark.cores.max=20
> >> >> >> >> >>> >>> > spark.storage.memoryFraction=0.8 //No cache space
> >> >> >> >> >>> >>> > required
> >> >> >> >> >>> >>> > (Trying to set spark.akka.frameSize to a larger
> number,
> >> >> >> >> >>> >>> > say,
> >> >> >> >> >>> >>> > 20
> >> >> >> >> >>> >>> > didn't
> >> >> >> >> >>> >>> > help
> >> >> >> >> >>> >>> > either)
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> > The model training was initialized as : new
> >> >> >> >> >>> >>> > LogisticRegressionWithSGD(1,
> >> >> >> >> >>> >>> > maxIterations, 0.0, 0.05)
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> > However, after 4 iterations of gradient descent, the
> >> >> >> >> >>> >>> > entire execution appeared to stall inexplicably. The
> >> >> >> >> >>> >>> > corresponding executor details and details of the
> >> >> >> >> >>> >>> > stalled stage (number 14) are as follows:
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> > Metric                            Min      25th     Median   75th     Max
> >> >> >> >> >>> >>> > Result serialization time         12 ms    13 ms    14 ms    16 ms    18 ms
> >> >> >> >> >>> >>> > Duration                          4 s      4 s      5 s      5 s      5 s
> >> >> >> >> >>> >>> > Time spent fetching task results  0 ms     0 ms     0 ms     0 ms     0 ms
> >> >> >> >> >>> >>> > Scheduler delay                   6 s      6 s      6 s      6 s      12 s
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> > Stage Id
> >> >> >> >> >>> >>> > 14 aggregate at GradientDescent.scala:178
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> > Task Index    Task ID    Status    Locality Level    Executor    Launch Time    Duration    GC Time    Result Ser Time    Errors
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> > 0     600     RUNNING     PROCESS_LOCAL
> >> >> >> >> >>> >>> > serious.dataone.foo.bar.com
> >> >> >> >> >>> >>> > 2014/06/17 10:32:27     1.1 h
> >> >> >> >> >>> >>> > 1     601     RUNNING     PROCESS_LOCAL
> >> >> >> >> >>> >>> > casual.dataone.foo.bar.com
> >> >> >> >> >>> >>> > 2014/06/17 10:32:27         1.1 h
> >> >> >> >> >>> >>> > 2     602     RUNNING     PROCESS_LOCAL
> >> >> >> >> >>> >>> > serious.dataone.foo.bar.com
> >> >> >> >> >>> >>> > 2014/06/17 10:32:27     1.1 h
> >> >> >> >> >>> >>> > 3     603     RUNNING     PROCESS_LOCAL
> >> >> >> >> >>> >>> > casual.dataone.foo.bar.com
> >> >> >> >> >>> >>> > 2014/06/17 10:32:27         1.1 h
> >> >> >> >> >>> >>> > 4     604     RUNNING     PROCESS_LOCAL
> >> >> >> >> >>> >>> > serious.dataone.foo.bar.com
> >> >> >> >> >>> >>> > 2014/06/17 10:32:27     1.1 h
> >> >> >> >> >>> >>> > 5     605     SUCCESS     PROCESS_LOCAL
> >> >> >> >> >>> >>> > casual.dataone.foo.bar.com
> >> >> >> >> >>> >>> > 2014/06/17 10:32:27 4 s     2 s     12 ms
> >> >> >> >> >>> >>> > 6     606     SUCCESS     PROCESS_LOCAL
> >> >> >> >> >>> >>> > serious.dataone.foo.bar.com
> >> >> >> >> >>> >>> > 2014/06/17 10:32:27     4 s     1 s     14 ms
> >> >> >> >> >>> >>> > 7     607     SUCCESS     PROCESS_LOCAL
> >> >> >> >> >>> >>> > casual.dataone.foo.bar.com
> >> >> >> >> >>> >>> > 2014/06/17 10:32:27 4 s     2 s     12 ms
> >> >> >> >> >>> >>> > 8     608     SUCCESS     PROCESS_LOCAL
> >> >> >> >> >>> >>> > serious.dataone.foo.bar.com
> >> >> >> >> >>> >>> > 2014/06/17 10:32:27     5 s     1 s     15 ms
> >> >> >> >> >>> >>> > 9     609     SUCCESS     PROCESS_LOCAL
> >> >> >> >> >>> >>> > casual.dataone.foo.bar.com
> >> >> >> >> >>> >>> > 2014/06/17 10:32:27 5 s     1 s     14 ms
> >> >> >> >> >>> >>> > 10     610     SUCCESS     PROCESS_LOCAL
> >> >> >> >> >>> >>> > serious.dataone.foo.bar.com
> >> >> >> >> >>> >>> > 2014/06/17 10:32:27     5 s     1 s     15 ms
> >> >> >> >> >>> >>> > 11     611     SUCCESS     PROCESS_LOCAL
> >> >> >> >> >>> >>> > casual.dataone.foo.bar.com
> >> >> >> >> >>> >>> > 2014/06/17 10:32:27 4 s     1 s     13 ms
> >> >> >> >> >>> >>> > 12     612     SUCCESS     PROCESS_LOCAL
> >> >> >> >> >>> >>> > serious.dataone.foo.bar.com
> >> >> >> >> >>> >>> > 2014/06/17 10:32:27     5 s     1 s     18 ms
> >> >> >> >> >>> >>> > 13     613     SUCCESS     PROCESS_LOCAL
> >> >> >> >> >>> >>> > casual.dataone.foo.bar.com
> >> >> >> >> >>> >>> > 2014/06/17 10:32:27 5 s     1 s     13 ms
> >> >> >> >> >>> >>> > 14     614     SUCCESS     PROCESS_LOCAL
> >> >> >> >> >>> >>> > serious.dataone.foo.bar.com
> >> >> >> >> >>> >>> > 2014/06/17 10:32:27     4 s     1 s     14 ms
> >> >> >> >> >>> >>> > 15     615     SUCCESS     PROCESS_LOCAL
> >> >> >> >> >>> >>> > casual.dataone.foo.bar.com
> >> >> >> >> >>> >>> > 2014/06/17 10:32:27 4 s     1 s     12 ms
> >> >> >> >> >>> >>> > 16     616     SUCCESS     PROCESS_LOCAL
> >> >> >> >> >>> >>> > serious.dataone.foo.bar.com
> >> >> >> >> >>> >>> > 2014/06/17 10:32:27     5 s     1 s     15 ms
> >> >> >> >> >>> >>> > 17     617     SUCCESS     PROCESS_LOCAL
> >> >> >> >> >>> >>> > casual.dataone.foo.bar.com
> >> >> >> >> >>> >>> > 2014/06/17 10:32:27 5 s     1 s     18 ms
> >> >> >> >> >>> >>> > 18     618     SUCCESS     PROCESS_LOCAL
> >> >> >> >> >>> >>> > serious.dataone.foo.bar.com
> >> >> >> >> >>> >>> > 2014/06/17 10:32:27     5 s     1 s     16 ms
> >> >> >> >> >>> >>> > 19     619     SUCCESS     PROCESS_LOCAL
> >> >> >> >> >>> >>> > casual.dataone.foo.bar.com
> >> >> >> >> >>> >>> > 2014/06/17 10:32:27 4 s     1 s     18 ms
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> > Executor stats:
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> > RDD Blocks  Memory Used     Disk Used  Active Tasks  Failed Tasks  Complete Tasks  Total Tasks  Task Time  Shuffle Read  Shuffle Write
> >> >> >> >> >>> >>> > 0           0.0 B / 6.7 GB  0.0 B      2             0             307             309          23.2 m     0.0 B         0.0 B
> >> >> >> >> >>> >>> > 0           0.0 B / 6.7 GB  0.0 B      3             0             308             311          22.4 m     0.0 B         0.0 B
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> > Executor jmap output:
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> > Server compiler detected.
> >> >> >> >> >>> >>> > JVM version is 24.55-b03
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> > using thread-local object allocation.
> >> >> >> >> >>> >>> > Parallel GC with 18 thread(s)
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> > Heap Configuration:
> >> >> >> >> >>> >>> >    MinHeapFreeRatio = 40
> >> >> >> >> >>> >>> >    MaxHeapFreeRatio = 70
> >> >> >> >> >>> >>> >    MaxHeapSize      = 10737418240 (10240.0MB)
> >> >> >> >> >>> >>> >    NewSize          = 1310720 (1.25MB)
> >> >> >> >> >>> >>> >    MaxNewSize       = 17592186044415 MB
> >> >> >> >> >>> >>> >    OldSize          = 5439488 (5.1875MB)
> >> >> >> >> >>> >>> >    NewRatio         = 2
> >> >> >> >> >>> >>> >    SurvivorRatio    = 8
> >> >> >> >> >>> >>> >    PermSize         = 21757952 (20.75MB)
> >> >> >> >> >>> >>> >    MaxPermSize      = 134217728 (128.0MB)
> >> >> >> >> >>> >>> >    G1HeapRegionSize = 0 (0.0MB)
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> > Heap Usage:
> >> >> >> >> >>> >>> > PS Young Generation
> >> >> >> >> >>> >>> > Eden Space:
> >> >> >> >> >>> >>> >    capacity = 2783969280 (2655.0MB)
> >> >> >> >> >>> >>> >    used     = 192583816 (183.66223907470703MB)
> >> >> >> >> >>> >>> >    free     = 2591385464 (2471.337760925293MB)
> >> >> >> >> >>> >>> >    6.917598458557704% used
> >> >> >> >> >>> >>> > From Space:
> >> >> >> >> >>> >>> >    capacity = 409993216 (391.0MB)
> >> >> >> >> >>> >>> >    used     = 1179808 (1.125152587890625MB)
> >> >> >> >> >>> >>> >    free     = 408813408 (389.8748474121094MB)
> >> >> >> >> >>> >>> >    0.2877628102022059% used
> >> >> >> >> >>> >>> > To Space:
> >> >> >> >> >>> >>> >    capacity = 385351680 (367.5MB)
> >> >> >> >> >>> >>> >    used     = 0 (0.0MB)
> >> >> >> >> >>> >>> >    free     = 385351680 (367.5MB)
> >> >> >> >> >>> >>> >    0.0% used
> >> >> >> >> >>> >>> > PS Old Generation
> >> >> >> >> >>> >>> >    capacity = 7158628352 (6827.0MB)
> >> >> >> >> >>> >>> >    used     = 4455093024 (4248.707794189453MB)
> >> >> >> >> >>> >>> >    free     = 2703535328 (2578.292205810547MB)
> >> >> >> >> >>> >>> >    62.2338918146983% used
> >> >> >> >> >>> >>> > PS Perm Generation
> >> >> >> >> >>> >>> >    capacity = 90701824 (86.5MB)
> >> >> >> >> >>> >>> >    used     = 45348832 (43.248016357421875MB)
> >> >> >> >> >>> >>> >    free     = 45352992 (43.251983642578125MB)
> >> >> >> >> >>> >>> >    49.99770677158598% used
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> > 8432 interned Strings occupying 714672 bytes.
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> > Executor GC log snippet:
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> > 168.778: [GC [PSYoungGen: 2702831K->578545K(2916864K)] 9302453K->7460857K(9907712K), 0.3193550 secs] [Times: user=5.13 sys=0.39, real=0.32 secs]
> >> >> >> >> >>> >>> > 169.097: [Full GC [PSYoungGen: 578545K->0K(2916864K)] [ParOldGen: 6882312K->1073297K(6990848K)] 7460857K->1073297K(9907712K) [PSPermGen: 44248K->44201K(88576K)], 4.5521090 secs] [Times: user=24.22 sys=0.18, real=4.55 secs]
> >> >> >> >> >>> >>> > 174.207: [GC [PSYoungGen: 2338304K->81315K(2544128K)] 3411653K->1154665K(9534976K), 0.0966280 secs] [Times: user=1.66 sys=0.00, real=0.09 secs]
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> > I tried to map partitions to cores on the nodes.
> >> >> >> >> >>> >>> > Increasing the number of partitions (say to 80 or
> >> >> >> >> >>> >>> > 100) would result in progress till the 6th iteration
> >> >> >> >> >>> >>> > or so, but the next stage would stall as before, with
> >> >> >> >> >>> >>> > no apparent root cause / logs. With increased
> >> >> >> >> >>> >>> > partitions, the last stage that completed had the
> >> >> >> >> >>> >>> > following task times:
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> > Metric                            Min      25th     Median   75th     Max
> >> >> >> >> >>> >>> > Result serialization time         11 ms    12 ms    13 ms    15 ms    0.4 s
> >> >> >> >> >>> >>> > Duration                          0.5 s    0.9 s    1 s      3 s      7 s
> >> >> >> >> >>> >>> > Time spent fetching task results  0 ms     0 ms     0 ms     0 ms     0 ms
> >> >> >> >> >>> >>> > Scheduler delay                   5 s      6 s      6 s      7 s      12 s
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> > My hypothesis is that as the coefficient array
> >> >> >> >> >>> >>> > becomes less sparse (with successive iterations), the
> >> >> >> >> >>> >>> > cost of the aggregate goes up to the point that it
> >> >> >> >> >>> >>> > stalls (which I failed to explain). Reducing the batch
> >> >> >> >> >>> >>> > fraction to a very low number like 0.01 saw the
> >> >> >> >> >>> >>> > iterations progress further, but the model failed to
> >> >> >> >> >>> >>> > converge in that case after a small number of
> >> >> >> >> >>> >>> > iterations.
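> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> > For reference, the batch fraction above was varied via
> >> >> >> >> >>> >>> > the fourth constructor argument, or equivalently on the
> >> >> >> >> >>> >>> > optimizer (a sketch; trainingData stands for the
> >> >> >> >> >>> >>> > RDD[LabeledPoint] built from the records):
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> >     val lr = new LogisticRegressionWithSGD(1, maxIterations, 0.0, 0.05)
> >> >> >> >> >>> >>> >     // fraction of the data sampled for each SGD iteration
> >> >> >> >> >>> >>> >     lr.optimizer.setMiniBatchFraction(0.01)
> >> >> >> >> >>> >>> >     val model = lr.run(trainingData)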
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> > I also tried reducing the number of records by
> >> >> >> >> >>> >>> > aggregating on (x,y) as the key (i.e. using
> >> >> >> >> >>> >>> > aggregations instead of training on every raw record),
> >> >> >> >> >>> >>> > but encountered the following exception:
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> > Loss was due to java.lang.NullPointerException
> >> >> >> >> >>> >>> > java.lang.NullPointerException
> >> >> >> >> >>> >>> >         at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
> >> >> >> >> >>> >>> >         at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
> >> >> >> >> >>> >>> >         at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> >> >> >> >> >>> >>> >         at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:59)
> >> >> >> >> >>> >>> >         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
> >> >> >> >> >>> >>> >         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
> >> >> >> >> >>> >>> >         at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
> >> >> >> >> >>> >>> >         at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
> >> >> >> >> >>> >>> >         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> >> >> >> >> >>> >>> >         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> >> >> >> >> >>> >>> >         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> >> >> >> >> >>> >>> >         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
> >> >> >> >> >>> >>> >         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> >> >> >> >> >>> >>> >         at org.apache.spark.scheduler.Task.run(Task.scala:51)
> >> >> >> >> >>> >>> >         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
> >> >> >> >> >>> >>> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >> >> >> >> >>> >>> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >> >> >> >> >>> >>> >         at java.lang.Thread.run(Thread.java:745)
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> > I'd appreciate any insights/comments about what may
> >> >> >> >> >>> >>> > be causing the execution to stall.
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> > If logs/tables appear poorly indented in the email,
> >> >> >> >> >>> >>> > here's a gist with relevant details:
> >> >> >> >> >>> >>> > https://gist.github.com/reachbach/a418ab2f01b639b624c1
> >> >> >> >> >>> >>> >
> >> >> >> >> >>> >>> > Thanks,
> >> >> >> >> >>> >>> > Bharath
Re: Execution stalls in LogisticRegressionWithSGD

Posted by Xiangrui Meng <me...@gmail.com>.
We have the maven-enforcer-plugin defined in the pom. I don't know why it
didn't work for you. Could you try rebuilding with maven2 and confirm
that there is no error message? If that is the case, please create a
JIRA for it. Thanks! -Xiangrui
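
P.S. For reference, the kind of enforcer rule being discussed looks
roughly like this in a pom (illustrative only; the exact rule and
version range in Spark's pom may differ):

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-enforcer-plugin</artifactId>
      <executions>
        <execution>
          <id>enforce-versions</id>
          <goals><goal>enforce</goal></goals>
          <configuration>
            <rules>
              <requireMavenVersion>
                <!-- fail fast when the build is run with maven2 -->
                <version>[3.0.0,)</version>
              </requireMavenVersion>
            </rules>
          </configuration>
        </execution>
      </executions>
    </plugin>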

On Wed, Jul 9, 2014 at 3:53 AM, Bharath Ravi Kumar <re...@gmail.com> wrote:
> Xiangrui,
>
> Thanks for all the help in resolving this issue. The cause turned out to
> be the build environment rather than runtime configuration. The build process
> had picked up maven2 while building spark. Using binaries that were rebuilt
> using m3, the entire processing went through fine. While I'm aware that the
> build instruction page specifies m3 as the min requirement, declaratively
> preventing accidental m2 usage (e.g. through something like the maven
> enforcer plugin?) might help other developers avoid such issues.
>
> -Bharath
>
>
>
> On Mon, Jul 7, 2014 at 9:43 PM, Xiangrui Meng <me...@gmail.com> wrote:
>>
>> This seems to me to be a setup issue. I just tested news20.binary (1355191
>> features) on a 2-node EC2 cluster and it worked well. I added one line
>> to conf/spark-env.sh:
>>
>> export SPARK_JAVA_OPTS=" -Dspark.akka.frameSize=20 "
>>
>> and launched spark-shell with "--driver-memory 20g". Could you re-try
>> with an EC2 setup? If it still doesn't work, please attach all your
>> code and logs.
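>>
>> Equivalently, the frame size can be set on the SparkConf before the
>> context is created. A minimal sketch (the app name is only a
>> placeholder):
>>
>>     import org.apache.spark.{SparkConf, SparkContext}
>>
>>     val conf = new SparkConf()
>>       .setAppName("news20-test")           // placeholder name
>>       .set("spark.akka.frameSize", "20")   // MB, same as the env setting
>>     val sc = new SparkContext(conf)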
>>
>> Best,
>> Xiangrui
>>
>> On Sun, Jul 6, 2014 at 1:35 AM, Bharath Ravi Kumar <re...@gmail.com>
>> wrote:
>> > Hi Xiangrui,
>> >
>> > 1) Yes, I synced the same build (compiled locally from source) to the host
>> > that has (master, slave1) and the second host with slave2.
>> >
>> > 2) The execution was successful when run in local mode with reduced
>> > number
>> > of partitions. Does this imply issues communicating/coordinating across
>> > processes (i.e. driver, master and workers)?
>> >
>> > Thanks,
>> > Bharath
>> >
>> >
>> >
>> > On Sun, Jul 6, 2014 at 11:37 AM, Xiangrui Meng <me...@gmail.com> wrote:
>> >>
>> >> Hi Bharath,
>> >>
>> >> 1) Did you sync the spark jar and conf to the worker nodes after build?
>> >> 2) Since the dataset is not large, could you try local mode first
>> >> using `spark-submit --driver-memory 12g --master local[*]`?
>> >> 3) Try using a smaller number of partitions, say 5 (see the sketch below).
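>> >>
>> >> A sketch of (3), assuming the labeled points are already in an RDD
>> >> named points:
>> >>
>> >>     // fewer, larger partitions; coalesce narrows without a shuffle
>> >>     val repartitioned = points.coalesce(5)
>> >>     repartitioned.count()   // quick sanity check before training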
>> >>
>> >> If the problem is still there, please attach the full master/worker log
>> >> files.
>> >>
>> >> Best,
>> >> Xiangrui
>> >>
>> >> On Fri, Jul 4, 2014 at 12:16 AM, Bharath Ravi Kumar
>> >> <re...@gmail.com>
>> >> wrote:
>> >> > Xiangrui,
>> >> >
>> >> > Leaving the frameSize unspecified led to an error message (and
>> >> > failure) stating that the task size (~11M) was larger than the frame
>> >> > size. I hence set it to an arbitrarily large value (I realize 500 was
>> >> > unrealistic & unnecessary in this case). I've now set the size to 20M
>> >> > and repeated the runs. The earlier runs were on an uncached RDD.
>> >> > Caching the RDD (and setting spark.storage.memoryFraction=0.5)
>> >> > resulted in a marginal speed up of execution, but the end result
>> >> > remained the same. The cached RDD size is as follows:
>> >> >
>> >> > RDD Name  Storage Level                      Cached Partitions  Fraction Cached  Size in Memory  Size in Tachyon  Size on Disk
>> >> > 1084      Memory Deserialized 1x Replicated  80                 100%             165.9 MB        0.0 B            0.0 B
>> >> >
>> >> >
>> >> >
>> >> > The corresponding master logs were:
>> >> >
>> >> > 14/07/04 06:29:34 INFO Master: Removing executor app-20140704062238-0033/1 because it is EXITED
>> >> > 14/07/04 06:29:34 INFO Master: Launching executor app-20140704062238-0033/2 on worker worker-20140630124441-slave1-40182
>> >> > 14/07/04 06:29:34 INFO Master: Removing executor app-20140704062238-0033/0 because it is EXITED
>> >> > 14/07/04 06:29:34 INFO Master: Launching executor app-20140704062238-0033/3 on worker worker-20140630102913-slave2-44735
>> >> > 14/07/04 06:29:37 INFO Master: Removing executor app-20140704062238-0033/2 because it is EXITED
>> >> > 14/07/04 06:29:37 INFO Master: Launching executor app-20140704062238-0033/4 on worker worker-20140630124441-slave1-40182
>> >> > 14/07/04 06:29:37 INFO Master: Removing executor app-20140704062238-0033/3 because it is EXITED
>> >> > 14/07/04 06:29:37 INFO Master: Launching executor app-20140704062238-0033/5 on worker worker-20140630102913-slave2-44735
>> >> > 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
>> >> > disassociated, removing it.
>> >> > 14/07/04 06:29:39 INFO Master: Removing app app-20140704062238-0033
>> >> > 14/07/04 06:29:39 INFO LocalActorRef: Message
>> >> > [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying]
>> >> > from
>> >> > Actor[akka://sparkMaster/deadLetters] to
>> >> >
>> >> >
>> >> > Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.1.135%3A33061-123#1986674260]
>> >> > was not delivered. [39] dead letters encountered. This logging can be
>> >> > turned
>> >> > off or adjusted with configuration settings 'akka.log-dead-letters'
>> >> > and
>> >> > 'akka.log-dead-letters-during-shutdown'.
>> >> > 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
>> >> > disassociated, removing it.
>> >> > 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
>> >> > disassociated, removing it.
>> >> > 14/07/04 06:29:39 ERROR EndpointWriter: AssociationError
>> >> > [akka.tcp://sparkMaster@master:7077] ->
>> >> > [akka.tcp://spark@slave2:45172]:
>> >> > Error [Association failed with [akka.tcp://spark@slave2:45172]] [
>> >> > akka.remote.EndpointAssociationException: Association failed with
>> >> > [akka.tcp://spark@slave2:45172]
>> >> > Caused by:
>> >> >
>> >> > akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>> >> > Connection refused: slave2/10.3.1.135:45172
>> >> > ]
>> >> > 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
>> >> > disassociated, removing it.
>> >> > 14/07/04 06:29:39 ERROR EndpointWriter: AssociationError
>> >> > [akka.tcp://sparkMaster@master:7077] ->
>> >> > [akka.tcp://spark@slave2:45172]:
>> >> > Error [Association failed with [akka.tcp://spark@slave2:45172]] [
>> >> > akka.remote.EndpointAssociationException: Association failed with
>> >> > [akka.tcp://spark@slave2:45172]
>> >> > Caused by:
>> >> >
>> >> > akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>> >> > Connection refused: slave2/10.3.1.135:45172
>> >> > ]
>> >> > 14/07/04 06:29:39 ERROR EndpointWriter: AssociationError
>> >> > [akka.tcp://sparkMaster@master:7077] ->
>> >> > [akka.tcp://spark@slave2:45172]:
>> >> > Error [Association failed with [akka.tcp://spark@slave2:45172]] [
>> >> > akka.remote.EndpointAssociationException: Association failed with
>> >> > [akka.tcp://spark@slave2:45172]
>> >> > Caused by:
>> >> >
>> >> > akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>> >> > Connection refused: slave2/10.3.1.135:45172
>> >> > ]
>> >> > 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
>> >> > disassociated, removing it.
>> >> > 14/07/04 06:29:40 WARN Master: Got status update for unknown executor
>> >> > app-20140704062238-0033/5
>> >> > 14/07/04 06:29:40 WARN Master: Got status update for unknown executor
>> >> > app-20140704062238-0033/4
>> >> >
>> >> >
>> >> > Coincidentally, after the initial executor failed, each following
>> >> > executor
>> >> > that was re-spawned failed with the following logs:
>> >> > (e.g. the following was from
>> >> > slave1:~/spark-1.0.1-rc1/work/app-20140704062238-0033/2/stderr)
>> >> >
>> >> > log4j:WARN No appenders could be found for logger
>> >> > (org.apache.hadoop.conf.Configuration).
>> >> > log4j:WARN Please initialize the log4j system properly.
>> >> > log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig
>> >> > for
>> >> > more info.
>> >> > 14/07/04 06:29:35 INFO SparkHadoopUtil: Using Spark's default log4j
>> >> > profile:
>> >> > org/apache/spark/log4j-defaults.properties
>> >> > 14/07/04 06:29:35 INFO SecurityManager: Changing view acls to: user1
>> >> > 14/07/04 06:29:35 INFO SecurityManager: SecurityManager:
>> >> > authentication
>> >> > disabled; ui acls disabled; users with view permissions: Set(user1)
>> >> > 14/07/04 06:29:35 INFO Slf4jLogger: Slf4jLogger started
>> >> > 14/07/04 06:29:35 INFO Remoting: Starting remoting
>> >> > 14/07/04 06:29:36 INFO Remoting: Remoting started; listening on
>> >> > addresses
>> >> > :[akka.tcp://sparkExecutor@slave1:54782]
>> >> > 14/07/04 06:29:36 INFO Remoting: Remoting now listens on addresses:
>> >> > [akka.tcp://sparkExecutor@slave1:54782]
>> >> > 14/07/04 06:29:36 INFO CoarseGrainedExecutorBackend: Connecting to
>> >> > driver:
>> >> > akka.tcp://spark@master:45172/user/CoarseGrainedScheduler
>> >> > 14/07/04 06:29:36 INFO WorkerWatcher: Connecting to worker
>> >> > akka.tcp://sparkWorker@slave1:40182/user/Worker
>> >> > 14/07/04 06:29:36 INFO WorkerWatcher: Successfully connected to
>> >> > akka.tcp://sparkWorker@slave1:40182/user/Worker
>> >> > 14/07/04 06:29:36 INFO CoarseGrainedExecutorBackend: Successfully
>> >> > registered
>> >> > with driver
>> >> > 14/07/04 06:29:36 INFO SecurityManager: Changing view acls to: user1
>> >> > 14/07/04 06:29:36 INFO SecurityManager: SecurityManager:
>> >> > authentication
>> >> > disabled; ui acls disabled; users with view permissions: Set(user1)
>> >> > 14/07/04 06:29:36 INFO Slf4jLogger: Slf4jLogger started
>> >> > 14/07/04 06:29:36 INFO Remoting: Starting remoting
>> >> > 14/07/04 06:29:36 INFO Remoting: Remoting started; listening on
>> >> > addresses
>> >> > :[akka.tcp://spark@slave1:39753]
>> >> > 14/07/04 06:29:36 INFO SparkEnv: Connecting to MapOutputTracker:
>> >> > akka.tcp://spark@master:45172/user/MapOutputTracker
>> >> > 14/07/04 06:29:36 INFO SparkEnv: Connecting to BlockManagerMaster:
>> >> > akka.tcp://spark@master:45172/user/BlockManagerMaster
>> >> > 14/07/04 06:29:36 INFO DiskBlockManager: Created local directory at
>> >> > /tmp/spark-local-20140704062936-6123
>> >> > 14/07/04 06:29:36 INFO MemoryStore: MemoryStore started with capacity
>> >> > 6.7
>> >> > GB.
>> >> > 14/07/04 06:29:36 INFO ConnectionManager: Bound socket to port 50960
>> >> > with id
>> >> > = ConnectionManagerId(slave1,50960)
>> >> > 14/07/04 06:29:36 INFO BlockManagerMaster: Trying to register
>> >> > BlockManager
>> >> > 14/07/04 06:29:36 INFO BlockManagerMaster: Registered BlockManager
>> >> > 14/07/04 06:29:36 INFO HttpFileServer: HTTP File server directory is
>> >> > /tmp/spark-42c2782f-60f8-45a7-9e11-c789fc87fe2e
>> >> > 14/07/04 06:29:36 INFO HttpServer: Starting HTTP Server
>> >> > 14/07/04 06:29:36 ERROR CoarseGrainedExecutorBackend: Driver
>> >> > Disassociated
>> >> > [akka.tcp://sparkExecutor@slave1:54782] ->
>> >> > [akka.tcp://spark@master:45172]
>> >> > disassociated! Shutting down.
>> >> >
>> >> > In case of the initial executor that successfully started, the
>> >> > corresponding
>> >> > log messages (from
>> >> > spark-1.0.1-rc1/work/app-20140704062238-0033/1/stderr) on
>> >> > the executor were:
>> >> > log4j:WARN No appenders could be found for logger
>> >> > (org.apache.hadoop.conf.Configuration).
>> >> > log4j:WARN Please initialize the log4j system properly.
>> >> > log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig
>> >> > for
>> >> > more info.
>> >> > 14/07/04 06:22:39 INFO SparkHadoopUtil: Using Spark's default log4j
>> >> > profile:
>> >> > org/apache/spark/log4j-defaults.properties
>> >> > 14/07/04 06:22:39 INFO SecurityManager: Changing view acls to: user1
>> >> > 14/07/04 06:22:39 INFO SecurityManager: SecurityManager:
>> >> > authentication
>> >> > disabled; ui acls disabled; users with view permissions: Set(user1)
>> >> > 14/07/04 06:22:39 INFO Slf4jLogger: Slf4jLogger started
>> >> > 14/07/04 06:22:39 INFO Remoting: Starting remoting
>> >> > 14/07/04 06:22:39 INFO Remoting: Remoting started; listening on
>> >> > addresses
>> >> > :[akka.tcp://sparkExecutor@slave1:50806]
>> >> > 14/07/04 06:22:39 INFO Remoting: Remoting now listens on addresses:
>> >> > [akka.tcp://sparkExecutor@slave1:50806]
>> >> > 14/07/04 06:22:39 INFO CoarseGrainedExecutorBackend: Connecting to
>> >> > driver:
>> >> > akka.tcp://spark@master:45172/user/CoarseGrainedScheduler
>> >> > 14/07/04 06:22:39 INFO WorkerWatcher: Connecting to worker
>> >> > akka.tcp://sparkWorker@slave1:40182/user/Worker
>> >> > 14/07/04 06:22:39 INFO WorkerWatcher: Successfully connected to
>> >> > akka.tcp://sparkWorker@slave1:40182/user/Worker
>> >> > 14/07/04 06:22:40 INFO CoarseGrainedExecutorBackend: Successfully
>> >> > registered
>> >> > with driver
>> >> > 14/07/04 06:22:40 INFO SecurityManager: Changing view acls to: user1
>> >> > 14/07/04 06:22:40 INFO SecurityManager: SecurityManager:
>> >> > authentication
>> >> > disabled; ui acls disabled; users with view permissions: Set(user1)
>> >> > 14/07/04 06:22:40 INFO Slf4jLogger: Slf4jLogger started
>> >> > 14/07/04 06:22:40 INFO Remoting: Starting remoting
>> >> > 14/07/04 06:22:40 INFO Remoting: Remoting started; listening on
>> >> > addresses
>> >> > :[akka.tcp://spark@slave1:38558]
>> >> > 14/07/04 06:22:40 INFO SparkEnv: Connecting to MapOutputTracker:
>> >> > akka.tcp://spark@master:45172/user/MapOutputTracker
>> >> > 14/07/04 06:22:40 INFO SparkEnv: Connecting to BlockManagerMaster:
>> >> > akka.tcp://spark@master:45172/user/BlockManagerMaster
>> >> > 14/07/04 06:22:40 INFO DiskBlockManager: Created local directory at
>> >> > /tmp/spark-local-20140704062240-6a65
>> >> > 14/07/04 06:22:40 INFO MemoryStore: MemoryStore started with capacity
>> >> > 6.7
>> >> > GB.
>> >> > 14/07/04 06:22:40 INFO ConnectionManager: Bound socket to port 46901
>> >> > with id
>> >> > = ConnectionManagerId(slave1,46901)
>> >> > 14/07/04 06:22:40 INFO BlockManagerMaster: Trying to register
>> >> > BlockManager
>> >> > 14/07/04 06:22:40 INFO BlockManagerMaster: Registered BlockManager
>> >> > 14/07/04 06:22:40 INFO HttpFileServer: HTTP File server directory is
>> >> > /tmp/spark-9eba78f9-8ae9-477c-9338-7222ae6fe306
>> >> > 14/07/04 06:22:40 INFO HttpServer: Starting HTTP Server
>> >> > 14/07/04 06:22:42 INFO CoarseGrainedExecutorBackend: Got assigned
>> >> > task 0
>> >> > 14/07/04 06:22:42 INFO Executor: Running task ID 0
>> >> > 14/07/04 06:22:42 INFO CoarseGrainedExecutorBackend: Got assigned
>> >> > task 2
>> >> > 14/07/04 06:22:42 INFO Executor: Running task ID 2
>> >> > ...
>> >> >
>> >> >
>> >> >
>> >> > On Fri, Jul 4, 2014 at 5:52 AM, Xiangrui Meng <me...@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> The feature dimension is small. You don't need a big akka.frameSize.
>> >> >> The default one (10M) should be sufficient. Did you cache the data
>> >> >> before calling LRWithSGD?
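>> >> >>
>> >> >> For example (a sketch; toLabeledPoint stands for whatever parses a
>> >> >> raw record into a LabeledPoint):
>> >> >>
>> >> >>     val points = rawRecords.map(toLabeledPoint).cache()
>> >> >>     points.count()  // materialize the cache once, before training
>> >> >>     val model = new LogisticRegressionWithSGD(1, maxIterations, 0.0, 0.05).run(points)
>> >> >>
>> >> >> -Xiangrui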
>> >> >>
>> >> >> On Thu, Jul 3, 2014 at 10:02 AM, Bharath Ravi Kumar
>> >> >> <re...@gmail.com>
>> >> >> wrote:
>> >> >> > I tried another run after setting the driver memory to 8G (and
>> >> >> > spark.akka.frameSize = 500 on the executors and the driver). In
>> >> >> > addition, I also tried to reduce the amount of data that a single
>> >> >> > task processes, by increasing the number of partitions (of the
>> >> >> > labeled points) to 120 (instead of 2 used earlier), and then
>> >> >> > setting max cores to 2. That made no difference since, at the end
>> >> >> > of 120 tasks, the familiar error message appeared on a slave:
>> >> >> >
>> >> >> > <snipped earlier logs>
>> >> >> > 14/07/03 16:18:48 INFO CoarseGrainedExecutorBackend: Got assigned task 1436
>> >> >> > 14/07/03 16:18:48 INFO Executor: Running task ID 1436
>> >> >> > 14/07/03 16:18:53 INFO HadoopRDD: Input split:
>> >> >> > file:~//2014-05-24-02/part-r-00014:0+2215337
>> >> >> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
>> >> >> > file:~//2014-05-24-02/part-r-00014:2215337+2215338
>> >> >> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
>> >> >> > file:~//2014-05-24-02/part-r-00003:0+2196429
>> >> >> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
>> >> >> > file:~//2014-05-24-02/part-r-00003:2196429+2196430
>> >> >> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
>> >> >> > file:~//2014-05-24-02/part-r-00010:0+2186751
>> >> >> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
>> >> >> > file:~//2014-05-24-02/part-r-00010:2186751+2186751
>> >> >> > 14/07/03 16:18:54 INFO Executor: Serialized size of result for 1436 is 5958822
>> >> >> > 14/07/03 16:18:54 INFO Executor: Sending result for 1436 directly to driver
>> >> >> > 14/07/03 16:18:54 INFO Executor: Finished task ID 1436
>> >> >> > 14/07/03 16:18:54 INFO CoarseGrainedExecutorBackend: Got assigned task 1438
>> >> >> > 14/07/03 16:18:54 INFO Executor: Running task ID 1438
>> >> >> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
>> >> >> > file:~//2014-05-24-02/part-r-00004:0+2209615
>> >> >> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
>> >> >> > file:~//2014-05-24-02/part-r-00004:2209615+2209616
>> >> >> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
>> >> >> > file:~//2014-05-24-02/part-r-00011:0+2202240
>> >> >> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
>> >> >> > file:~//2014-05-24-02/part-r-00011:2202240+2202240
>> >> >> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
>> >> >> > file:~//2014-05-24-02/part-r-00009:0+2194423
>> >> >> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
>> >> >> > file:~//2014-05-24-02/part-r-00009:2194423+2194424
>> >> >> > 14/07/03 16:19:00 INFO Executor: Serialized size of result for 1438 is 5958822
>> >> >> > 14/07/03 16:19:00 INFO Executor: Sending result for 1438 directly to driver
>> >> >> > 14/07/03 16:19:00 INFO Executor: Finished task ID 1438
>> >> >> > 14/07/03 16:19:14 ERROR CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@slave1:51099] -> [akka.tcp://spark@master:58272] disassociated! Shutting down.
>> >> >> >
>> >> >> >
>> >> >> > The corresponding master logs were:
>> >> >> >
>> >> >> > 14/07/03 16:02:14 INFO Master: Registering app LogRegExp
>> >> >> > 14/07/03 16:02:14 INFO Master: Registered app LogRegExp with ID app-20140703160214-0028
>> >> >> > 14/07/03 16:02:14 INFO Master: Launching executor app-20140703160214-0028/1 on worker worker-20140630124441-slave1-40182
>> >> >> > 14/07/03 16:19:15 INFO Master: Removing executor app-20140703160214-0028/1 because it is EXITED
>> >> >> > 14/07/03 16:19:15 INFO Master: Launching executor app-20140703160214-0028/2 on worker worker-20140630124441-slave1-40182
>> >> >> > 14/07/03 16:19:15 INFO Master: Removing executor app-20140703160214-0028/0 because it is EXITED
>> >> >> > 14/07/03 16:19:15 INFO Master: Launching executor app-20140703160214-0028/3 on worker worker-20140630102913-slave2-44735
>> >> >> > 14/07/03 16:19:18 INFO Master: Removing executor app-20140703160214-0028/2 because it is EXITED
>> >> >> > 14/07/03 16:19:18 INFO Master: Launching executor app-20140703160214-0028/4 on worker worker-20140630124441-slave1-40182
>> >> >> > 14/07/03 16:19:18 INFO Master: Removing executor app-20140703160214-0028/3 because it is EXITED
>> >> >> > 14/07/03 16:19:18 INFO Master: Launching executor app-20140703160214-0028/5 on worker worker-20140630102913-slave2-44735
>> >> >> > 14/07/03 16:19:20 INFO Master: akka.tcp://spark@master:58272 got disassociated, removing it.
>> >> >> > 14/07/03 16:19:20 INFO Master: Removing app app-20140703160214-0028
>> >> >> > 14/07/03 16:19:20 INFO Master: akka.tcp://spark@master:58272 got disassociated, removing it.
>> >> >> >
>> >> >> >
>> >> >> > Throughout the execution, I confirmed in the UI that driver
>> >> >> > memory used was 0.0 B / 6.9 GB and each executor's memory showed
>> >> >> > 0.0 B / 12.1 GB even when aggregate was being executed. On a
>> >> >> > related note, I noticed in the executors tab that just before the
>> >> >> > entire job terminated, executors on slave1, slave2 and the driver
>> >> >> > "disappeared" momentarily from the active executors list. The
>> >> >> > replacement executors on slave1 and slave2 were re-spawned a
>> >> >> > couple of times and appeared on the executors list again before
>> >> >> > they too died and the job failed.
>> >> >> > So it appears that no matter what the task input-result size, the
>> >> >> > execution fails at the end of the stage corresponding to
>> >> >> > GradientDescent.aggregate (and the preceding count() in
>> >> >> > GradientDescent goes through fine). Let me know if you need any
>> >> >> > additional information.
>> >> >> >
>> >> >> >
>> >> >> > On Thu, Jul 3, 2014 at 12:27 PM, Xiangrui Meng <me...@gmail.com>
>> >> >> > wrote:
>> >> >> >>
>> >> >> >> Could you check the driver memory in the executor tab of the
>> >> >> >> Spark
>> >> >> >> UI
>> >> >> >> when the job is running? If it is too small, please set
>> >> >> >> --driver-memory with spark-submit, e.g. 10g. Could you also
>> >> >> >> attach
>> >> >> >> the
>> >> >> >> master log under spark/logs as well? -Xiangrui
>> >> >> >>
>> >> >> >> On Wed, Jul 2, 2014 at 9:34 AM, Bharath Ravi Kumar
>> >> >> >> <re...@gmail.com>
>> >> >> >> wrote:
>> >> >> >> > Hi Xiangrui,
>> >> >> >> >
>> >> >> >> > The issue with aggregating/counting over large feature vectors
>> >> >> >> > (as part of LogisticRegressionWithSGD) continues to exist, but
>> >> >> >> > now in another form: while the execution doesn't freeze (due to
>> >> >> >> > SPARK-1112), it now fails at the second or third gradient
>> >> >> >> > descent iteration consistently with an error level log message,
>> >> >> >> > but no stacktrace. I'm running against 1.0.1-rc1, and have tried
>> >> >> >> > setting spark.akka.frameSize as high as 500. When the execution
>> >> >> >> > fails, each of the two executors log the following message
>> >> >> >> > (corresponding to aggregate at GradientDescent.scala:178) :
>> >> >> >> >
>> >> >> >> > 14/07/02 14:09:09 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
>> >> >> >> > 14/07/02 14:09:09 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
>> >> >> >> > 14/07/02 14:09:09 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 1 remote fetches in 0 ms
>> >> >> >> > 14/07/02 14:09:11 INFO Executor: Serialized size of result for 737 is 5959086
>> >> >> >> > 14/07/02 14:09:11 INFO Executor: Sending result for 737 directly to driver
>> >> >> >> > 14/07/02 14:09:11 INFO Executor: Finished task ID 737
>> >> >> >> > 14/07/02 14:09:18 ERROR CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@(slave1,slave2):51941] -> [akka.tcp://spark@master:59487] disassociated! Shutting down.
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > There is no separate stacktrace on the driver side.
>> >> >> >> >
>> >> >> >> > Each input record is of the form p1, p2, (p1,p2) where p1, p2
>> >> >> >> > & (p1,p2) are categorical features with large cardinality, and
>> >> >> >> > X is the double label with a continuous value. The categorical
>> >> >> >> > variables are converted to binary variables which results in a
>> >> >> >> > feature vector of size 741092 (composed of all unique
>> >> >> >> > categories across p1, p2 and (p1,p2)). Thus, the labeled point
>> >> >> >> > for an input record is a sparse vector of size 741092 with only
>> >> >> >> > 3 variables set in the record. The total number of records is
>> >> >> >> > 683233 after aggregating the input data on (p1, p2). When
>> >> >> >> > attempting to train on the unaggregated records (1337907 in
>> >> >> >> > number spread across 455 files), the execution fails at count,
>> >> >> >> > GradientDescent.scala:161 with the following log
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > (Snipped lines corresponding to other input files)
>> >> >> >> > 14/07/02 16:02:03 INFO HadoopRDD: Input split:
>> >> >> >> > file:~/part-r-00012:2834590+2834590
>> >> >> >> > 14/07/02 16:02:03 INFO HadoopRDD: Input split:
>> >> >> >> > file:~/part-r-00005:0+2845559
>> >> >> >> > 14/07/02 16:02:03 INFO HadoopRDD: Input split:
>> >> >> >> > file:~/part-r-00005:2845559+2845560
>> >> >> >> > 14/07/02 16:02:03 INFO Executor: Serialized size of result for 726 is 615
>> >> >> >> > 14/07/02 16:02:03 INFO Executor: Sending result for 726 directly to driver
>> >> >> >> > 14/07/02 16:02:03 INFO Executor: Finished task ID 726
>> >> >> >> > 14/07/02 16:02:12 ERROR CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@slave1:48423] -> [akka.tcp://spark@master:55792] disassociated! Shutting down.
>> >> >> >> >
>> >> >> >> > A count() attempted on the input RDD before beginning training
>> >> >> >> > has the following metrics:
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > Metric            Min        25th    Median    75th     Max
>> >> >> >> >
>> >> >> >> > Result
>> >> >> >> > serialization
>> >> >> >> > time            0 ms    0 ms    0 ms    0 ms    0 ms
>> >> >> >> >
>> >> >> >> > Duration        33 s    33 s    35 s    35 s    35 s
>> >> >> >> >
>> >> >> >> > Time spent
>> >> >> >> > fetching task
>> >> >> >> > results            0 ms    0 ms    0 ms    0 ms    0 ms
>> >> >> >> >
>> >> >> >> > Scheduler
>> >> >> >> > delay            0.1 s    0.1 s    0.3 s    0.3 s    0.3 s
>> >> >> >> >
>> >> >> >> > Aggregated Metrics by Executor
>> >> >> >> >
>> >> >> >> > ID    Address                Task Time    Total    Failed    Succeeded    Shuffle Read    Shuffle Write    Shuf Spill (Mem)    Shuf Spill (Disk)
>> >> >> >> > 0     CANNOT FIND ADDRESS    34 s         1        0         1            0.0 B           0.0 B            0.0 B               0.0 B
>> >> >> >> > 1     CANNOT FIND ADDRESS    36 s         1        0         1            0.0 B           0.0 B            0.0 B               0.0 B
>> >> >> >> >
>> >> >> >> > Tasks
>> >> >> >> >
>> >> >> >> > Task Index    Task ID    Status     Locality Level    Executor    Launch Time            Duration    GC Time    Result Ser Time    Errors
>> >> >> >> > 0             726        SUCCESS    PROCESS_LOCAL     slave1      2014/07/02 16:01:28    35 s        0.1 s
>> >> >> >> > 1             727        SUCCESS    PROCESS_LOCAL     slave2      2014/07/02 16:01:28    33 s        99 ms
>> >> >> >> >
>> >> >> >> > Any pointers / diagnosis please?
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > On Thu, Jun 19, 2014 at 10:03 AM, Bharath Ravi Kumar
>> >> >> >> > <re...@gmail.com>
>> >> >> >> > wrote:
>> >> >> >> >>
>> >> >> >> >> Thanks. I'll await the fix to re-run my test.
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> On Thu, Jun 19, 2014 at 8:28 AM, Xiangrui Meng
>> >> >> >> >> <me...@gmail.com>
>> >> >> >> >> wrote:
>> >> >> >> >>>
>> >> >> >> >>> Hi Bharath,
>> >> >> >> >>>
>> >> >> >> >>> This is related to SPARK-1112, which we already found the
>> >> >> >> >>> root
>> >> >> >> >>> cause.
>> >> >> >> >>> I will let you know when this is fixed.
>> >> >> >> >>>
>> >> >> >> >>> Best,
>> >> >> >> >>> Xiangrui
>> >> >> >> >>>
>> >> >> >> >>> On Tue, Jun 17, 2014 at 7:37 PM, Bharath Ravi Kumar
>> >> >> >> >>> <re...@gmail.com>
>> >> >> >> >>> wrote:
>> >> >> >> >>> > Couple more points:
>> >> >> >> >>> > 1) The inexplicable stalling of execution with large
>> >> >> >> >>> > feature sets appears similar to that reported with the
>> >> >> >> >>> > news-20 dataset:
>> >> >> >> >>> > http://mail-archives.apache.org/mod_mbox/spark-user/201406.mbox/%3C53A03542.1010605@gmail.com%3E
>> >> >> >> >>> >
>> >> >> >> >>> > 2) The NPE when trying to call mapToPair to convert an
>> >> >> >> >>> > RDD<Long, Long, Integer, Integer> into a
>> >> >> >> >>> > JavaPairRDD<Tuple2<Long,Long>, Tuple2<Integer,Integer>>
>> >> >> >> >>> > is unrelated to mllib.
>> >> >> >> >>> >
>> >> >> >> >>> > Thanks,
>> >> >> >> >>> > Bharath
>> >> >> >> >>> >
>> >> >> >> >>> >
>> >> >> >> >>> >
>> >> >> >> >>> > On Wed, Jun 18, 2014 at 7:14 AM, Bharath Ravi Kumar
>> >> >> >> >>> > <re...@gmail.com>
>> >> >> >> >>> > wrote:
>> >> >> >> >>> >>
>> >> >> >> >>> >> Hi  Xiangrui ,
>> >> >> >> >>> >>
>> >> >> >> >>> >> I'm using 1.0.0.
>> >> >> >> >>> >>
>> >> >> >> >>> >> Thanks,
>> >> >> >> >>> >> Bharath
>> >> >> >> >>> >>
>> >> >> >> >>> >> On 18-Jun-2014 1:43 am, "Xiangrui Meng" <me...@gmail.com>
>> >> >> >> >>> >> wrote:
>> >> >> >> >>> >>>
>> >> >> >> >>> >>> Hi Bharath,
>> >> >> >> >>> >>>
>> >> >> >> >>> >>> Thanks for posting the details! Which Spark version are
>> >> >> >> >>> >>> you
>> >> >> >> >>> >>> using?
>> >> >> >> >>> >>>
>> >> >> >> >>> >>> Best,
>> >> >> >> >>> >>> Xiangrui
>> >> >> >> >>> >>>
>> >> >> >> >>> >>> On Tue, Jun 17, 2014 at 6:48 AM, Bharath Ravi Kumar
>> >> >> >> >>> >>> <re...@gmail.com>
>> >> >> >> >>> >>> wrote:
>> >> >> >> >>> >>> > Hi,
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > (Apologies for the long mail, but it's necessary to
>> >> >> >> >>> >>> > provide sufficient details considering the number of
>> >> >> >> >>> >>> > issues faced.)
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > I'm running into issues testing LogisticRegressionWithSGD
>> >> >> >> >>> >>> > on a two node cluster (each node with 24 cores and 16G
>> >> >> >> >>> >>> > available to slaves out of 24G on the system). Here's a
>> >> >> >> >>> >>> > description of the application:
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > The model is being trained based on categorical
>> >> >> >> >>> >>> > features x, y, and (x,y). The categorical features are
>> >> >> >> >>> >>> > mapped to binary features by converting each distinct
>> >> >> >> >>> >>> > value in the category enum into a binary feature by
>> >> >> >> >>> >>> > itself (i.e. presence of that value in a record implies
>> >> >> >> >>> >>> > corresponding feature = 1, else feature = 0. So, there'd
>> >> >> >> >>> >>> > be as many distinct features as enum values). The
>> >> >> >> >>> >>> > training vector is laid out as
>> >> >> >> >>> >>> > [x1,x2...xn,y1,y2....yn,(x1,y1),(x2,y2)...(xn,yn)].
>> >> >> >> >>> >>> > Each record in the training data has only one
>> >> >> >> >>> >>> > combination (Xk,Yk) and a label appearing in the record.
>> >> >> >> >>> >>> > Thus, the corresponding LabeledPoint sparse vector would
>> >> >> >> >>> >>> > only have 3 values Xk, Yk, (Xk,Yk) set for a record. The
>> >> >> >> >>> >>> > total length of the vector (though sparse) would be
>> >> >> >> >>> >>> > nearly 614000. The number of records is about 1.33
>> >> >> >> >>> >>> > million. The records have been coalesced into 20
>> >> >> >> >>> >>> > partitions across two nodes. The input data has not been
>> >> >> >> >>> >>> > cached.
>> >> >> >> >>> >>> > (NOTE: I do realize the records & features may seem
>> >> >> >> >>> >>> > large for a two node setup, but given the memory & cpu,
>> >> >> >> >>> >>> > and the fact that I'm willing to give up some turnaround
>> >> >> >> >>> >>> > time, I don't see why tasks should inexplicably fail)
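>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > Concretely, each record maps to a sparse vector roughly
>> >> >> >> >>> >>> > as follows (a sketch; xIndex, yIndex, xyIndex, nX and nY
>> >> >> >> >>> >>> > stand for the per-category offsets described above):
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >     import org.apache.spark.mllib.linalg.Vectors
>> >> >> >> >>> >>> >     import org.apache.spark.mllib.regression.LabeledPoint
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >     // exactly three indices are set per record
>> >> >> >> >>> >>> >     val indices = Array(xIndex, nX + yIndex, nX + nY + xyIndex)
>> >> >> >> >>> >>> >     val point = LabeledPoint(label,
>> >> >> >> >>> >>> >       Vectors.sparse(614000, indices, Array(1.0, 1.0, 1.0)))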
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > Additional parameters include:
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > spark.executor.memory = 14G
>> >> >> >> >>> >>> > spark.default.parallelism = 1
>> >> >> >> >>> >>> > spark.cores.max=20
>> >> >> >> >>> >>> > spark.storage.memoryFraction=0.8 //No cache space required
>> >> >> >> >>> >>> > (Trying to set spark.akka.frameSize to a larger number,
>> >> >> >> >>> >>> > say, 20 didn't help either)
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > The model training was initialized as:
>> >> >> >> >>> >>> > new LogisticRegressionWithSGD(1, maxIterations, 0.0, 0.05)
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > However, after 4 iterations of gradient descent, the
>> >> >> >> >>> >>> > entire execution appeared to stall inexplicably. The
>> >> >> >> >>> >>> > corresponding executor details and details of the
>> >> >> >> >>> >>> > stalled stage (number 14) are as follows:
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > Metric                            Min      25th     Median   75th     Max
>> >> >> >> >>> >>> > Result serialization time         12 ms    13 ms    14 ms    16 ms    18 ms
>> >> >> >> >>> >>> > Duration                          4 s      4 s      5 s      5 s      5 s
>> >> >> >> >>> >>> > Time spent fetching task results  0 ms     0 ms     0 ms     0 ms     0 ms
>> >> >> >> >>> >>> > Scheduler delay                   6 s      6 s      6 s      6 s      12 s
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > Stage Id
>> >> >> >> >>> >>> > 14 aggregate at GradientDescent.scala:178
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > Task Index    Task ID    Status    Locality Level    Executor    Launch Time    Duration    GC Time    Result Ser Time    Errors
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > 0     600     RUNNING     PROCESS_LOCAL     serious.dataone.foo.bar.com     2014/06/17 10:32:27     1.1 h
>> >> >> >> >>> >>> > 1     601     RUNNING     PROCESS_LOCAL     casual.dataone.foo.bar.com      2014/06/17 10:32:27     1.1 h
>> >> >> >> >>> >>> > 2     602     RUNNING     PROCESS_LOCAL     serious.dataone.foo.bar.com     2014/06/17 10:32:27     1.1 h
>> >> >> >> >>> >>> > 3     603     RUNNING     PROCESS_LOCAL     casual.dataone.foo.bar.com      2014/06/17 10:32:27     1.1 h
>> >> >> >> >>> >>> > 4     604     RUNNING     PROCESS_LOCAL     serious.dataone.foo.bar.com     2014/06/17 10:32:27     1.1 h
>> >> >> >> >>> >>> > 5     605     SUCCESS     PROCESS_LOCAL     casual.dataone.foo.bar.com      2014/06/17 10:32:27     4 s     2 s     12 ms
>> >> >> >> >>> >>> > 6     606     SUCCESS     PROCESS_LOCAL     serious.dataone.foo.bar.com     2014/06/17 10:32:27     4 s     1 s     14 ms
>> >> >> >> >>> >>> > 7     607     SUCCESS     PROCESS_LOCAL     casual.dataone.foo.bar.com      2014/06/17 10:32:27     4 s     2 s     12 ms
>> >> >> >> >>> >>> > 8     608     SUCCESS     PROCESS_LOCAL     serious.dataone.foo.bar.com     2014/06/17 10:32:27     5 s     1 s     15 ms
>> >> >> >> >>> >>> > 9     609     SUCCESS     PROCESS_LOCAL     casual.dataone.foo.bar.com      2014/06/17 10:32:27     5 s     1 s     14 ms
>> >> >> >> >>> >>> > 10    610     SUCCESS     PROCESS_LOCAL     serious.dataone.foo.bar.com     2014/06/17 10:32:27     5 s     1 s     15 ms
>> >> >> >> >>> >>> > 11    611     SUCCESS     PROCESS_LOCAL     casual.dataone.foo.bar.com      2014/06/17 10:32:27     4 s     1 s     13 ms
>> >> >> >> >>> >>> > 12    612     SUCCESS     PROCESS_LOCAL     serious.dataone.foo.bar.com     2014/06/17 10:32:27     5 s     1 s     18 ms
>> >> >> >> >>> >>> > 13    613     SUCCESS     PROCESS_LOCAL     casual.dataone.foo.bar.com      2014/06/17 10:32:27     5 s     1 s     13 ms
>> >> >> >> >>> >>> > 14    614     SUCCESS     PROCESS_LOCAL     serious.dataone.foo.bar.com     2014/06/17 10:32:27     4 s     1 s     14 ms
>> >> >> >> >>> >>> > 15    615     SUCCESS     PROCESS_LOCAL     casual.dataone.foo.bar.com      2014/06/17 10:32:27     4 s     1 s     12 ms
>> >> >> >> >>> >>> > 16    616     SUCCESS     PROCESS_LOCAL     serious.dataone.foo.bar.com     2014/06/17 10:32:27     5 s     1 s     15 ms
>> >> >> >> >>> >>> > 17    617     SUCCESS     PROCESS_LOCAL     casual.dataone.foo.bar.com      2014/06/17 10:32:27     5 s     1 s     18 ms
>> >> >> >> >>> >>> > 18    618     SUCCESS     PROCESS_LOCAL     serious.dataone.foo.bar.com     2014/06/17 10:32:27     5 s     1 s     16 ms
>> >> >> >> >>> >>> > 19    619     SUCCESS     PROCESS_LOCAL     casual.dataone.foo.bar.com      2014/06/17 10:32:27     4 s     1 s     18 ms
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > Executor stats:
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > RDD Blocks    Memory Used       Disk Used    Active Tasks    Failed Tasks    Complete Tasks    Total Tasks    Task Time    Shuffle Read    Shuffle Write
>> >> >> >> >>> >>> > 0             0.0 B / 6.7 GB    0.0 B        2               0               307               309            23.2 m       0.0 B           0.0 B
>> >> >> >> >>> >>> > 0             0.0 B / 6.7 GB    0.0 B        3               0               308               311            22.4 m       0.0 B           0.0 B
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > Executor jmap output:
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > Server compiler detected.
>> >> >> >> >>> >>> > JVM version is 24.55-b03
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > using thread-local object allocation.
>> >> >> >> >>> >>> > Parallel GC with 18 thread(s)
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > Heap Configuration:
>> >> >> >> >>> >>> >    MinHeapFreeRatio = 40
>> >> >> >> >>> >>> >    MaxHeapFreeRatio = 70
>> >> >> >> >>> >>> >    MaxHeapSize      = 10737418240 (10240.0MB)
>> >> >> >> >>> >>> >    NewSize          = 1310720 (1.25MB)
>> >> >> >> >>> >>> >    MaxNewSize       = 17592186044415 MB
>> >> >> >> >>> >>> >    OldSize          = 5439488 (5.1875MB)
>> >> >> >> >>> >>> >    NewRatio         = 2
>> >> >> >> >>> >>> >    SurvivorRatio    = 8
>> >> >> >> >>> >>> >    PermSize         = 21757952 (20.75MB)
>> >> >> >> >>> >>> >    MaxPermSize      = 134217728 (128.0MB)
>> >> >> >> >>> >>> >    G1HeapRegionSize = 0 (0.0MB)
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > Heap Usage:
>> >> >> >> >>> >>> > PS Young Generation
>> >> >> >> >>> >>> > Eden Space:
>> >> >> >> >>> >>> >    capacity = 2783969280 (2655.0MB)
>> >> >> >> >>> >>> >    used     = 192583816 (183.66223907470703MB)
>> >> >> >> >>> >>> >    free     = 2591385464 (2471.337760925293MB)
>> >> >> >> >>> >>> >    6.917598458557704% used
>> >> >> >> >>> >>> > From Space:
>> >> >> >> >>> >>> >    capacity = 409993216 (391.0MB)
>> >> >> >> >>> >>> >    used     = 1179808 (1.125152587890625MB)
>> >> >> >> >>> >>> >    free     = 408813408 (389.8748474121094MB)
>> >> >> >> >>> >>> >    0.2877628102022059% used
>> >> >> >> >>> >>> > To Space:
>> >> >> >> >>> >>> >    capacity = 385351680 (367.5MB)
>> >> >> >> >>> >>> >    used     = 0 (0.0MB)
>> >> >> >> >>> >>> >    free     = 385351680 (367.5MB)
>> >> >> >> >>> >>> >    0.0% used
>> >> >> >> >>> >>> > PS Old Generation
>> >> >> >> >>> >>> >    capacity = 7158628352 (6827.0MB)
>> >> >> >> >>> >>> >    used     = 4455093024 (4248.707794189453MB)
>> >> >> >> >>> >>> >    free     = 2703535328 (2578.292205810547MB)
>> >> >> >> >>> >>> >    62.2338918146983% used
>> >> >> >> >>> >>> > PS Perm Generation
>> >> >> >> >>> >>> >    capacity = 90701824 (86.5MB)
>> >> >> >> >>> >>> >    used     = 45348832 (43.248016357421875MB)
>> >> >> >> >>> >>> >    free     = 45352992 (43.251983642578125MB)
>> >> >> >> >>> >>> >    49.99770677158598% used
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > 8432 interned Strings occupying 714672 bytes.
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > Executor GC log snippet:
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > 168.778: [GC [PSYoungGen: 2702831K->578545K(2916864K)] 9302453K->7460857K(9907712K), 0.3193550 secs] [Times: user=5.13 sys=0.39, real=0.32 secs]
>> >> >> >> >>> >>> > 169.097: [Full GC [PSYoungGen: 578545K->0K(2916864K)] [ParOldGen: 6882312K->1073297K(6990848K)] 7460857K->1073297K(9907712K) [PSPermGen: 44248K->44201K(88576K)], 4.5521090 secs] [Times: user=24.22 sys=0.18, real=4.55 secs]
>> >> >> >> >>> >>> > 174.207: [GC [PSYoungGen: 2338304K->81315K(2544128K)] 3411653K->1154665K(9534976K), 0.0966280 secs] [Times: user=1.66 sys=0.00, real=0.09 secs]
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > I tried to map partitions to cores on the nodes. Increasing the
>> >> >> >> >>> >>> > number of partitions (say to 80 or 100) would result in progress
>> >> >> >> >>> >>> > till the 6th iteration or so, but the next stage would stall as
>> >> >> >> >>> >>> > before, with no apparent root cause in the logs. With increased
>> >> >> >> >>> >>> > partitions, the last stage that completed had the following task
>> >> >> >> >>> >>> > times:
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > Metric                              Min      25th     Median   75th     Max
>> >> >> >> >>> >>> > Result serialization time           11 ms    12 ms    13 ms    15 ms    0.4 s
>> >> >> >> >>> >>> > Duration                            0.5 s    0.9 s    1 s      3 s      7 s
>> >> >> >> >>> >>> > Time spent fetching task results    0 ms     0 ms     0 ms     0 ms     0 ms
>> >> >> >> >>> >>> > Scheduler delay                     5 s      6 s      6 s      7 s      12 s
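>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > (For reference, a minimal sketch of the partition tuning described
>> >> >> >> >>> >>> > above; trainingData and the partition count are placeholders, not
>> >> >> >> >>> >>> > the actual application code:)
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > // 2 nodes x 24 cores = 48 cores; start with one partition per core.
>> >> >> >> >>> >>> > // repartition() shuffles to rebalance; coalesce(n) can shrink the
>> >> >> >> >>> >>> > // partition count without a shuffle.
>> >> >> >> >>> >>> > val numPartitions = 48
>> >> >> >> >>> >>> > val balanced = trainingData.repartition(numPartitions)
>> >> >> >> >>> >>> > balanced.count() // materialize once to verify the layout before training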
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > My hypothesis is that as the coefficient array becomes less sparse
>> >> >> >> >>> >>> > (with successive iterations), the cost of the aggregate grows to
>> >> >> >> >>> >>> > the point that the stage stalls (which I haven't been able to
>> >> >> >> >>> >>> > explain). Reducing the batch fraction to a very low number like
>> >> >> >> >>> >>> > 0.01 let the iterations progress further, but the model then
>> >> >> >> >>> >>> > failed to converge after a small number of iterations.
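>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > (To make the batch-fraction experiment concrete, a minimal sketch
>> >> >> >> >>> >>> > using MLlib's static train() helper; labeledPoints, the step size
>> >> >> >> >>> >>> > and the iteration count are placeholders, not the values from
>> >> >> >> >>> >>> > these runs:)
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > // train(input, numIterations, stepSize, miniBatchFraction): with 0.01,
>> >> >> >> >>> >>> > // each iteration samples ~1% of the records for the gradient aggregate,
>> >> >> >> >>> >>> > // shrinking per-iteration work at the cost of noisier updates.
>> >> >> >> >>> >>> > val model = LogisticRegressionWithSGD.train(labeledPoints, 100, 1.0, 0.01)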
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > I also tried reducing the number of records by aggregating on
>> >> >> >> >>> >>> > (x,y) as the key (i.e. training on aggregates instead of on every
>> >> >> >> >>> >>> > raw record), but encountered the following exception (a sketch of
>> >> >> >> >>> >>> > that aggregation follows the stack trace below):
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > Loss was due to java.lang.NullPointerException
>> >> >> >> >>> >>> > java.lang.NullPointerException
>> >> >> >> >>> >>> >         at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
>> >> >> >> >>> >>> >         at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
>> >> >> >> >>> >>> >         at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> >> >> >> >>> >>> >         at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:59)
>> >> >> >> >>> >>> >         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
>> >> >> >> >>> >>> >         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
>> >> >> >> >>> >>> >         at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
>> >> >> >> >>> >>> >         at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
>> >> >> >> >>> >>> >         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> >> >> >> >>> >>> >         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>> >> >> >> >>> >>> >         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>> >> >> >> >>> >>> >         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
>> >> >> >> >>> >>> >         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>> >> >> >> >>> >>> >         at org.apache.spark.scheduler.Task.run(Task.scala:51)
>> >> >> >> >>> >>> >         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>> >> >> >> >>> >>> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >> >> >> >>> >>> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >> >> >> >>> >>> >         at java.lang.Thread.run(Thread.java:745)
>> >> >> >> >>> >>> >
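>> >> >> >> >>> >>> > (For reference, a minimal sketch of the aggregation that was
>> >> >> >> >>> >>> > attempted; rawRecords and the field names are placeholders, not
>> >> >> >> >>> >>> > the actual application code:)
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > import org.apache.spark.SparkContext._ // pair-RDD functions such as reduceByKey
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > // Key each raw record by its categorical (x, y) pair and sum per-key
>> >> >> >> >>> >>> > // counts, so training sees one aggregated record per distinct pair
>> >> >> >> >>> >>> > // instead of every raw record.
>> >> >> >> >>> >>> > val aggregated = rawRecords
>> >> >> >> >>> >>> >   .map(r => ((r.x, r.y), 1L))
>> >> >> >> >>> >>> >   .reduceByKey(_ + _)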
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > I'd appreciate any insights/comments about what may be causing the
>> >> >> >> >>> >>> > execution to stall.
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > If logs/tables appear poorly indented in the email, here's a gist
>> >> >> >> >>> >>> > with the relevant details:
>> >> >> >> >>> >>> > https://gist.github.com/reachbach/a418ab2f01b639b624c1
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > Thanks,
>> >> >> >> >>> >>> > Bharath
>> >> >> >> >>> >
>> >> >> >> >>> >
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >
>> >> >> >
>> >> >> >
>> >> >
>> >> >
>> >
>> >
>
>

Re: Execution stalls in LogisticRegressionWithSGD

Posted by Bharath Ravi Kumar <re...@gmail.com>.
Xiangrui,

Thanks for all the help in resolving this issue. The cause turned out to be
the build environment rather than the runtime configuration: the build
process had picked up Maven 2 while building Spark. With binaries rebuilt
using Maven 3, the entire processing went through fine. While I'm aware that
the build instruction page specifies Maven 3 as the minimum requirement,
declaratively preventing accidental Maven 2 usage (e.g. through something
like the maven enforcer plugin?) might help other developers avoid such
issues.
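For illustration, a rule along these lines in the parent pom could fail the
build early under Maven 2. This is only a sketch: the plugin version and the
minimum Maven version bound are assumptions that would need to match the
documented build requirements.

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-enforcer-plugin</artifactId>
  <version>1.3.1</version>
  <executions>
    <execution>
      <id>enforce-maven-3</id>
      <goals><goal>enforce</goal></goals>
      <configuration>
        <rules>
          <requireMavenVersion>
            <!-- Reject builds run with a Maven older than 3.x -->
            <version>[3.0.4,)</version>
          </requireMavenVersion>
        </rules>
      </configuration>
    </execution>
  </executions>
</plugin>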

-Bharath



On Mon, Jul 7, 2014 at 9:43 PM, Xiangrui Meng <me...@gmail.com> wrote:

> It seems to me a setup issue. I just tested news20.binary (1355191
> features) on a 2-node EC2 cluster and it worked well. I added one line
> to conf/spark-env.sh:
>
> export SPARK_JAVA_OPTS=" -Dspark.akka.frameSize=20 "
>
> and launched spark-shell with "--driver-memory 20g". Could you re-try
> with an EC2 setup? If it still doesn't work, please attach all your
> code and logs.
>
> Best,
> Xiangrui
>
> On Sun, Jul 6, 2014 at 1:35 AM, Bharath Ravi Kumar <re...@gmail.com>
> wrote:
> > Hi Xiangrui,
> >
> > 1) Yes, I used the same build (compiled locally from source) to the host
> > that has (master, slave1) and the second host with slave2.
> >
> > 2) The execution was successful when run in local mode with reduced
> number
> > of partitions. Does this imply issues communicating/coordinating across
> > processes (i.e. driver, master and workers)?
> >
> > Thanks,
> > Bharath
> >
> >
> >
> > On Sun, Jul 6, 2014 at 11:37 AM, Xiangrui Meng <me...@gmail.com> wrote:
> >>
> >> Hi Bharath,
> >>
> >> 1) Did you sync the spark jar and conf to the worker nodes after build?
> >> 2) Since the dataset is not large, could you try local mode first
> >> using `spark-submit --driver-memory 12g --master local[*]`?
> >> 3) Try to use a smaller number of partitions, say 5.
> >>
> >> If the problem is still there, please attach the full master/worker log
> >> files.
> >>
> >> Best,
> >> Xiangrui
> >>
> >> On Fri, Jul 4, 2014 at 12:16 AM, Bharath Ravi Kumar <
> reachbach@gmail.com>
> >> wrote:
> >> > Xiangrui,
> >> >
> >> > Leaving the frameSize unspecified led to an error message (and failure)
> >> > stating that the task size (~11M) was larger. I hence set it to an
> >> > arbitrarily large value (I realize 500 was unrealistic & unnecessary in
> >> > this case). I've now set the size to 20M and repeated the runs. The
> >> > earlier runs were on an uncached RDD. Caching the RDD (and setting
> >> > spark.storage.memoryFraction=0.5) resulted in a marginal speed-up of
> >> > execution, but the end result remained the same. The cached RDD size is
> >> > as follows:
> >> >
> >> > RDD Name    Storage Level                         Cached Partitions    Fraction Cached    Size in Memory    Size in Tachyon    Size on Disk
> >> > 1084        Memory Deserialized 1x Replicated     80                   100%               165.9 MB          0.0 B              0.0 B
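> >> >
> >> > (A minimal sketch of the caching just described; rawPoints is a
> >> > placeholder for the RDD of labeled points:)
> >> >
> >> > import org.apache.spark.storage.StorageLevel
> >> >
> >> > // Persist the training set so each gradient descent iteration reuses the
> >> > // deserialized vectors instead of recomputing them from the input files.
> >> > val points = rawPoints.persist(StorageLevel.MEMORY_ONLY)
> >> > points.count() // force materialization before training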
> >> >
> >> >
> >> >
> >> > The corresponding master logs were:
> >> >
> >> > 14/07/04 06:29:34 INFO Master: Removing executor app-20140704062238-0033/1 because it is EXITED
> >> > 14/07/04 06:29:34 INFO Master: Launching executor app-20140704062238-0033/2 on worker worker-20140630124441-slave1-40182
> >> > 14/07/04 06:29:34 INFO Master: Removing executor app-20140704062238-0033/0 because it is EXITED
> >> > 14/07/04 06:29:34 INFO Master: Launching executor app-20140704062238-0033/3 on worker worker-20140630102913-slave2-44735
> >> > 14/07/04 06:29:37 INFO Master: Removing executor app-20140704062238-0033/2 because it is EXITED
> >> > 14/07/04 06:29:37 INFO Master: Launching executor app-20140704062238-0033/4 on worker worker-20140630124441-slave1-40182
> >> > 14/07/04 06:29:37 INFO Master: Removing executor app-20140704062238-0033/3 because it is EXITED
> >> > 14/07/04 06:29:37 INFO Master: Launching executor app-20140704062238-0033/5 on worker worker-20140630102913-slave2-44735
> >> > 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got disassociated, removing it.
> >> > 14/07/04 06:29:39 INFO Master: Removing app app-20140704062238-0033
> >> > 14/07/04 06:29:39 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.1.135%3A33061-123#1986674260] was not delivered. [39] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> >> > 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got disassociated, removing it.
> >> > 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got disassociated, removing it.
> >> > 14/07/04 06:29:39 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster@master:7077] -> [akka.tcp://spark@slave2:45172]: Error [Association failed with [akka.tcp://spark@slave2:45172]] [
> >> > akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@slave2:45172]
> >> > Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: slave2/10.3.1.135:45172
> >> > ]
> >> > 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got disassociated, removing it.
> >> > 14/07/04 06:29:39 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster@master:7077] -> [akka.tcp://spark@slave2:45172]: Error [Association failed with [akka.tcp://spark@slave2:45172]] [
> >> > akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@slave2:45172]
> >> > Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: slave2/10.3.1.135:45172
> >> > ]
> >> > 14/07/04 06:29:39 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster@master:7077] -> [akka.tcp://spark@slave2:45172]: Error [Association failed with [akka.tcp://spark@slave2:45172]] [
> >> > akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@slave2:45172]
> >> > Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: slave2/10.3.1.135:45172
> >> > ]
> >> > 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got disassociated, removing it.
> >> > 14/07/04 06:29:40 WARN Master: Got status update for unknown executor app-20140704062238-0033/5
> >> > 14/07/04 06:29:40 WARN Master: Got status update for unknown executor app-20140704062238-0033/4
> >> >
> >> >
> >> > Coincidentally, after the initial executor failed, each re-spawned
> >> > executor that followed failed with the following logs (e.g. the
> >> > following was from
> >> > slave1:~/spark-1.0.1-rc1/work/app-20140704062238-0033/2/stderr):
> >> >
> >> > log4j:WARN No appenders could be found for logger
> >> > (org.apache.hadoop.conf.Configuration).
> >> > log4j:WARN Please initialize the log4j system properly.
> >> > log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
> >> > 14/07/04 06:29:35 INFO SparkHadoopUtil: Using Spark's default log4j
> >> > profile:
> >> > org/apache/spark/log4j-defaults.properties
> >> > 14/07/04 06:29:35 INFO SecurityManager: Changing view acls to: user1
> >> > 14/07/04 06:29:35 INFO SecurityManager: SecurityManager:
> authentication
> >> > disabled; ui acls disabled; users with view permissions: Set(user1)
> >> > 14/07/04 06:29:35 INFO Slf4jLogger: Slf4jLogger started
> >> > 14/07/04 06:29:35 INFO Remoting: Starting remoting
> >> > 14/07/04 06:29:36 INFO Remoting: Remoting started; listening on
> >> > addresses
> >> > :[akka.tcp://sparkExecutor@slave1:54782]
> >> > 14/07/04 06:29:36 INFO Remoting: Remoting now listens on addresses:
> >> > [akka.tcp://sparkExecutor@slave1:54782]
> >> > 14/07/04 06:29:36 INFO CoarseGrainedExecutorBackend: Connecting to
> >> > driver:
> >> > akka.tcp://spark@master:45172/user/CoarseGrainedScheduler
> >> > 14/07/04 06:29:36 INFO WorkerWatcher: Connecting to worker
> >> > akka.tcp://sparkWorker@slave1:40182/user/Worker
> >> > 14/07/04 06:29:36 INFO WorkerWatcher: Successfully connected to
> >> > akka.tcp://sparkWorker@slave1:40182/user/Worker
> >> > 14/07/04 06:29:36 INFO CoarseGrainedExecutorBackend: Successfully
> >> > registered
> >> > with driver
> >> > 14/07/04 06:29:36 INFO SecurityManager: Changing view acls to: user1
> >> > 14/07/04 06:29:36 INFO SecurityManager: SecurityManager:
> authentication
> >> > disabled; ui acls disabled; users with view permissions: Set(user1)
> >> > 14/07/04 06:29:36 INFO Slf4jLogger: Slf4jLogger started
> >> > 14/07/04 06:29:36 INFO Remoting: Starting remoting
> >> > 14/07/04 06:29:36 INFO Remoting: Remoting started; listening on
> >> > addresses
> >> > :[akka.tcp://spark@slave1:39753]
> >> > 14/07/04 06:29:36 INFO SparkEnv: Connecting to MapOutputTracker:
> >> > akka.tcp://spark@master:45172/user/MapOutputTracker
> >> > 14/07/04 06:29:36 INFO SparkEnv: Connecting to BlockManagerMaster:
> >> > akka.tcp://spark@master:45172/user/BlockManagerMaster
> >> > 14/07/04 06:29:36 INFO DiskBlockManager: Created local directory at
> >> > /tmp/spark-local-20140704062936-6123
> >> > 14/07/04 06:29:36 INFO MemoryStore: MemoryStore started with capacity
> >> > 6.7
> >> > GB.
> >> > 14/07/04 06:29:36 INFO ConnectionManager: Bound socket to port 50960
> >> > with id
> >> > = ConnectionManagerId(slave1,50960)
> >> > 14/07/04 06:29:36 INFO BlockManagerMaster: Trying to register
> >> > BlockManager
> >> > 14/07/04 06:29:36 INFO BlockManagerMaster: Registered BlockManager
> >> > 14/07/04 06:29:36 INFO HttpFileServer: HTTP File server directory is
> >> > /tmp/spark-42c2782f-60f8-45a7-9e11-c789fc87fe2e
> >> > 14/07/04 06:29:36 INFO HttpServer: Starting HTTP Server
> >> > 14/07/04 06:29:36 ERROR CoarseGrainedExecutorBackend: Driver
> >> > Disassociated
> >> > [akka.tcp://sparkExecutor@slave1:54782] ->
> >> > [akka.tcp://spark@master:45172]
> >> > disassociated! Shutting down.
> >> >
> >> > In case of the initial executor that successfully started, the
> >> > corresponding log messages (from
> >> > spark-1.0.1-rc1/work/app-20140704062238-0033/1/stderr) on the executor
> >> > were:
> >> > log4j:WARN No appenders could be found for logger
> >> > (org.apache.hadoop.conf.Configuration).
> >> > log4j:WARN Please initialize the log4j system properly.
> >> > log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
> >> > 14/07/04 06:22:39 INFO SparkHadoopUtil: Using Spark's default log4j
> >> > profile:
> >> > org/apache/spark/log4j-defaults.properties
> >> > 14/07/04 06:22:39 INFO SecurityManager: Changing view acls to: user1
> >> > 14/07/04 06:22:39 INFO SecurityManager: SecurityManager:
> authentication
> >> > disabled; ui acls disabled; users with view permissions: Set(user1)
> >> > 14/07/04 06:22:39 INFO Slf4jLogger: Slf4jLogger started
> >> > 14/07/04 06:22:39 INFO Remoting: Starting remoting
> >> > 14/07/04 06:22:39 INFO Remoting: Remoting started; listening on
> >> > addresses
> >> > :[akka.tcp://sparkExecutor@slave1:50806]
> >> > 14/07/04 06:22:39 INFO Remoting: Remoting now listens on addresses:
> >> > [akka.tcp://sparkExecutor@slave1:50806]
> >> > 14/07/04 06:22:39 INFO CoarseGrainedExecutorBackend: Connecting to
> >> > driver:
> >> > akka.tcp://spark@master:45172/user/CoarseGrainedScheduler
> >> > 14/07/04 06:22:39 INFO WorkerWatcher: Connecting to worker
> >> > akka.tcp://sparkWorker@slave1:40182/user/Worker
> >> > 14/07/04 06:22:39 INFO WorkerWatcher: Successfully connected to
> >> > akka.tcp://sparkWorker@slave1:40182/user/Worker
> >> > 14/07/04 06:22:40 INFO CoarseGrainedExecutorBackend: Successfully
> >> > registered
> >> > with driver
> >> > 14/07/04 06:22:40 INFO SecurityManager: Changing view acls to: user1
> >> > 14/07/04 06:22:40 INFO SecurityManager: SecurityManager:
> authentication
> >> > disabled; ui acls disabled; users with view permissions: Set(user1)
> >> > 14/07/04 06:22:40 INFO Slf4jLogger: Slf4jLogger started
> >> > 14/07/04 06:22:40 INFO Remoting: Starting remoting
> >> > 14/07/04 06:22:40 INFO Remoting: Remoting started; listening on
> >> > addresses
> >> > :[akka.tcp://spark@slave1:38558]
> >> > 14/07/04 06:22:40 INFO SparkEnv: Connecting to MapOutputTracker:
> >> > akka.tcp://spark@master:45172/user/MapOutputTracker
> >> > 14/07/04 06:22:40 INFO SparkEnv: Connecting to BlockManagerMaster:
> >> > akka.tcp://spark@master:45172/user/BlockManagerMaster
> >> > 14/07/04 06:22:40 INFO DiskBlockManager: Created local directory at
> >> > /tmp/spark-local-20140704062240-6a65
> >> > 14/07/04 06:22:40 INFO MemoryStore: MemoryStore started with capacity
> >> > 6.7
> >> > GB.
> >> > 14/07/04 06:22:40 INFO ConnectionManager: Bound socket to port 46901
> >> > with id
> >> > = ConnectionManagerId(slave1,46901)
> >> > 14/07/04 06:22:40 INFO BlockManagerMaster: Trying to register
> >> > BlockManager
> >> > 14/07/04 06:22:40 INFO BlockManagerMaster: Registered BlockManager
> >> > 14/07/04 06:22:40 INFO HttpFileServer: HTTP File server directory is
> >> > /tmp/spark-9eba78f9-8ae9-477c-9338-7222ae6fe306
> >> > 14/07/04 06:22:40 INFO HttpServer: Starting HTTP Server
> >> > 14/07/04 06:22:42 INFO CoarseGrainedExecutorBackend: Got assigned
> task 0
> >> > 14/07/04 06:22:42 INFO Executor: Running task ID 0
> >> > 14/07/04 06:22:42 INFO CoarseGrainedExecutorBackend: Got assigned
> task 2
> >> > 14/07/04 06:22:42 INFO Executor: Running task ID 2
> >> > ...
> >> >
> >> >
> >> >
> >> > On Fri, Jul 4, 2014 at 5:52 AM, Xiangrui Meng <me...@gmail.com>
> wrote:
> >> >>
> >> >> The feature dimension is small. You don't need a big akka.frameSize.
> >> >> The default one (10M) should be sufficient. Did you cache the data
> >> >> before calling LRWithSGD? -Xiangrui
> >> >>
> >> >> On Thu, Jul 3, 2014 at 10:02 AM, Bharath Ravi Kumar
> >> >> <re...@gmail.com>
> >> >> wrote:
> >> >> > I tried another run after setting the driver memory to 8G (and
> >> >> > spark.akka.frameSize = 500 on the executors and the driver). In
> >> >> > addition, I also tried to reduce the amount of data that a single
> >> >> > task processes, by increasing the number of partitions (of the
> >> >> > labeled points) to 120 (instead of the 2 used earlier), and then
> >> >> > setting max cores to 2. That made no difference since, at the end of
> >> >> > 120 tasks, the familiar error message appeared on a slave:
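> >> >> >
> >> >> > (A minimal sketch of how those settings can be applied; inputPath and
> >> >> > parseToLabeledPoint are placeholders, and the two property values are
> >> >> > just the ones from this run:)
> >> >> >
> >> >> > import org.apache.spark.{SparkConf, SparkContext}
> >> >> >
> >> >> > val conf = new SparkConf()
> >> >> >   .set("spark.akka.frameSize", "500") // driver/executor message cap, in MB
> >> >> >   .set("spark.cores.max", "2")        // cap on concurrently used cores
> >> >> > val sc = new SparkContext(conf)
> >> >> > val points = sc.textFile(inputPath).map(parseToLabeledPoint).repartition(120)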
> >> >> >
> >> >> > <snipped earlier logs>
> >> >> > 14/07/03 16:18:48 INFO CoarseGrainedExecutorBackend: Got assigned
> >> >> > task
> >> >> > 1436
> >> >> > 14/07/03 16:18:48 INFO Executor: Running task ID 1436
> >> >> > 14/07/03 16:18:53 INFO HadoopRDD: Input split:
> >> >> > file:~//2014-05-24-02/part-r-00014:0+2215337
> >> >> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
> >> >> > file:~//2014-05-24-02/part-r-00014:2215337+2215338
> >> >> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
> >> >> > file:~//2014-05-24-02/part-r-00003:0+2196429
> >> >> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
> >> >> > file:~//2014-05-24-02/part-r-00003:2196429+2196430
> >> >> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
> >> >> > file:~//2014-05-24-02/part-r-00010:0+2186751
> >> >> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
> >> >> > file:~//2014-05-24-02/part-r-00010:2186751+2186751
> >> >> > 14/07/03 16:18:54 INFO Executor: Serialized size of result for 1436
> >> >> > is
> >> >> > 5958822
> >> >> > 14/07/03 16:18:54 INFO Executor: Sending result for 1436 directly
> to
> >> >> > driver
> >> >> > 14/07/03 16:18:54 INFO Executor: Finished task ID 1436
> >> >> > 14/07/03 16:18:54 INFO CoarseGrainedExecutorBackend: Got assigned
> >> >> > task
> >> >> > 1438
> >> >> > 14/07/03 16:18:54 INFO Executor: Running task ID 1438
> >> >> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
> >> >> > file:~//2014-05-24-02/part-r-00004:0+2209615
> >> >> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
> >> >> > file:~//2014-05-24-02/part-r-00004:2209615+2209616
> >> >> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
> >> >> > file:~//2014-05-24-02/part-r-00011:0+2202240
> >> >> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
> >> >> > file:~//2014-05-24-02/part-r-00011:2202240+2202240
> >> >> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
> >> >> > file:~//2014-05-24-02/part-r-00009:0+2194423
> >> >> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
> >> >> > file:~//2014-05-24-02/part-r-00009:2194423+2194424
> >> >> > 14/07/03 16:19:00 INFO Executor: Serialized size of result for 1438
> >> >> > is
> >> >> > 5958822
> >> >> > 14/07/03 16:19:00 INFO Executor: Sending result for 1438 directly
> to
> >> >> > driver
> >> >> > 14/07/03 16:19:00 INFO Executor: Finished task ID 1438
> >> >> > 14/07/03 16:19:14 ERROR CoarseGrainedExecutorBackend: Driver
> >> >> > Disassociated
> >> >> > [akka.tcp://sparkExecutor@slave1:51099] ->
> >> >> > [akka.tcp://spark@master:58272]
> >> >> > disassociated! Shutting down.
> >> >> >
> >> >> >
> >> >> > The corresponding master logs were:
> >> >> >
> >> >> > 14/07/03 16:02:14 INFO Master: Registering app LogRegExp
> >> >> > 14/07/03 16:02:14 INFO Master: Registered app LogRegExp with ID app-20140703160214-0028
> >> >> > 14/07/03 16:02:14 INFO Master: Launching executor app-20140703160214-0028/1 on worker worker-20140630124441-slave1-40182
> >> >> > 14/07/03 16:19:15 INFO Master: Removing executor app-20140703160214-0028/1 because it is EXITED
> >> >> > 14/07/03 16:19:15 INFO Master: Launching executor app-20140703160214-0028/2 on worker worker-20140630124441-slave1-40182
> >> >> > 14/07/03 16:19:15 INFO Master: Removing executor app-20140703160214-0028/0 because it is EXITED
> >> >> > 14/07/03 16:19:15 INFO Master: Launching executor app-20140703160214-0028/3 on worker worker-20140630102913-slave2-44735
> >> >> > 14/07/03 16:19:18 INFO Master: Removing executor app-20140703160214-0028/2 because it is EXITED
> >> >> > 14/07/03 16:19:18 INFO Master: Launching executor app-20140703160214-0028/4 on worker worker-20140630124441-slave1-40182
> >> >> > 14/07/03 16:19:18 INFO Master: Removing executor app-20140703160214-0028/3 because it is EXITED
> >> >> > 14/07/03 16:19:18 INFO Master: Launching executor app-20140703160214-0028/5 on worker worker-20140630102913-slave2-44735
> >> >> > 14/07/03 16:19:20 INFO Master: akka.tcp://spark@master:58272 got disassociated, removing it.
> >> >> > 14/07/03 16:19:20 INFO Master: Removing app app-20140703160214-0028
> >> >> > 14/07/03 16:19:20 INFO Master: akka.tcp://spark@master:58272 got disassociated, removing it.
> >> >> >
> >> >> >
> >> >> > Throughout the execution, I confirmed in the UI that the driver
> >> >> > memory used was 0.0 B / 6.9 GB and each executor's memory showed
> >> >> > 0.0 B / 12.1 GB even when the aggregate was being executed. On a
> >> >> > related note, I noticed in the executors tab that just before the
> >> >> > entire job terminated, the executors on slave1, slave2 and the driver
> >> >> > "disappeared" momentarily from the active executors list. The
> >> >> > replacement executors on slave1 and slave2 were re-spawned a couple
> >> >> > of times and appeared on the executors list again before they too
> >> >> > died and the job failed.
> >> >> > So it appears that no matter what the task input/result size, the
> >> >> > execution fails at the end of the stage corresponding to
> >> >> > GradientDescent.aggregate (and the preceding count() in
> >> >> > GradientDescent goes through fine). Let me know if you need any
> >> >> > additional information.
> >> >> >
> >> >> >
> >> >> > On Thu, Jul 3, 2014 at 12:27 PM, Xiangrui Meng <me...@gmail.com>
> >> >> > wrote:
> >> >> >>
> >> >> >> Could you check the driver memory in the executor tab of the Spark
> >> >> >> UI when the job is running? If it is too small, please set
> >> >> >> --driver-memory with spark-submit, e.g. 10g. Could you also attach
> >> >> >> the master log under spark/logs as well? -Xiangrui
> >> >> >>
> >> >> >> On Wed, Jul 2, 2014 at 9:34 AM, Bharath Ravi Kumar
> >> >> >> <re...@gmail.com>
> >> >> >> wrote:
> >> >> >> > Hi Xiangrui,
> >> >> >> >
> >> >> >> > The issue with aggregating/counting over large feature vectors (as
> >> >> >> > part of LogisticRegressionWithSGD) continues to exist, but now in
> >> >> >> > another form: while the execution doesn't freeze (due to
> >> >> >> > SPARK-1112), it now fails at the second or third gradient descent
> >> >> >> > iteration consistently with an error level log message, but no
> >> >> >> > stacktrace. I'm running against 1.0.1-rc1, and have tried setting
> >> >> >> > spark.akka.frameSize as high as 500. When the execution fails, each
> >> >> >> > of the two executors logs the following message (corresponding to
> >> >> >> > aggregate at GradientDescent.scala:178):
> >> >> >> >
> >> >> >> > 14/07/02 14:09:09 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
> >> >> >> > 14/07/02 14:09:09 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
> >> >> >> > 14/07/02 14:09:09 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 1 remote fetches in 0 ms
> >> >> >> > 14/07/02 14:09:11 INFO Executor: Serialized size of result for 737 is 5959086
> >> >> >> > 14/07/02 14:09:11 INFO Executor: Sending result for 737 directly to driver
> >> >> >> > 14/07/02 14:09:11 INFO Executor: Finished task ID 737
> >> >> >> > 14/07/02 14:09:18 ERROR CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@(slave1,slave2):51941] -> [akka.tcp://spark@master:59487] disassociated! Shutting down.
> >> >> >> >
> >> >> >> >
> >> >> >> > There is no separate stacktrace on the driver side.
> >> >> >> >
> >> >> >> > Each input record is of the form p1, p2, (p1,p2) where p1, p2 &
> >> >> >> > (p1,p2) are categorical features with large cardinality, and X is
> >> >> >> > the double label with a continuous value. The categorical variables
> >> >> >> > are converted to binary variables, which results in a feature
> >> >> >> > vector of size 741092 (composed of all unique categories across p1,
> >> >> >> > p2 and (p1,p2)). Thus, the labeled point for an input record is a
> >> >> >> > sparse vector of size 741092 with only 3 variables set in the
> >> >> >> > record. The total number of records is 683233 after aggregating the
> >> >> >> > input data on (p1, p2). When attempting to train on the
> >> >> >> > unaggregated records (1337907 in number, spread across 455 files),
> >> >> >> > the execution fails at count, GradientDescent.scala:161, with the
> >> >> >> > following log
> >> >> >> >
> >> >> >> >
> >> >> >> > (Snipped lines corresponding to other input files)
> >> >> >> > 14/07/02 16:02:03 INFO HadoopRDD: Input split: file:~/part-r-00012:2834590+2834590
> >> >> >> > 14/07/02 16:02:03 INFO HadoopRDD: Input split: file:~/part-r-00005:0+2845559
> >> >> >> > 14/07/02 16:02:03 INFO HadoopRDD: Input split: file:~/part-r-00005:2845559+2845560
> >> >> >> > 14/07/02 16:02:03 INFO Executor: Serialized size of result for 726 is 615
> >> >> >> > 14/07/02 16:02:03 INFO Executor: Sending result for 726 directly to driver
> >> >> >> > 14/07/02 16:02:03 INFO Executor: Finished task ID 726
> >> >> >> > 14/07/02 16:02:12 ERROR CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@slave1:48423] -> [akka.tcp://spark@master:55792] disassociated! Shutting down.
> >> >> >> >
> >> >> >> > A count() attempted on the input RDD before beginning training has
> >> >> >> > the following metrics:
> >> >> >> >
> >> >> >> >
> >> >> >> > Metric                              Min      25th     Median   75th     Max
> >> >> >> > Result serialization time           0 ms     0 ms     0 ms     0 ms     0 ms
> >> >> >> > Duration                            33 s     33 s     35 s     35 s     35 s
> >> >> >> > Time spent fetching task results    0 ms     0 ms     0 ms     0 ms     0 ms
> >> >> >> > Scheduler delay                     0.1 s    0.1 s    0.3 s    0.3 s    0.3 s
> >> >> >> >
> >> >> >> > Aggregated Metrics by Executor
> >> >> >> >
> >> >> >> > ID    Address                Task Time    Total    Failed    Succeeded    Shuffle Read    Shuffle Write    Shuf Spill (Mem)    Shuf Spill (Disk)
> >> >> >> > 0     CANNOT FIND ADDRESS    34 s         1        0         1            0.0 B           0.0 B            0.0 B               0.0 B
> >> >> >> > 1     CANNOT FIND ADDRESS    36 s         1        0         1            0.0 B           0.0 B            0.0 B               0.0 B
> >> >> >> >
> >> >> >> > Tasks
> >> >> >> >
> >> >> >> > Task Index    Task ID    Status     Locality Level    Executor    Launch Time            Duration    GC Time    Result Ser Time    Errors
> >> >> >> > 0             726        SUCCESS    PROCESS_LOCAL     slave1      2014/07/02 16:01:28    35 s                   0.1 s
> >> >> >> > 1             727        SUCCESS    PROCESS_LOCAL     slave2      2014/07/02 16:01:28    33 s                   99 ms
> >> >> >> >
> >> >> >> > Any pointers / diagnosis please?
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> > On Thu, Jun 19, 2014 at 10:03 AM, Bharath Ravi Kumar
> >> >> >> > <re...@gmail.com>
> >> >> >> > wrote:
> >> >> >> >>
> >> >> >> >> Thanks. I'll await the fix to re-run my test.
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> On Thu, Jun 19, 2014 at 8:28 AM, Xiangrui Meng <
> mengxr@gmail.com>
> >> >> >> >> wrote:
> >> >> >> >>>
> >> >> >> >>> Hi Bharath,
> >> >> >> >>>
> >> >> >> >>> This is related to SPARK-1112, which we already found the root
> >> >> >> >>> cause.
> >> >> >> >>> I will let you know when this is fixed.
> >> >> >> >>>
> >> >> >> >>> Best,
> >> >> >> >>> Xiangrui
> >> >> >> >>>
> >> >> >> >>> On Tue, Jun 17, 2014 at 7:37 PM, Bharath Ravi Kumar
> >> >> >> >>> <re...@gmail.com>
> >> >> >> >>> wrote:
> >> >> >> >>> > Couple more points:
> >> >> >> >>> > 1) The inexplicable stalling of execution with large feature
> >> >> >> >>> > sets appears similar to that reported with the news-20 dataset:
> >> >> >> >>> > http://mail-archives.apache.org/mod_mbox/spark-user/201406.mbox/%3C53A03542.1010605@gmail.com%3E
> >> >> >> >>> > 2) The NPE when trying to call mapToPair to convert an RDD<Long,
> >> >> >> >>> > Long, Integer, Integer> into a JavaPairRDD<Tuple2<Long,Long>,
> >> >> >> >>> > Tuple2<Integer,Integer>> is unrelated to mllib.
> >> >> >> >>> >
> >> >> >> >>> > Thanks,
> >> >> >> >>> > Bharath
> >> >> >> >>> >
> >> >> >> >>> >
> >> >> >> >>> >
> >> >> >> >>> > On Wed, Jun 18, 2014 at 7:14 AM, Bharath Ravi Kumar
> >> >> >> >>> > <re...@gmail.com>
> >> >> >> >>> > wrote:
> >> >> >> >>> >>
> >> >> >> >>> >> Hi Xiangrui,
> >> >> >> >>> >>
> >> >> >> >>> >> I'm using 1.0.0.
> >> >> >> >>> >>
> >> >> >> >>> >> Thanks,
> >> >> >> >>> >> Bharath
> >> >> >> >>> >>
> >> >> >> >>> >> On 18-Jun-2014 1:43 am, "Xiangrui Meng" <me...@gmail.com>
> >> >> >> >>> >> wrote:
> >> >> >> >>> >>>
> >> >> >> >>> >>> Hi Bharath,
> >> >> >> >>> >>>
> >> >> >> >>> >>> Thanks for posting the details! Which Spark version are
> you
> >> >> >> >>> >>> using?
> >> >> >> >>> >>>
> >> >> >> >>> >>> Best,
> >> >> >> >>> >>> Xiangrui
> >> >> >> >>> >>>
> >> >> >> >>> >>> On Tue, Jun 17, 2014 at 6:48 AM, Bharath Ravi Kumar
> >> >> >> >>> >>> <re...@gmail.com>
> >> >> >> >>> >>> wrote:
> >> >> >> >>> >>> > Hi,
> >> >> >> >>> >>> >
> >> >> >> >>> >>> > (Apologies for the long mail, but it's necessary to provide
> >> >> >> >>> >>> > sufficient details considering the number of issues faced.)
> >> >> >> >>> >>> >
> >> >> >> >>> >>> > I'm running into issues testing LogisticRegressionWithSGD on a two
> >> >> >> >>> >>> > node cluster (each node with 24 cores and 16G available to slaves
> >> >> >> >>> >>> > out of 24G on the system). Here's a description of the application:
> >> >> >> >>> >>> >
> >> >> >> >>> >>> > The model is being trained based on categorical features x, y, and
> >> >> >> >>> >>> > (x,y). The categorical features are mapped to binary features by
> >> >> >> >>> >>> > converting each distinct value in the category enum into a binary
> >> >> >> >>> >>> > feature by itself (i.e. presence of that value in a record implies
> >> >> >> >>> >>> > the corresponding feature = 1, else feature = 0; so there'd be as
> >> >> >> >>> >>> > many distinct features as enum values). The training vector is
> >> >> >> >>> >>> > laid out as [x1,x2...xn,y1,y2....yn,(x1,y1),(x2,y2)...(xn,yn)].
> >> >> >> >>> >>> > Each record in the training data has only one combination (Xk,Yk)
> >> >> >> >>> >>> > and a label appearing in the record. Thus, the corresponding
> >> >> >> >>> >>> > labeledpoint sparse vector would only have 3 values Xk, Yk,
> >> >> >> >>> >>> > (Xk,Yk) set for a record. The total length of the vector (though
> >> >> >> >>> >>> > sparse) would be nearly 614000. The number of records is about
> >> >> >> >>> >>> > 1.33 million. The records have been coalesced into 20 partitions
> >> >> >> >>> >>> > across two nodes. The input data has not been cached. A sketch of
> >> >> >> >>> >>> > this encoding follows the note below.
> >> >> >> >>> >>> > (NOTE: I do realize the records & features may seem large for a
> >> >> >> >>> >>> > two node setup, but given the memory & cpu, and the fact that I'm
> >> >> >> >>> >>> > willing to give up some turnaround time, I don't see why tasks
> >> >> >> >>> >>> > should inexplicably fail.)
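> >> >> >> >>> >>> >
> >> >> >> >>> >>> > (A minimal sketch of the encoding just described; Record and the
> >> >> >> >>> >>> > index maps are illustrative placeholders, not the actual
> >> >> >> >>> >>> > application code:)
> >> >> >> >>> >>> >
> >> >> >> >>> >>> > import org.apache.spark.mllib.linalg.Vectors
> >> >> >> >>> >>> > import org.apache.spark.mllib.regression.LabeledPoint
> >> >> >> >>> >>> >
> >> >> >> >>> >>> > case class Record(x: String, y: String, label: Double)
> >> >> >> >>> >>> > val dim = 614000 // total distinct values across x, y and (x,y)
> >> >> >> >>> >>> >
> >> >> >> >>> >>> > // xIndex/yIndex/pairIndex map a category value to its slot in the
> >> >> >> >>> >>> > // [x-block | y-block | (x,y)-block] layout described above.
> >> >> >> >>> >>> > def encode(r: Record, xIndex: Map[String, Int], yIndex: Map[String, Int],
> >> >> >> >>> >>> >            pairIndex: Map[(String, String), Int]): LabeledPoint = {
> >> >> >> >>> >>> >   val idx = Array(xIndex(r.x), yIndex(r.y), pairIndex((r.x, r.y))).sorted
> >> >> >> >>> >>> >   LabeledPoint(r.label, Vectors.sparse(dim, idx, Array(1.0, 1.0, 1.0)))
> >> >> >> >>> >>> > }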
> >> >> >> >>> >>> >
> >> >> >> >>> >>> > Additional parameters include:
> >> >> >> >>> >>> >
> >> >> >> >>> >>> > spark.executor.memory = 14G
> >> >> >> >>> >>> > spark.default.parallelism = 1
> >> >> >> >>> >>> > spark.cores.max=20
> >> >> >> >>> >>> > spark.storage.memoryFraction=0.8 // No cache space required
> >> >> >> >>> >>> > (Trying to set spark.akka.frameSize to a larger number, say, 20,
> >> >> >> >>> >>> > didn't help either)
> >> >> >> >>> >>> >
> >> >> >> >>> >>> > The model training was initialized as: new
> >> >> >> >>> >>> > LogisticRegressionWithSGD(1, maxIterations, 0.0, 0.05)
> >> >> >> >>> >>> >
> >> >> >> >>> >>> > However, after 4 iterations of gradient descent, the entire
> >> >> >> >>> >>> > execution appeared to stall inexplicably. The corresponding
> >> >> >> >>> >>> > executor details and details of the stalled stage (number 14) are
> >> >> >> >>> >>> > as follows:
> >> >> >> >>> >>> >
> >> >> >> >>> >>> > Metric                              Min      25th     Median   75th     Max
> >> >> >> >>> >>> > Result serialization time           12 ms    13 ms    14 ms    16 ms    18 ms
> >> >> >> >>> >>> > Duration                            4 s      4 s      5 s      5 s      5 s
> >> >> >> >>> >>> > Time spent fetching task results    0 ms     0 ms     0 ms     0 ms     0 ms
> >> >> >> >>> >>> > Scheduler delay                     6 s      6 s      6 s      6 s      12 s
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> > Stage Id 14, aggregate at GradientDescent.scala:178
> >> >> >> >>> >>> >
> >> >> >> >>> >>> > Task Index    Task ID    Status    Locality Level    Executor    Launch Time            Duration    GC Time    Result Ser Time    Errors
> >> >> >> >>> >>> >
> >> >> >> >>> >>> > 0     600     RUNNING     PROCESS_LOCAL     serious.dataone.foo.bar.com     2014/06/17 10:32:27     1.1 h
> >> >> >> >>> >>> > 1     601     RUNNING     PROCESS_LOCAL     casual.dataone.foo.bar.com      2014/06/17 10:32:27     1.1 h
> >> >> >> >>> >>> > 2     602     RUNNING     PROCESS_LOCAL     serious.dataone.foo.bar.com     2014/06/17 10:32:27     1.1 h
> >> >> >> >>> >>> > 3     603     RUNNING     PROCESS_LOCAL     casual.dataone.foo.bar.com      2014/06/17 10:32:27     1.1 h
> >> >> >> >>> >>> > 4     604     RUNNING     PROCESS_LOCAL     serious.dataone.foo.bar.com     2014/06/17 10:32:27     1.1 h
> >> >> >> >>> >>> > 5     605     SUCCESS     PROCESS_LOCAL     casual.dataone.foo.bar.com      2014/06/17 10:32:27     4 s     2 s     12 ms
> >> >> >> >>> >>> > 6     606     SUCCESS     PROCESS_LOCAL     serious.dataone.foo.bar.com     2014/06/17 10:32:27     4 s     1 s     14 ms
> >> >> >> >>> >>> > 7     607     SUCCESS     PROCESS_LOCAL     casual.dataone.foo.bar.com      2014/06/17 10:32:27     4 s     2 s     12 ms
> >> >> >> >>> >>> > 8     608     SUCCESS     PROCESS_LOCAL     serious.dataone.foo.bar.com     2014/06/17 10:32:27     5 s     1 s     15 ms
> >> >> >> >>> >>> > 9     609     SUCCESS     PROCESS_LOCAL     casual.dataone.foo.bar.com      2014/06/17 10:32:27     5 s     1 s     14 ms
> >> >> >> >>> >>> > 10    610     SUCCESS     PROCESS_LOCAL     serious.dataone.foo.bar.com     2014/06/17 10:32:27     5 s     1 s     15 ms
> >> >> >> >>> >>> > 11    611     SUCCESS     PROCESS_LOCAL     casual.dataone.foo.bar.com      2014/06/17 10:32:27     4 s     1 s     13 ms
> >> >> >> >>> >>> > 12    612     SUCCESS     PROCESS_LOCAL     serious.dataone.foo.bar.com     2014/06/17 10:32:27     5 s     1 s     18 ms
> >> >> >> >>> >>> > 13    613     SUCCESS     PROCESS_LOCAL     casual.dataone.foo.bar.com      2014/06/17 10:32:27     5 s     1 s     13 ms
> >> >> >> >>> >>> > 14    614     SUCCESS     PROCESS_LOCAL     serious.dataone.foo.bar.com     2014/06/17 10:32:27     4 s     1 s     14 ms
> >> >> >> >>> >>> > 15    615     SUCCESS     PROCESS_LOCAL     casual.dataone.foo.bar.com      2014/06/17 10:32:27     4 s     1 s     12 ms
> >> >> >> >>> >>> > 16    616     SUCCESS     PROCESS_LOCAL     serious.dataone.foo.bar.com     2014/06/17 10:32:27     5 s     1 s     15 ms
> >> >> >> >>> >>> > 17    617     SUCCESS     PROCESS_LOCAL     casual.dataone.foo.bar.com      2014/06/17 10:32:27     5 s     1 s     18 ms
> >> >> >> >>> >>> > 18    618     SUCCESS     PROCESS_LOCAL     serious.dataone.foo.bar.com     2014/06/17 10:32:27     5 s     1 s     16 ms
> >> >> >> >>> >>> > 19    619     SUCCESS     PROCESS_LOCAL     casual.dataone.foo.bar.com      2014/06/17 10:32:27     4 s     1 s     18 ms
> >> >> >> >>> >>> >
> >> >> >> >>> >>> > Executor stats:
> >> >> >> >>> >>> >
> >> >> >> >>> >>> > RDD Blocks    Memory Used       Disk Used    Active Tasks    Failed Tasks    Complete Tasks    Total Tasks    Task Time    Shuffle Read    Shuffle Write
> >> >> >> >>> >>> > 0             0.0 B / 6.7 GB    0.0 B        2               0               307               309            23.2 m       0.0 B           0.0 B
> >> >> >> >>> >>> > 0             0.0 B / 6.7 GB    0.0 B        3               0               308               311            22.4 m       0.0 B           0.0 B
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> > Executor jmap output:
> >> >> >> >>> >>> >
> >> >> >> >>> >>> > Server compiler detected.
> >> >> >> >>> >>> > JVM version is 24.55-b03
> >> >> >> >>> >>> >
> >> >> >> >>> >>> > using thread-local object allocation.
> >> >> >> >>> >>> > Parallel GC with 18 thread(s)
> >> >> >> >>> >>> >
> >> >> >> >>> >>> > Heap Configuration:
> >> >> >> >>> >>> >    MinHeapFreeRatio = 40
> >> >> >> >>> >>> >    MaxHeapFreeRatio = 70
> >> >> >> >>> >>> >    MaxHeapSize      = 10737418240 (10240.0MB)
> >> >> >> >>> >>> >    NewSize          = 1310720 (1.25MB)
> >> >> >> >>> >>> >    MaxNewSize       = 17592186044415 MB
> >> >> >> >>> >>> >    OldSize          = 5439488 (5.1875MB)
> >> >> >> >>> >>> >    NewRatio         = 2
> >> >> >> >>> >>> >    SurvivorRatio    = 8
> >> >> >> >>> >>> >    PermSize         = 21757952 (20.75MB)
> >> >> >> >>> >>> >    MaxPermSize      = 134217728 (128.0MB)
> >> >> >> >>> >>> >    G1HeapRegionSize = 0 (0.0MB)
> >> >> >> >>> >>> >
> >> >> >> >>> >>> > Heap Usage:
> >> >> >> >>> >>> > PS Young Generation
> >> >> >> >>> >>> > Eden Space:
> >> >> >> >>> >>> >    capacity = 2783969280 (2655.0MB)
> >> >> >> >>> >>> >    used     = 192583816 (183.66223907470703MB)
> >> >> >> >>> >>> >    free     = 2591385464 (2471.337760925293MB)
> >> >> >> >>> >>> >    6.917598458557704% used
> >> >> >> >>> >>> > From Space:
> >> >> >> >>> >>> >    capacity = 409993216 (391.0MB)
> >> >> >> >>> >>> >    used     = 1179808 (1.125152587890625MB)
> >> >> >> >>> >>> >    free     = 408813408 (389.8748474121094MB)
> >> >> >> >>> >>> >    0.2877628102022059% used
> >> >> >> >>> >>> > To Space:
> >> >> >> >>> >>> >    capacity = 385351680 (367.5MB)
> >> >> >> >>> >>> >    used     = 0 (0.0MB)
> >> >> >> >>> >>> >    free     = 385351680 (367.5MB)
> >> >> >> >>> >>> >    0.0% used
> >> >> >> >>> >>> > PS Old Generation
> >> >> >> >>> >>> >    capacity = 7158628352 (6827.0MB)
> >> >> >> >>> >>> >    used     = 4455093024 (4248.707794189453MB)
> >> >> >> >>> >>> >    free     = 2703535328 (2578.292205810547MB)
> >> >> >> >>> >>> >    62.2338918146983% used
> >> >> >> >>> >>> > PS Perm Generation
> >> >> >> >>> >>> >    capacity = 90701824 (86.5MB)
> >> >> >> >>> >>> >    used     = 45348832 (43.248016357421875MB)
> >> >> >> >>> >>> >    free     = 45352992 (43.251983642578125MB)
> >> >> >> >>> >>> >    49.99770677158598% used
> >> >> >> >>> >>> >
> >> >> >> >>> >>> > 8432 interned Strings occupying 714672 bytes.
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> > Executor GC log snippet:
> >> >> >> >>> >>> >
> >> >> >> >>> >>> > 168.778: [GC [PSYoungGen: 2702831K->578545K(2916864K)] 9302453K->7460857K(9907712K), 0.3193550 secs] [Times: user=5.13 sys=0.39, real=0.32 secs]
> >> >> >> >>> >>> > 169.097: [Full GC [PSYoungGen: 578545K->0K(2916864K)] [ParOldGen: 6882312K->1073297K(6990848K)] 7460857K->1073297K(9907712K) [PSPermGen: 44248K->44201K(88576K)], 4.5521090 secs] [Times: user=24.22 sys=0.18, real=4.55 secs]
> >> >> >> >>> >>> > 174.207: [GC [PSYoungGen: 2338304K->81315K(2544128K)] 3411653K->1154665K(9534976K), 0.0966280 secs] [Times: user=1.66 sys=0.00, real=0.09 secs]
> >> >> >> >>> >>> >
> >> >> >> >>> >>> > I tried to map partitions to cores on the nodes. Increasing the
> >> >> >> >>> >>> > number of partitions (say to 80 or 100) would result in progress
> >> >> >> >>> >>> > till the 6th iteration or so, but the next stage would stall as
> >> >> >> >>> >>> > before, with no apparent root cause in the logs. With increased
> >> >> >> >>> >>> > partitions, the last stage that completed had the following task
> >> >> >> >>> >>> > times:
> >> >> >> >>> >>> >
> >> >> >> >>> >>> > Metric                              Min      25th     Median   75th     Max
> >> >> >> >>> >>> > Result serialization time           11 ms    12 ms    13 ms    15 ms    0.4 s
> >> >> >> >>> >>> > Duration                            0.5 s    0.9 s    1 s      3 s      7 s
> >> >> >> >>> >>> > Time spent fetching task results    0 ms     0 ms     0 ms     0 ms     0 ms
> >> >> >> >>> >>> > Scheduler delay                     5 s      6 s      6 s      7 s      12 s
> >> >> >> >>> >>> >
> >> >> >> >>> >>> > My hypothesis is that as the coefficient array becomes less sparse
> >> >> >> >>> >>> > (with successive iterations), the cost of the aggregate grows to
> >> >> >> >>> >>> > the point that the stage stalls (which I haven't been able to
> >> >> >> >>> >>> > explain). Reducing the batch fraction to a very low number like
> >> >> >> >>> >>> > 0.01 let the iterations progress further, but the model then
> >> >> >> >>> >>> > failed to converge after a small number of iterations.
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> > I also tried reducing the number of records by aggregating on
> >> >> >> >>> >>> > (x,y) as the key (i.e. training on aggregates instead of on every
> >> >> >> >>> >>> > raw record), but encountered the following exception:
> >> >> >> >>> >>> >
> >> >> >> >>> >>> > Loss was due to java.lang.NullPointerException
> >> >> >> >>> >>> > java.lang.NullPointerException
> >> >> >> >>> >>> >         at
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
> >> >> >> >>> >>> >         at
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
> >> >> >> >>> >>> >         at
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> >> >> >> >>> >>> >         at
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:59)
> >> >> >> >>> >>> >         at
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
> >> >> >> >>> >>> >         at
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
> >> >> >> >>> >>> >         at
> >> >> >> >>> >>> >
> org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
> >> >> >> >>> >>> >         at
> >> >> >> >>> >>> >
> org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
> >> >> >> >>> >>> >         at
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> >> >> >> >>> >>> >         at
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> >> >> >> >>> >>> >         at
> >> >> >> >>> >>> > org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> >> >> >> >>> >>> >         at
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
> >> >> >> >>> >>> >         at
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> >> >> >> >>> >>> >         at
> >> >> >> >>> >>> > org.apache.spark.scheduler.Task.run(Task.scala:51)
> >> >> >> >>> >>> >         at
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
> >> >> >> >>> >>> >         at
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >> >> >> >>> >>> >         at
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >> >> >> >>> >>> >         at java.lang.Thread.run(Thread.java:745)
> >> >> >> >>> >>> >
> >> >> >> >>> >>> >
> >> >> >> >>> >>> > I'd appreciate any insights/comments about what may be
> >> >> >> >>> >>> > causing
> >> >> >> >>> >>> > the
> >> >> >> >>> >>> > execution
> >> >> >> >>> >>> > to stall.
> >> >> >> >>> >>> >
> >> >> >> >>> >>> > If logs/tables appear poorly indented in the email,
> here's
> >> >> >> >>> >>> > a
> >> >> >> >>> >>> > gist
> >> >> >> >>> >>> > with
> >> >> >> >>> >>> > relevant details:
> >> >> >> >>> >>> > https://gist.github.com/reachbach/a418ab2f01b639b624c1
> >> >> >> >>> >>> >
> >> >> >> >>> >>> > Thanks,
> >> >> >> >>> >>> > Bharath
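
A sanity check on the sizes reported in this thread: a dense array over the
741092-wide feature space at 8 bytes per double is 741092 x 8 = 5928736
bytes, roughly 5.9 MB, which closely matches the "Serialized size of result
for ... is 5959086" lines in the executor logs. That is consistent with the
hypothesis above: once the coefficient vector densifies, each task result
carries a vector of the full feature width per iteration, regardless of how
sparse the input records are.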
> >> >> >> >>> >
> >> >> >> >>> >
> >> >> >> >>
> >> >> >> >>
> >> >> >> >
> >> >> >
> >> >> >
> >> >
> >> >
> >
> >
>

Re: Execution stalls in LogisticRegressionWithSGD

Posted by Xiangrui Meng <me...@gmail.com>.
It seems to me to be a setup issue. I just tested news20.binary (1355191
features) on a 2-node EC2 cluster and it worked well. I added one line
to conf/spark-env.sh:

export SPARK_JAVA_OPTS=" -Dspark.akka.frameSize=20 "

and launched spark-shell with "--driver-memory 20g". Could you re-try
with an EC2 setup? If it still doesn't work, please attach all your
code and logs.

Best,
Xiangrui
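
For readers reconstructing this test, a minimal spark-shell session along
the lines Xiangrui describes might look as follows. This is a sketch, not
his exact code: the dataset path is an assumption, and news20.binary is in
LIBSVM format, so MLUtils.loadLibSVMFile applies.

import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

// Load the LIBSVM-formatted news20.binary dataset (path is hypothetical)
// and cache it so each SGD iteration reads it from memory.
val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/news20.binary").cache()

// Train with the default step size over a handful of SGD iterations.
val model = LogisticRegressionWithSGD.train(data, 10)
println("Trained a model with " + model.weights.size + " weights")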

On Sun, Jul 6, 2014 at 1:35 AM, Bharath Ravi Kumar <re...@gmail.com> wrote:
> Hi Xiangrui,
>
> 1) Yes, I used the same build (compiled locally from source) to the host
> that has (master, slave1) and the second host with slave2.
>
> 2) The execution was successful when run in local mode with a reduced number
> of partitions. Does this imply issues communicating/coordinating across
> processes (i.e. driver, master and workers)?
>
> Thanks,
> Bharath
>
>
>
> On Sun, Jul 6, 2014 at 11:37 AM, Xiangrui Meng <me...@gmail.com> wrote:
>>
>> Hi Bharath,
>>
>> 1) Did you sync the spark jar and conf to the worker nodes after build?
>> 2) Since the dataset is not large, could you try local mode first
>> using `spark-submit --driver-memory 12g --master local[*]`?
>> 3) Try to use a smaller number of partitions, say 5.
>>
>> If the problem is still there, please attach the full master/worker log
>> files.
>>
>> Best,
>> Xiangrui
>>
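
A sketch of suggestion 3) above, assuming the parsed training data lives in
an RDD of LabeledPoint (the names here are hypothetical, not from the
thread):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Collapse the input to a handful of partitions before training, so the
// per-iteration aggregate fans in over fewer task results.
def withFewerPartitions(points: RDD[LabeledPoint]): RDD[LabeledPoint] =
  points.coalesce(5)
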
>> On Fri, Jul 4, 2014 at 12:16 AM, Bharath Ravi Kumar <re...@gmail.com>
>> wrote:
>> > Xiangrui,
>> >
>> > Leaving the frameSize unspecified led to an error message (and
>> > failure) stating that the task size (~11M) was larger than the frame
>> > size. I hence set it to an arbitrarily large value (I realize 500 was
>> > unrealistic & unnecessary in this case). I've now set the size to 20M
>> > and repeated the runs. The earlier runs were on an uncached RDD.
>> > Caching the RDD (and setting spark.storage.memoryFraction=0.5)
>> > resulted in a marginal speed up of execution, but the end result
>> > remained the same. The cached RDD size is as follows:
>> >
>> > RDD Name    Storage Level                        Cached Partitions    Fraction Cached    Size in Memory    Size in Tachyon    Size on Disk
>> > 1084        Memory Deserialized 1x Replicated    80                   100%               165.9 MB          0.0 B              0.0 B
>> >
>> >
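
The frame size being discussed can also be set programmatically on the
driver; a minimal sketch, assuming a Spark 1.0-style setup (the value
matches the 20M settled on above, and the app name is taken from the master
logs below):

import org.apache.spark.{SparkConf, SparkContext}

// spark.akka.frameSize is specified in MB.
val conf = new SparkConf()
  .setAppName("LogRegExp")
  .set("spark.akka.frameSize", "20")
val sc = new SparkContext(conf)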
>> >
>> > The corresponding master logs were:
>> >
>> > 14/07/04 06:29:34 INFO Master: Removing executor
>> > app-20140704062238-0033/1
>> > because it is EXITED
>> > 14/07/04 06:29:34 INFO Master: Launching executor
>> > app-20140704062238-0033/2
>> > on worker worker-20140630124441-slave1-40182
>> > 14/07/04 06:29:34 INFO Master: Removing executor
>> > app-20140704062238-0033/0
>> > because it is EXITED
>> > 14/07/04 06:29:34 INFO Master: Launching executor
>> > app-20140704062238-0033/3
>> > on worker worker-20140630102913-slave2-44735
>> > 14/07/04 06:29:37 INFO Master: Removing executor
>> > app-20140704062238-0033/2
>> > because it is EXITED
>> > 14/07/04 06:29:37 INFO Master: Launching executor
>> > app-20140704062238-0033/4
>> > on worker worker-20140630124441-slave1-40182
>> > 14/07/04 06:29:37 INFO Master: Removing executor
>> > app-20140704062238-0033/3
>> > because it is EXITED
>> > 14/07/04 06:29:37 INFO Master: Launching executor
>> > app-20140704062238-0033/5
>> > on worker worker-20140630102913-slave2-44735
>> > 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
>> > disassociated, removing it.
>> > 14/07/04 06:29:39 INFO Master: Removing app app-20140704062238-0033
>> > 14/07/04 06:29:39 INFO LocalActorRef: Message
>> > [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying]
>> > from
>> > Actor[akka://sparkMaster/deadLetters] to
>> >
>> > Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.1.135%3A33061-123#1986674260]
>> > was not delivered. [39] dead letters encountered. This logging can be
>> > turned
>> > off or adjusted with configuration settings 'akka.log-dead-letters' and
>> > 'akka.log-dead-letters-during-shutdown'.
>> > 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
>> > disassociated, removing it.
>> > 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
>> > disassociated, removing it.
>> > 14/07/04 06:29:39 ERROR EndpointWriter: AssociationError
>> > [akka.tcp://sparkMaster@master:7077] -> [akka.tcp://spark@slave2:45172]:
>> > Error [Association failed with [akka.tcp://spark@slave2:45172]] [
>> > akka.remote.EndpointAssociationException: Association failed with
>> > [akka.tcp://spark@slave2:45172]
>> > Caused by:
>> > akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>> > Connection refused: slave2/10.3.1.135:45172
>> > ]
>> > 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
>> > disassociated, removing it.
>> > 14/07/04 06:29:39 ERROR EndpointWriter: AssociationError
>> > [akka.tcp://sparkMaster@master:7077] -> [akka.tcp://spark@slave2:45172]:
>> > Error [Association failed with [akka.tcp://spark@slave2:45172]] [
>> > akka.remote.EndpointAssociationException: Association failed with
>> > [akka.tcp://spark@slave2:45172]
>> > Caused by:
>> > akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>> > Connection refused: slave2/10.3.1.135:45172
>> > ]
>> > 14/07/04 06:29:39 ERROR EndpointWriter: AssociationError
>> > [akka.tcp://sparkMaster@master:7077] -> [akka.tcp://spark@slave2:45172]:
>> > Error [Association failed with [akka.tcp://spark@slave2:45172]] [
>> > akka.remote.EndpointAssociationException: Association failed with
>> > [akka.tcp://spark@slave2:45172]
>> > Caused by:
>> > akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>> > Connection refused: slave2/10.3.1.135:45172
>> > ]
>> > 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
>> > disassociated, removing it.
>> > 14/07/04 06:29:40 WARN Master: Got status update for unknown executor
>> > app-20140704062238-0033/5
>> > 14/07/04 06:29:40 WARN Master: Got status update for unknown executor
>> > app-20140704062238-0033/4
>> >
>> >
>> > Coincidentally, after the initial executor failed, each following
>> > executor
>> > that was re-spawned failed with the following logs:
>> > (e.g the following was from
>> > slave1:~/spark-1.0.1-rc1/work/app-20140704062238-0033/2/stderr)
>> >
>> > log4j:WARN No appenders could be found for logger
>> > (org.apache.hadoop.conf.Configuration).
>> > log4j:WARN Please initialize the log4j system properly.
>> > log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
>> > more info.
>> > 14/07/04 06:29:35 INFO SparkHadoopUtil: Using Spark's default log4j
>> > profile:
>> > org/apache/spark/log4j-defaults.properties
>> > 14/07/04 06:29:35 INFO SecurityManager: Changing view acls to: user1
>> > 14/07/04 06:29:35 INFO SecurityManager: SecurityManager: authentication
>> > disabled; ui acls disabled; users with view permissions: Set(user1)
>> > 14/07/04 06:29:35 INFO Slf4jLogger: Slf4jLogger started
>> > 14/07/04 06:29:35 INFO Remoting: Starting remoting
>> > 14/07/04 06:29:36 INFO Remoting: Remoting started; listening on
>> > addresses
>> > :[akka.tcp://sparkExecutor@slave1:54782]
>> > 14/07/04 06:29:36 INFO Remoting: Remoting now listens on addresses:
>> > [akka.tcp://sparkExecutor@slave1:54782]
>> > 14/07/04 06:29:36 INFO CoarseGrainedExecutorBackend: Connecting to
>> > driver:
>> > akka.tcp://spark@master:45172/user/CoarseGrainedScheduler
>> > 14/07/04 06:29:36 INFO WorkerWatcher: Connecting to worker
>> > akka.tcp://sparkWorker@slave1:40182/user/Worker
>> > 14/07/04 06:29:36 INFO WorkerWatcher: Successfully connected to
>> > akka.tcp://sparkWorker@slave1:40182/user/Worker
>> > 14/07/04 06:29:36 INFO CoarseGrainedExecutorBackend: Successfully
>> > registered
>> > with driver
>> > 14/07/04 06:29:36 INFO SecurityManager: Changing view acls to: user1
>> > 14/07/04 06:29:36 INFO SecurityManager: SecurityManager: authentication
>> > disabled; ui acls disabled; users with view permissions: Set(user1)
>> > 14/07/04 06:29:36 INFO Slf4jLogger: Slf4jLogger started
>> > 14/07/04 06:29:36 INFO Remoting: Starting remoting
>> > 14/07/04 06:29:36 INFO Remoting: Remoting started; listening on
>> > addresses
>> > :[akka.tcp://spark@slave1:39753]
>> > 14/07/04 06:29:36 INFO SparkEnv: Connecting to MapOutputTracker:
>> > akka.tcp://spark@master:45172/user/MapOutputTracker
>> > 14/07/04 06:29:36 INFO SparkEnv: Connecting to BlockManagerMaster:
>> > akka.tcp://spark@master:45172/user/BlockManagerMaster
>> > 14/07/04 06:29:36 INFO DiskBlockManager: Created local directory at
>> > /tmp/spark-local-20140704062936-6123
>> > 14/07/04 06:29:36 INFO MemoryStore: MemoryStore started with capacity
>> > 6.7
>> > GB.
>> > 14/07/04 06:29:36 INFO ConnectionManager: Bound socket to port 50960
>> > with id
>> > = ConnectionManagerId(slave1,50960)
>> > 14/07/04 06:29:36 INFO BlockManagerMaster: Trying to register
>> > BlockManager
>> > 14/07/04 06:29:36 INFO BlockManagerMaster: Registered BlockManager
>> > 14/07/04 06:29:36 INFO HttpFileServer: HTTP File server directory is
>> > /tmp/spark-42c2782f-60f8-45a7-9e11-c789fc87fe2e
>> > 14/07/04 06:29:36 INFO HttpServer: Starting HTTP Server
>> > 14/07/04 06:29:36 ERROR CoarseGrainedExecutorBackend: Driver
>> > Disassociated
>> > [akka.tcp://sparkExecutor@slave1:54782] ->
>> > [akka.tcp://spark@master:45172]
>> > disassociated! Shutting down.
>> >
>> > In case of the initial executor that successfully started, the
>> > corresponding
>> > log messages (from
>> > spark-1.0.1-rc1/work/app-20140704062238-0033/1/stderr) on
>> > the executor were:
>> > log4j:WARN No appenders could be found for logger
>> > (org.apache.hadoop.conf.Configuration).
>> > log4j:WARN Please initialize the log4j system properly.
>> > log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
>> > more info.
>> > 14/07/04 06:22:39 INFO SparkHadoopUtil: Using Spark's default log4j
>> > profile:
>> > org/apache/spark/log4j-defaults.properties
>> > 14/07/04 06:22:39 INFO SecurityManager: Changing view acls to: user1
>> > 14/07/04 06:22:39 INFO SecurityManager: SecurityManager: authentication
>> > disabled; ui acls disabled; users with view permissions: Set(user1)
>> > 14/07/04 06:22:39 INFO Slf4jLogger: Slf4jLogger started
>> > 14/07/04 06:22:39 INFO Remoting: Starting remoting
>> > 14/07/04 06:22:39 INFO Remoting: Remoting started; listening on
>> > addresses
>> > :[akka.tcp://sparkExecutor@slave1:50806]
>> > 14/07/04 06:22:39 INFO Remoting: Remoting now listens on addresses:
>> > [akka.tcp://sparkExecutor@slave1:50806]
>> > 14/07/04 06:22:39 INFO CoarseGrainedExecutorBackend: Connecting to
>> > driver:
>> > akka.tcp://spark@master:45172/user/CoarseGrainedScheduler
>> > 14/07/04 06:22:39 INFO WorkerWatcher: Connecting to worker
>> > akka.tcp://sparkWorker@slave1:40182/user/Worker
>> > 14/07/04 06:22:39 INFO WorkerWatcher: Successfully connected to
>> > akka.tcp://sparkWorker@slave1:40182/user/Worker
>> > 14/07/04 06:22:40 INFO CoarseGrainedExecutorBackend: Successfully
>> > registered
>> > with driver
>> > 14/07/04 06:22:40 INFO SecurityManager: Changing view acls to: user1
>> > 14/07/04 06:22:40 INFO SecurityManager: SecurityManager: authentication
>> > disabled; ui acls disabled; users with view permissions: Set(user1)
>> > 14/07/04 06:22:40 INFO Slf4jLogger: Slf4jLogger started
>> > 14/07/04 06:22:40 INFO Remoting: Starting remoting
>> > 14/07/04 06:22:40 INFO Remoting: Remoting started; listening on
>> > addresses
>> > :[akka.tcp://spark@slave1:38558]
>> > 14/07/04 06:22:40 INFO SparkEnv: Connecting to MapOutputTracker:
>> > akka.tcp://spark@master:45172/user/MapOutputTracker
>> > 14/07/04 06:22:40 INFO SparkEnv: Connecting to BlockManagerMaster:
>> > akka.tcp://spark@master:45172/user/BlockManagerMaster
>> > 14/07/04 06:22:40 INFO DiskBlockManager: Created local directory at
>> > /tmp/spark-local-20140704062240-6a65
>> > 14/07/04 06:22:40 INFO MemoryStore: MemoryStore started with capacity
>> > 6.7
>> > GB.
>> > 14/07/04 06:22:40 INFO ConnectionManager: Bound socket to port 46901
>> > with id
>> > = ConnectionManagerId(slave1,46901)
>> > 14/07/04 06:22:40 INFO BlockManagerMaster: Trying to register
>> > BlockManager
>> > 14/07/04 06:22:40 INFO BlockManagerMaster: Registered BlockManager
>> > 14/07/04 06:22:40 INFO HttpFileServer: HTTP File server directory is
>> > /tmp/spark-9eba78f9-8ae9-477c-9338-7222ae6fe306
>> > 14/07/04 06:22:40 INFO HttpServer: Starting HTTP Server
>> > 14/07/04 06:22:42 INFO CoarseGrainedExecutorBackend: Got assigned task 0
>> > 14/07/04 06:22:42 INFO Executor: Running task ID 0
>> > 14/07/04 06:22:42 INFO CoarseGrainedExecutorBackend: Got assigned task 2
>> > 14/07/04 06:22:42 INFO Executor: Running task ID 2
>> > ...
>> >
>> >
>> >
>> > On Fri, Jul 4, 2014 at 5:52 AM, Xiangrui Meng <me...@gmail.com> wrote:
>> >>
>> >> The feature dimension is small. You don't need a big akka.frameSize.
>> >> The default one (10M) should be sufficient. Did you cache the data
>> >> before calling LRWithSGD? -Xiangrui
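
A minimal sketch of the caching being asked about, assuming `points` holds
the parsed LabeledPoints (a hypothetical name):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Cache the training set so each gradient descent iteration re-reads it
// from memory instead of re-parsing the input files.
def trainCached(points: RDD[LabeledPoint], iterations: Int) = {
  val cached = points.cache()
  LogisticRegressionWithSGD.train(cached, iterations)
}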
>> >>
>> >> On Thu, Jul 3, 2014 at 10:02 AM, Bharath Ravi Kumar
>> >> <re...@gmail.com>
>> >> wrote:
>> >> > I tried another run after setting the driver memory to 8G (and
>> >> > spark.akka.frameSize = 500 on the executors and the driver). In
>> >> > addition, I also tried to reduce the amount of data that a single
>> >> > task processes, by increasing the number of partitions (of the
>> >> > labeled points) to 120 (instead of 2 used earlier), and then setting
>> >> > max cores to 2. That made no difference since, at the end of 120
>> >> > tasks, the familiar error message appeared on a slave:
>> >> >
>> >> > <snipped earlier logs>
>> >> > 14/07/03 16:18:48 INFO CoarseGrainedExecutorBackend: Got assigned
>> >> > task
>> >> > 1436
>> >> > 14/07/03 16:18:48 INFO Executor: Running task ID 1436
>> >> > 14/07/03 16:18:53 INFO HadoopRDD: Input split:
>> >> > file:~//2014-05-24-02/part-r-00014:0+2215337
>> >> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
>> >> > file:~//2014-05-24-02/part-r-00014:2215337+2215338
>> >> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
>> >> > file:~//2014-05-24-02/part-r-00003:0+2196429
>> >> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
>> >> > file:~//2014-05-24-02/part-r-00003:2196429+2196430
>> >> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
>> >> > file:~//2014-05-24-02/part-r-00010:0+2186751
>> >> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
>> >> > file:~//2014-05-24-02/part-r-00010:2186751+2186751
>> >> > 14/07/03 16:18:54 INFO Executor: Serialized size of result for 1436
>> >> > is
>> >> > 5958822
>> >> > 14/07/03 16:18:54 INFO Executor: Sending result for 1436 directly to
>> >> > driver
>> >> > 14/07/03 16:18:54 INFO Executor: Finished task ID 1436
>> >> > 14/07/03 16:18:54 INFO CoarseGrainedExecutorBackend: Got assigned
>> >> > task
>> >> > 1438
>> >> > 14/07/03 16:18:54 INFO Executor: Running task ID 1438
>> >> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
>> >> > file:~//2014-05-24-02/part-r-00004:0+2209615
>> >> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
>> >> > file:~//2014-05-24-02/part-r-00004:2209615+2209616
>> >> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
>> >> > file:~//2014-05-24-02/part-r-00011:0+2202240
>> >> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
>> >> > file:~//2014-05-24-02/part-r-00011:2202240+2202240
>> >> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
>> >> > file:~//2014-05-24-02/part-r-00009:0+2194423
>> >> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
>> >> > file:~//2014-05-24-02/part-r-00009:2194423+2194424
>> >> > 14/07/03 16:19:00 INFO Executor: Serialized size of result for 1438
>> >> > is
>> >> > 5958822
>> >> > 14/07/03 16:19:00 INFO Executor: Sending result for 1438 directly to
>> >> > driver
>> >> > 14/07/03 16:19:00 INFO Executor: Finished task ID 1438
>> >> > 14/07/03 16:19:14 ERROR CoarseGrainedExecutorBackend: Driver
>> >> > Disassociated
>> >> > [akka.tcp://sparkExecutor@slave1:51099] ->
>> >> > [akka.tcp://spark@master:58272]
>> >> > disassociated! Shutting down.
>> >> >
>> >> >
>> >> > The corresponding master logs were:
>> >> >
>> >> > 4/07/03 16:02:14 INFO Master: Registering app LogRegExp
>> >> > 14/07/03 16:02:14 INFO Master: Registered app LogRegExp with ID
>> >> > app-20140703160214-0028
>> >> > 14/07/03 16:02:14 INFO Master: Launching executor
>> >> > app-20140703160214-0028/1
>> >> > on worker worker-20140630124441-slave1-40182
>> >> > 14/07/03 16:19:15 INFO Master: Removing executor
>> >> > app-20140703160214-0028/1
>> >> > because it is EXITED
>> >> > 14/07/03 16:19:15 INFO Master: Launching executor
>> >> > app-20140703160214-0028/2
>> >> > on worker worker-20140630124441-slave1-40182
>> >> > 14/07/03 16:19:15 INFO Master: Removing executor
>> >> > app-20140703160214-0028/0
>> >> > because it is EXITED
>> >> > 14/07/03 16:19:15 INFO Master: Launching executor
>> >> > app-20140703160214-0028/3
>> >> > on worker worker-20140630102913-slave2-44735
>> >> > 14/07/03 16:19:18 INFO Master: Removing executor
>> >> > app-20140703160214-0028/2
>> >> > because it is EXITED
>> >> > 14/07/03 16:19:18 INFO Master: Launching executor
>> >> > app-20140703160214-0028/4
>> >> > on worker worker-20140630124441-slave1-40182
>> >> > 14/07/03 16:19:18 INFO Master: Removing executor
>> >> > app-20140703160214-0028/3
>> >> > because it is EXITED
>> >> > 14/07/03 16:19:18 INFO Master: Launching executor
>> >> > app-20140703160214-0028/5
>> >> > on worker worker-20140630102913-slave2-44735
>> >> > 14/07/03 16:19:20 INFO Master: akka.tcp://spark@master:58272 got
>> >> > disassociated, removing it.
>> >> > 14/07/03 16:19:20 INFO Master: Removing app app-20140703160214-0028
>> >> > 14/07/03 16:19:20 INFO Master: akka.tcp://spark@master:58272 got
>> >> > disassociated, removing it.
>> >> >
>> >> >
>> >> > Throughout the execution, I confirmed in the UI that driver memory
>> >> > used was 0.0 B / 6.9 GB and each executor's memory showed 0.0 B /
>> >> > 12.1 GB even when aggregate was being executed. On a related note, I
>> >> > noticed in the executors tab that just before the entire job
>> >> > terminated, executors on slave1, slave2 and the driver "disappeared"
>> >> > momentarily from the active executors list. The replacement
>> >> > executors on slave1 and slave2 were re-spawned a couple of times and
>> >> > appeared on the executors list again before they too died and the
>> >> > job failed.
>> >> > So it appears that no matter what the task input-result size, the
>> >> > execution fails at the end of the stage corresponding to
>> >> > GradientDescent.aggregate (and the preceding count() in
>> >> > GradientDescent goes through fine). Let me know if you need any
>> >> > additional information.
>> >> >
>> >> >
>> >> > On Thu, Jul 3, 2014 at 12:27 PM, Xiangrui Meng <me...@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> Could you check the driver memory in the executor tab of the Spark
>> >> >> UI
>> >> >> when the job is running? If it is too small, please set
>> >> >> --driver-memory with spark-submit, e.g. 10g. Could you also attach
>> >> >> the
>> >> >> master log under spark/logs as well? -Xiangrui
>> >> >>
>> >> >> On Wed, Jul 2, 2014 at 9:34 AM, Bharath Ravi Kumar
>> >> >> <re...@gmail.com>
>> >> >> wrote:
>> >> >> > Hi Xiangrui,
>> >> >> >
>> >> >> > The issue with aggergating/counting over large feature vectors (as
>> >> >> > part
>> >> >> > of
>> >> >> > LogisticRegressionWithSGD) continues to exist, but now in another
>> >> >> > form:
>> >> >> > while the execution doesn't freeze (due to SPARK-1112), it now
>> >> >> > fails
>> >> >> > at
>> >> >> > the
>> >> >> > second or third gradient descent iteration consistently with an
>> >> >> > error
>> >> >> > level
>> >> >> > log message, but no stacktrace. I'm running against 1.0.1-rc1, and
>> >> >> > have
>> >> >> > tried setting spark.akka.frameSize as high as 500. When the
>> >> >> > execution
>> >> >> > fails,
>> >> >> > each of the two executors log the following message (corresponding
>> >> >> > to
>> >> >> > aggregate at GradientDescent.scala:178) :
>> >> >> >
>> >> >> > 14/07/02 14:09:09 INFO
>> >> >> > BlockFetcherIterator$BasicBlockFetcherIterator:
>> >> >> > maxBytesInFlight: 50331648, targetRequestSize: 10066329
>> >> >> > 14/07/02 14:09:09 INFO
>> >> >> > BlockFetcherIterator$BasicBlockFetcherIterator:
>> >> >> > Getting 2 non-empty blocks out of 2 blocks
>> >> >> > 14/07/02 14:09:09 INFO
>> >> >> > BlockFetcherIterator$BasicBlockFetcherIterator:
>> >> >> > Started 1 remote fetches in 0 ms
>> >> >> > 14/07/02 14:09:11 INFO Executor: Serialized size of result for 737
>> >> >> > is
>> >> >> > 5959086
>> >> >> > 14/07/02 14:09:11 INFO Executor: Sending result for 737 directly
>> >> >> > to
>> >> >> > driver
>> >> >> > 14/07/02 14:09:11 INFO Executor: Finished task ID 737
>> >> >> > 14/07/02 14:09:18 ERROR CoarseGrainedExecutorBackend: Driver
>> >> >> > Disassociated
>> >> >> > [akka.tcp://sparkExecutor@(slave1,slave2):51941] ->
>> >> >> > [akka.tcp://spark@master:59487] disassociated! Shutting down.
>> >> >> >
>> >> >> >
>> >> >> > There is no separate stacktrace on the driver side.
>> >> >> >
>> >> >> > Each input record is of the form p1, p2, (p1,p2) where p1, p2 &
>> >> >> > (p1,p2)
>> >> >> > are
>> >> >> > categorical features with large cardinality, and X is the double
>> >> >> > label
>> >> >> > with
>> >> >> > a continuous value. The categorical variables are converted to
>> >> >> > binary
>> >> >> > variables which results in a feature vector of size 741092
>> >> >> > (composed
>> >> >> > of
>> >> >> > all
>> >> >> > unique categories across p1, p2 and (p1,p2)). Thus, the labeled
>> >> >> > point
>> >> >> > for
>> >> >> > input record is a sparse vector of size 741092 with only 3
>> >> >> > variables
>> >> >> > set
>> >> >> > in
>> >> >> > the record. The total number of records is 683233 after
>> >> >> > aggregating
>> >> >> > the
>> >> >> > input data on (p1, p2). When attempting to train on the
>> >> >> > unaggregated
>> >> >> > records
>> >> >> > (1337907 in number spread across 455 files), the execution fails
>> >> >> > at
>> >> >> > count,
>> >> >> > GradientDescent.scala:161 with the following log
>> >> >> >
>> >> >> >
>> >> >> > (Snipped lines corresponding to other input files)
>> >> >> > 14/07/02 16:02:03 INFO HadoopRDD: Input split:
>> >> >> > file:~/part-r-00012:2834590+2834590
>> >> >> > 14/07/02 16:02:03 INFO HadoopRDD: Input split:
>> >> >> > file:~/part-r-00005:0+2845559
>> >> >> > 14/07/02 16:02:03 INFO HadoopRDD: Input split:
>> >> >> > file:~/part-r-00005:2845559+2845560
>> >> >> > 14/07/02 16:02:03 INFO Executor: Serialized size of result for 726
>> >> >> > is
>> >> >> > 615
>> >> >> > 14/07/02 16:02:03 INFO Executor: Sending result for 726 directly
>> >> >> > to
>> >> >> > driver
>> >> >> > 14/07/02 16:02:03 INFO Executor: Finished task ID 726
>> >> >> > 14/07/02 16:02:12 ERROR CoarseGrainedExecutorBackend: Driver
>> >> >> > Disassociated
>> >> >> > [akka.tcp://sparkExecutor@slave1:48423] ->
>> >> >> > [akka.tcp://spark@master:55792]
>> >> >> > disassociated! Shutting down.
>> >> >> >
>> >> >> > A count() attempted on the input RDD before beginning training has
>> >> >> > the
>> >> >> > following metrics:
>> >> >> >
>> >> >> >
>> >> >> > Metric            Min        25th    Median    75th     Max
>> >> >> >
>> >> >> > Result
>> >> >> > serialization
>> >> >> > time            0 ms    0 ms    0 ms    0 ms    0 ms
>> >> >> >
>> >> >> > Duration        33 s    33 s    35 s    35 s    35 s
>> >> >> >
>> >> >> > Time spent
>> >> >> > fetching task
>> >> >> > results            0 ms    0 ms    0 ms    0 ms    0 ms
>> >> >> >
>> >> >> > Scheduler
>> >> >> > delay            0.1 s    0.1 s    0.3 s    0.3 s    0.3 s
>> >> >> >
>> >> >> > Aggregated Metrics by Executor
>> >> >> >
>> >> >> > ID    Address                Task Time    Total    Failed    Succeeded    Shuffle Read    Shuffle Write    Shuf Spill (Mem)    Shuf Spill (Disk)
>> >> >> > 0     CANNOT FIND ADDRESS    34 s         1        0         1            0.0 B           0.0 B            0.0 B               0.0 B
>> >> >> > 1     CANNOT FIND ADDRESS    36 s         1        0         1            0.0 B           0.0 B            0.0 B               0.0 B
>> >> >> >
>> >> >> > Tasks
>> >> >> >
>> >> >> > Task Index    Task ID    Status     Locality Level    Executor    Launch Time            Duration    GC Time    Result Ser Time    Errors
>> >> >> > 0             726        SUCCESS    PROCESS_LOCAL     slave1      2014/07/02 16:01:28    35 s                   0.1 s
>> >> >> > 1             727        SUCCESS    PROCESS_LOCAL     slave2      2014/07/02 16:01:28    33 s                   99 ms
>> >> >> >
>> >> >> > Any pointers / diagnosis please?
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > On Thu, Jun 19, 2014 at 10:03 AM, Bharath Ravi Kumar
>> >> >> > <re...@gmail.com>
>> >> >> > wrote:
>> >> >> >>
>> >> >> >> Thanks. I'll await the fix to re-run my test.
>> >> >> >>
>> >> >> >>
>> >> >> >> On Thu, Jun 19, 2014 at 8:28 AM, Xiangrui Meng <me...@gmail.com>
>> >> >> >> wrote:
>> >> >> >>>
>> >> >> >>> Hi Bharath,
>> >> >> >>>
>> >> >> >>> This is related to SPARK-1112, which we already found the root
>> >> >> >>> cause.
>> >> >> >>> I will let you know when this is fixed.
>> >> >> >>>
>> >> >> >>> Best,
>> >> >> >>> Xiangrui
>> >> >> >>>
>> >> >> >>> On Tue, Jun 17, 2014 at 7:37 PM, Bharath Ravi Kumar
>> >> >> >>> <re...@gmail.com>
>> >> >> >>> wrote:
>> >> >> >>> > Couple more points:
>> >> >> >>> > 1) The inexplicable stalling of execution with large feature
>> >> >> >>> > sets appears similar to that reported with the news-20 dataset:
>> >> >> >>> >
>> >> >> >>> > http://mail-archives.apache.org/mod_mbox/spark-user/201406.mbox/%3C53A03542.1010605@gmail.com%3E
>> >> >> >>> >
>> >> >> >>> > 2) The NPE when trying to call mapToPair to convert an
>> >> >> >>> > RDD<Long, Long, Integer, Integer> into a
>> >> >> >>> > JavaPairRDD<Tuple2<Long,Long>, Tuple2<Integer,Integer>> is
>> >> >> >>> > unrelated to mllib.
>> >> >> >>> >
>> >> >> >>> > Thanks,
>> >> >> >>> > Bharath
>> >> >> >>> >
>> >> >> >>> >
>> >> >> >>> >
>> >> >> >>> > On Wed, Jun 18, 2014 at 7:14 AM, Bharath Ravi Kumar
>> >> >> >>> > <re...@gmail.com>
>> >> >> >>> > wrote:
>> >> >> >>> >>
>> >> >> >>> >> Hi Xiangrui,
>> >> >> >>> >>
>> >> >> >>> >> I'm using 1.0.0.
>> >> >> >>> >>
>> >> >> >>> >> Thanks,
>> >> >> >>> >> Bharath
>> >> >> >>> >>
>> >> >> >>> >> On 18-Jun-2014 1:43 am, "Xiangrui Meng" <me...@gmail.com>
>> >> >> >>> >> wrote:
>> >> >> >>> >>>
>> >> >> >>> >>> Hi Bharath,
>> >> >> >>> >>>
>> >> >> >>> >>> Thanks for posting the details! Which Spark version are you
>> >> >> >>> >>> using?
>> >> >> >>> >>>
>> >> >> >>> >>> Best,
>> >> >> >>> >>> Xiangrui
>> >> >> >>> >>>
>> >> >> >>> >>> On Tue, Jun 17, 2014 at 6:48 AM, Bharath Ravi Kumar
>> >> >> >>> >>> <re...@gmail.com>
>> >> >> >>> >>> wrote:
>> >> >> >>> >>> > Hi,
>> >> >> >>> >>> >
>> >> >> >>> >>> > (Apologies for the long mail, but it's necessary to provide
>> >> >> >>> >>> > sufficient details considering the number of issues faced.)
>> >> >> >>> >>> >
>> >> >> >>> >>> > I'm running into issues testing LogisticRegressionWithSGD on
>> >> >> >>> >>> > a two node cluster (each node with 24 cores and 16G available
>> >> >> >>> >>> > to slaves out of 24G on the system). Here's a description of
>> >> >> >>> >>> > the application:
>> >> >> >>> >>> >
>> >> >> >>> >>> > The model is being trained based on categorical features x,
>> >> >> >>> >>> > y, and (x,y). The categorical features are mapped to binary
>> >> >> >>> >>> > features by converting each distinct value in the category
>> >> >> >>> >>> > enum into a binary feature by itself (i.e. presence of that
>> >> >> >>> >>> > value in a record implies corresponding feature = 1, else
>> >> >> >>> >>> > feature = 0. So, there'd be as many distinct features as enum
>> >> >> >>> >>> > values). The training vector is laid out as
>> >> >> >>> >>> > [x1,x2...xn,y1,y2....yn,(x1,y1),(x2,y2)...(xn,yn)]. Each
>> >> >> >>> >>> > record in the training data has only one combination (Xk,Yk)
>> >> >> >>> >>> > and a label appearing in the record. Thus, the corresponding
>> >> >> >>> >>> > labeledpoint sparse vector would only have 3 values Xk, Yk,
>> >> >> >>> >>> > (Xk,Yk) set for a record. The total length of the vector
>> >> >> >>> >>> > (though sparse) would be nearly 614000. The number of records
>> >> >> >>> >>> > is about 1.33 million. The records have been coalesced into
>> >> >> >>> >>> > 20 partitions across two nodes. The input data has not been
>> >> >> >>> >>> > cached.
>> >> >> >>> >>> > (NOTE: I do realize the records & features may seem large for
>> >> >> >>> >>> > a two node setup, but given the memory & cpu, and the fact
>> >> >> >>> >>> > that I'm willing to give up some turnaround time, I don't see
>> >> >> >>> >>> > why tasks should inexplicably fail)
>> >> >> >>> >>> >
>> >> >> >>> >>> > Additional parameters include:
>> >> >> >>> >>> >
>> >> >> >>> >>> > spark.executor.memory = 14G
>> >> >> >>> >>> > spark.default.parallelism = 1
>> >> >> >>> >>> > spark.cores.max=20
>> >> >> >>> >>> > spark.storage.memoryFraction=0.8 //No cache space required
>> >> >> >>> >>> > (Trying to set spark.akka.frameSize to a larger number, say,
>> >> >> >>> >>> > 20 didn't help either)
>> >> >> >>> >>> >
>> >> >> >>> >>> > The model training was initialized as : new
>> >> >> >>> >>> > LogisticRegressionWithSGD(1, maxIterations, 0.0, 0.05)
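
To make the encoding described above concrete, here is a minimal sketch of
how such a record could be turned into a sparse LabeledPoint. The index
maps are hypothetical stand-ins; only the three active features are
materialized:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical maps from each categorical value (and value pair) to its
// slot in the ~614000-wide feature space laid out above.
def toLabeledPoint(label: Double, x: String, y: String,
                   xIdx: Map[String, Int], yIdx: Map[String, Int],
                   xyIdx: Map[(String, String), Int],
                   size: Int): LabeledPoint = {
  // Exactly three features are set per record: x, y, and the pair (x,y).
  // Vectors.sparse requires the indices in increasing order.
  val indices = Array(xIdx(x), yIdx(y), xyIdx((x, y))).sorted
  LabeledPoint(label, Vectors.sparse(size, indices, Array(1.0, 1.0, 1.0)))
}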
>> >> >> >>> >>> >
>> >> >> >>> >>> > However, after 4 iterations of gradient descent, the entire
>> >> >> >>> >>> > execution appeared to stall inexplicably. The corresponding
>> >> >> >>> >>> > executor details and details of the stalled stage (number 14)
>> >> >> >>> >>> > are as follows:
>> >> >> >>> >>> >
>> >> >> >>> >>> > Metric                       Min      25th     Median   75th     Max
>> >> >> >>> >>> > Result serialization time    12 ms    13 ms    14 ms    16 ms    18 ms
>> >> >> >>> >>> > Duration                     4 s      4 s      5 s      5 s      5 s
>> >> >> >>> >>> > Time spent fetching task     0 ms     0 ms     0 ms     0 ms     0 ms
>> >> >> >>> >>> > results
>> >> >> >>> >>> > Scheduler delay              6 s      6 s      6 s      6 s      12 s
>> >> >> >>> >>> >
>> >> >> >>> >>> >
>> >> >> >>> >>> > Stage Id
>> >> >> >>> >>> > 14 aggregate at GradientDescent.scala:178
>> >> >> >>> >>> >
>> >> >> >>> >>> > Task Index    Task ID    Status     Locality Level    Executor                       Launch Time            Duration    GC Time    Result Ser Time    Errors
>> >> >> >>> >>> > 0             600        RUNNING    PROCESS_LOCAL     serious.dataone.foo.bar.com    2014/06/17 10:32:27    1.1 h
>> >> >> >>> >>> > 1             601        RUNNING    PROCESS_LOCAL     casual.dataone.foo.bar.com     2014/06/17 10:32:27    1.1 h
>> >> >> >>> >>> > 2             602        RUNNING    PROCESS_LOCAL     serious.dataone.foo.bar.com    2014/06/17 10:32:27    1.1 h
>> >> >> >>> >>> > 3             603        RUNNING    PROCESS_LOCAL     casual.dataone.foo.bar.com     2014/06/17 10:32:27    1.1 h
>> >> >> >>> >>> > 4             604        RUNNING    PROCESS_LOCAL     serious.dataone.foo.bar.com    2014/06/17 10:32:27    1.1 h
>> >> >> >>> >>> > 5             605        SUCCESS    PROCESS_LOCAL     casual.dataone.foo.bar.com     2014/06/17 10:32:27    4 s         2 s        12 ms
>> >> >> >>> >>> > 6             606        SUCCESS    PROCESS_LOCAL     serious.dataone.foo.bar.com    2014/06/17 10:32:27    4 s         1 s        14 ms
>> >> >> >>> >>> > 7             607        SUCCESS    PROCESS_LOCAL     casual.dataone.foo.bar.com     2014/06/17 10:32:27    4 s         2 s        12 ms
>> >> >> >>> >>> > 8             608        SUCCESS    PROCESS_LOCAL     serious.dataone.foo.bar.com    2014/06/17 10:32:27    5 s         1 s        15 ms
>> >> >> >>> >>> > 9             609        SUCCESS    PROCESS_LOCAL     casual.dataone.foo.bar.com     2014/06/17 10:32:27    5 s         1 s        14 ms
>> >> >> >>> >>> > 10            610        SUCCESS    PROCESS_LOCAL     serious.dataone.foo.bar.com    2014/06/17 10:32:27    5 s         1 s        15 ms
>> >> >> >>> >>> > 11            611        SUCCESS    PROCESS_LOCAL     casual.dataone.foo.bar.com     2014/06/17 10:32:27    4 s         1 s        13 ms
>> >> >> >>> >>> > 12            612        SUCCESS    PROCESS_LOCAL     serious.dataone.foo.bar.com    2014/06/17 10:32:27    5 s         1 s        18 ms
>> >> >> >>> >>> > 13            613        SUCCESS    PROCESS_LOCAL     casual.dataone.foo.bar.com     2014/06/17 10:32:27    5 s         1 s        13 ms
>> >> >> >>> >>> > 14            614        SUCCESS    PROCESS_LOCAL     serious.dataone.foo.bar.com    2014/06/17 10:32:27    4 s         1 s        14 ms
>> >> >> >>> >>> > 15            615        SUCCESS    PROCESS_LOCAL     casual.dataone.foo.bar.com     2014/06/17 10:32:27    4 s         1 s        12 ms
>> >> >> >>> >>> > 16            616        SUCCESS    PROCESS_LOCAL     serious.dataone.foo.bar.com    2014/06/17 10:32:27    5 s         1 s        15 ms
>> >> >> >>> >>> > 17            617        SUCCESS    PROCESS_LOCAL     casual.dataone.foo.bar.com     2014/06/17 10:32:27    5 s         1 s        18 ms
>> >> >> >>> >>> > 18            618        SUCCESS    PROCESS_LOCAL     serious.dataone.foo.bar.com    2014/06/17 10:32:27    5 s         1 s        16 ms
>> >> >> >>> >>> > 19            619        SUCCESS    PROCESS_LOCAL     casual.dataone.foo.bar.com     2014/06/17 10:32:27    4 s         1 s        18 ms
>> >> >> >>> >>> >
>> >> >> >>> >>> > Executor stats:
>> >> >> >>> >>> >
>> >> >> >>> >>> > RDD Blocks    Memory Used       Disk Used    Active Tasks    Failed Tasks    Complete Tasks    Total Tasks    Task Time    Shuffle Read    Shuffle Write
>> >> >> >>> >>> > 0             0.0 B / 6.7 GB    0.0 B        2               0               307               309            23.2 m       0.0 B           0.0 B
>> >> >> >>> >>> > 0             0.0 B / 6.7 GB    0.0 B        3               0               308               311            22.4 m       0.0 B           0.0 B
>> >> >> >>> >>> >
>> >> >> >>> >>> >
>> >> >> >>> >>> > Executor jmap output:
>> >> >> >>> >>> >
>> >> >> >>> >>> > Server compiler detected.
>> >> >> >>> >>> > JVM version is 24.55-b03
>> >> >> >>> >>> >
>> >> >> >>> >>> > using thread-local object allocation.
>> >> >> >>> >>> > Parallel GC with 18 thread(s)
>> >> >> >>> >>> >
>> >> >> >>> >>> > Heap Configuration:
>> >> >> >>> >>> >    MinHeapFreeRatio = 40
>> >> >> >>> >>> >    MaxHeapFreeRatio = 70
>> >> >> >>> >>> >    MaxHeapSize      = 10737418240 (10240.0MB)
>> >> >> >>> >>> >    NewSize          = 1310720 (1.25MB)
>> >> >> >>> >>> >    MaxNewSize       = 17592186044415 MB
>> >> >> >>> >>> >    OldSize          = 5439488 (5.1875MB)
>> >> >> >>> >>> >    NewRatio         = 2
>> >> >> >>> >>> >    SurvivorRatio    = 8
>> >> >> >>> >>> >    PermSize         = 21757952 (20.75MB)
>> >> >> >>> >>> >    MaxPermSize      = 134217728 (128.0MB)
>> >> >> >>> >>> >    G1HeapRegionSize = 0 (0.0MB)
>> >> >> >>> >>> >
>> >> >> >>> >>> > Heap Usage:
>> >> >> >>> >>> > PS Young Generation
>> >> >> >>> >>> > Eden Space:
>> >> >> >>> >>> >    capacity = 2783969280 (2655.0MB)
>> >> >> >>> >>> >    used     = 192583816 (183.66223907470703MB)
>> >> >> >>> >>> >    free     = 2591385464 (2471.337760925293MB)
>> >> >> >>> >>> >    6.917598458557704% used
>> >> >> >>> >>> > From Space:
>> >> >> >>> >>> >    capacity = 409993216 (391.0MB)
>> >> >> >>> >>> >    used     = 1179808 (1.125152587890625MB)
>> >> >> >>> >>> >    free     = 408813408 (389.8748474121094MB)
>> >> >> >>> >>> >    0.2877628102022059% used
>> >> >> >>> >>> > To Space:
>> >> >> >>> >>> >    capacity = 385351680 (367.5MB)
>> >> >> >>> >>> >    used     = 0 (0.0MB)
>> >> >> >>> >>> >    free     = 385351680 (367.5MB)
>> >> >> >>> >>> >    0.0% used
>> >> >> >>> >>> > PS Old Generation
>> >> >> >>> >>> >    capacity = 7158628352 (6827.0MB)
>> >> >> >>> >>> >    used     = 4455093024 (4248.707794189453MB)
>> >> >> >>> >>> >    free     = 2703535328 (2578.292205810547MB)
>> >> >> >>> >>> >    62.2338918146983% used
>> >> >> >>> >>> > PS Perm Generation
>> >> >> >>> >>> >    capacity = 90701824 (86.5MB)
>> >> >> >>> >>> >    used     = 45348832 (43.248016357421875MB)
>> >> >> >>> >>> >    free     = 45352992 (43.251983642578125MB)
>> >> >> >>> >>> >    49.99770677158598% used
>> >> >> >>> >>> >
>> >> >> >>> >>> > 8432 interned Strings occupying 714672 bytes.
>> >> >> >>> >>> >
>> >> >> >>> >>> >
>> >> >> >>> >>> > Executor GC log snippet:
>> >> >> >>> >>> >
>> >> >> >>> >>> > 168.778: [GC [PSYoungGen: 2702831K->578545K(2916864K)]
>> >> >> >>> >>> > 9302453K->7460857K(9907712K), 0.3193550 secs]
>> >> >> >>> >>> > [Times: user=5.13 sys=0.39, real=0.32 secs]
>> >> >> >>> >>> > 169.097: [Full GC [PSYoungGen: 578545K->0K(2916864K)]
>> >> >> >>> >>> > [ParOldGen: 6882312K->1073297K(6990848K)] 7460857K->1073297K(9907712K)
>> >> >> >>> >>> > [PSPermGen: 44248K->44201K(88576K)], 4.5521090 secs]
>> >> >> >>> >>> > [Times: user=24.22 sys=0.18, real=4.55 secs]
>> >> >> >>> >>> > 174.207: [GC [PSYoungGen: 2338304K->81315K(2544128K)]
>> >> >> >>> >>> > 3411653K->1154665K(9534976K), 0.0966280 secs]
>> >> >> >>> >>> > [Times: user=1.66 sys=0.00, real=0.09 secs]
>> >> >> >>> >>> >
>> >> >> >>> >>> > I tried to map partitions to cores on the nodes. Increasing
>> >> >> >>> >>> > the number of partitions (say to 80 or 100) would result in
>> >> >> >>> >>> > progress till the 6th iteration or so, but the next stage
>> >> >> >>> >>> > would stall as before with no apparent root cause / logs.
>> >> >> >>> >>> > With increased partitions, the last stage that completed had
>> >> >> >>> >>> > the following task times:
>> >> >> >>> >>> >
>> >> >> >>> >>> > Metric                       Min      25th     Median   75th     Max
>> >> >> >>> >>> > Result serialization time    11 ms    12 ms    13 ms    15 ms    0.4 s
>> >> >> >>> >>> > Duration                     0.5 s    0.9 s    1 s      3 s      7 s
>> >> >> >>> >>> > Time spent fetching          0 ms     0 ms     0 ms     0 ms     0 ms
>> >> >> >>> >>> > task results
>> >> >> >>> >>> > Scheduler delay              5 s      6 s      6 s      7 s      12 s
>> >> >> >>> >>> >
>> >> >> >>> >>> > My hypothesis is that as the coefficient array becomes less
>> >> >> >>> >>> > sparse (with successive iterations), the cost of the
>> >> >> >>> >>> > aggregate goes up to the point that it stalls (which I failed
>> >> >> >>> >>> > to explain). Reducing the batch fraction to a very low number
>> >> >> >>> >>> > like 0.01 saw the iterations progress further, but the model
>> >> >> >>> >>> > failed to converge in that case after a small number of
>> >> >> >>> >>> > iterations.
>> >> >> >>> >>> >
>> >> >> >>> >>> >
>> >> >> >>> >>> > I also tried reducing the number of records by aggregating
>> >> >> >>> >>> > on (x,y) as the key (i.e. using aggregations instead of
>> >> >> >>> >>> > training on every raw record), but encountered the following
>> >> >> >>> >>> > exception:
>> >> >> >>> >>> >
>> >> >> >>> >>> > Loss was due to java.lang.NullPointerException
>> >> >> >>> >>> > java.lang.NullPointerException
>> >> >> >>> >>> >         at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
>> >> >> >>> >>> >         at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
>> >> >> >>> >>> >         at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> >> >> >>> >>> >         at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:59)
>> >> >> >>> >>> >         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
>> >> >> >>> >>> >         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
>> >> >> >>> >>> >         at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
>> >> >> >>> >>> >         at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
>> >> >> >>> >>> >         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> >> >> >>> >>> >         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>> >> >> >>> >>> >         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>> >> >> >>> >>> >         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
>> >> >> >>> >>> >         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>> >> >> >>> >>> >         at org.apache.spark.scheduler.Task.run(Task.scala:51)
>> >> >> >>> >>> >         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>> >> >> >>> >>> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >> >> >>> >>> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >> >> >>> >>> >         at java.lang.Thread.run(Thread.java:745)
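
One hedged reading of this trace: the NPE surfaces in the wrapper that
adapts a Java PairFunction (pairFunToScalaFun) while combineValuesByKey is
building combiners, which is what one would expect if the PairFunction
returned a null Tuple2 for some record; that is a guess, not something
confirmed in the thread. For reference, the aggregation being attempted
could be sketched like this (all names are hypothetical):

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Aggregate raw (x, y, label) records by the (x, y) key before training,
// keeping a label sum and a record count per key.
def aggregateByKey(raw: RDD[(String, String, Double)]): RDD[((String, String), (Double, Long))] =
  raw.map { case (x, y, label) => ((x, y), (label, 1L)) }
     .reduceByKey { case ((l1, n1), (l2, n2)) => (l1 + l2, n1 + n2) }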
>> >> >> >>> >>> >
>> >> >> >>> >>> >
>> >> >> >>> >>> > I'd appreciate any insights/comments about what may be
>> >> >> >>> >>> > causing the execution to stall.
>> >> >> >>> >>> >
>> >> >> >>> >>> > If logs/tables appear poorly indented in the email, here's a
>> >> >> >>> >>> > gist with relevant details:
>> >> >> >>> >>> > https://gist.github.com/reachbach/a418ab2f01b639b624c1
>> >> >> >>> >>> >
>> >> >> >>> >>> > Thanks,
>> >> >> >>> >>> > Bharath
>> >> >> >>> >
>> >> >> >>> >
>> >> >> >>
>> >> >> >>
>> >> >> >
>> >> >
>> >> >
>> >
>> >
>
>

Re: Execution stalls in LogisticRegressionWithSGD

Posted by Bharath Ravi Kumar <re...@gmail.com>.
Hi Xiangrui,

1) Yes, I used the same build (compiled locally from source) to the host
that has (master, slave1) and the second host with slave2.

2) The execution was successful when run in local mode with a reduced number
of partitions. Does this imply issues communicating/coordinating across
processes (i.e. driver, master and workers)?

Thanks,
Bharath



On Sun, Jul 6, 2014 at 11:37 AM, Xiangrui Meng <me...@gmail.com> wrote:

> Hi Bharath,
>
> 1) Did you sync the spark jar and conf to the worker nodes after build?
> 2) Since the dataset is not large, could you try local mode first
> using `spark-submit --driver-memory 12g --master local[*]`?
> 3) Try to use a smaller number of partitions, say 5.
>
> If the problem is still there, please attach the full master/worker log
> files.
>
> Best,
> Xiangrui
>
> On Fri, Jul 4, 2014 at 12:16 AM, Bharath Ravi Kumar <re...@gmail.com>
> wrote:
> > Xiangrui,
> >
> > Leaving the frameSize unspecified led to an error message (and failure)
> > stating that the task size (~11M) was larger than the frame size. I hence
> > set it to an arbitrarily large value (I realize 500 was unrealistic &
> > unnecessary in this case). I've now set the size to 20M and repeated the
> > runs. The earlier runs were on an uncached RDD. Caching the RDD (and
> > setting spark.storage.memoryFraction=0.5) resulted in a marginal speed up
> > of execution, but the end result remained the same. The cached RDD size
> > is as follows:
> >
> > RDD Name    Storage Level                        Cached Partitions    Fraction Cached    Size in Memory    Size in Tachyon    Size on Disk
> > 1084        Memory Deserialized 1x Replicated    80                   100%               165.9 MB          0.0 B              0.0 B
> >
> >
> >
> > The corresponding master logs were:
> >
> > 14/07/04 06:29:34 INFO Master: Removing executor
> app-20140704062238-0033/1
> > because it is EXITED
> > 14/07/04 06:29:34 INFO Master: Launching executor
> app-20140704062238-0033/2
> > on worker worker-20140630124441-slave1-40182
> > 14/07/04 06:29:34 INFO Master: Removing executor
> app-20140704062238-0033/0
> > because it is EXITED
> > 14/07/04 06:29:34 INFO Master: Launching executor
> app-20140704062238-0033/3
> > on worker worker-20140630102913-slave2-44735
> > 14/07/04 06:29:37 INFO Master: Removing executor
> app-20140704062238-0033/2
> > because it is EXITED
> > 14/07/04 06:29:37 INFO Master: Launching executor
> app-20140704062238-0033/4
> > on worker worker-20140630124441-slave1-40182
> > 14/07/04 06:29:37 INFO Master: Removing executor
> <snipped: deeply nested quoted copies of the earlier messages in this
> thread; the same content appears in full in the next message below>
> >> >> >>> > On Wed, Jun 18, 2014 at 7:14 AM, Bharath Ravi Kumar
> >> >> >>> > <re...@gmail.com>
> >> >> >>> > wrote:
> >> >> >>> >>
> >> >> >>> >> Hi Xiangrui,
> >> >> >>> >>
> >> >> >>> >> I'm using 1.0.0.
> >> >> >>> >>
> >> >> >>> >> Thanks,
> >> >> >>> >> Bharath
> >> >> >>> >>
> >> >> >>> >> On 18-Jun-2014 1:43 am, "Xiangrui Meng" <me...@gmail.com>
> >> >> >>> >> wrote:
> >> >> >>> >>>
> >> >> >>> >>> Hi Bharath,
> >> >> >>> >>>
> >> >> >>> >>> Thanks for posting the details! Which Spark version are you
> >> >> >>> >>> using?
> >> >> >>> >>>
> >> >> >>> >>> Best,
> >> >> >>> >>> Xiangrui
> >> >> >>> >>>
> >> >> >>> >>> On Tue, Jun 17, 2014 at 6:48 AM, Bharath Ravi Kumar
> >> >> >>> >>> <re...@gmail.com>
> >> >> >>> >>> wrote:
> >> >> >>> >>> > Hi,
> >> >> >>> >>> >
> >> >> >>> >>> > (Apologies for the long mail, but it's necessary to provide
> >> >> >>> >>> > sufficient details considering the number of issues faced.)
> >> >> >>> >>> >
> >> >> >>> >>> > I'm running into issues testing LogisticRegressionWithSGD on a
> >> >> >>> >>> > two node cluster (each node with 24 cores and 16G available to
> >> >> >>> >>> > slaves out of 24G on the system). Here's a description of the
> >> >> >>> >>> > application:
> >> >> >>> >>> >
> >> >> >>> >>> > The model is being trained based on categorical features x, y,
> >> >> >>> >>> > and (x,y). The categorical features are mapped to binary
> >> >> >>> >>> > features by converting each distinct value in the category enum
> >> >> >>> >>> > into a binary feature by itself (i.e. presence of that value in
> >> >> >>> >>> > a record implies the corresponding feature = 1, else feature =
> >> >> >>> >>> > 0; so there'd be as many distinct features as enum values). The
> >> >> >>> >>> > training vector is laid out as
> >> >> >>> >>> > [x1,x2...xn,y1,y2....yn,(x1,y1),(x2,y2)...(xn,yn)]. Each record
> >> >> >>> >>> > in the training data has only one combination (Xk,Yk) and a
> >> >> >>> >>> > label appearing in the record. Thus, the corresponding labeled
> >> >> >>> >>> > point is a sparse vector with only the 3 values Xk, Yk, (Xk,Yk)
> >> >> >>> >>> > set for a record. The total length of the vector (though
> >> >> >>> >>> > sparse) would be nearly 614000. The number of records is about
> >> >> >>> >>> > 1.33 million. The records have been coalesced into 20
> >> >> >>> >>> > partitions across the two nodes. The input data has not been
> >> >> >>> >>> > cached. (A sketch of this encoding follows the note below.)
> >> >> >>> >>> > (NOTE: I do realize the records & features may seem large for a
> >> >> >>> >>> > two node setup, but given the memory & cpu, and the fact that
> >> >> >>> >>> > I'm willing to give up some turnaround time, I don't see why
> >> >> >>> >>> > tasks should inexplicably fail.)
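> >> >> >>> >>> >
> >> >> >>> >>> > A minimal sketch of the encoding just described (the encode
> >> >> >>> >>> > helper and the index map are illustrative, not the actual
> >> >> >>> >>> > application code):
> >> >> >>> >>> >
> >> >> >>> >>> > import org.apache.spark.mllib.linalg.Vectors
> >> >> >>> >>> > import org.apache.spark.mllib.regression.LabeledPoint
> >> >> >>> >>> >
> >> >> >>> >>> > // indexOf is assumed to assign every distinct x value, y value
> >> >> >>> >>> > // and (x,y) pair a unique slot in [0, dim); dim ~ 614000 here.
> >> >> >>> >>> > def encode(label: Double, x: String, y: String,
> >> >> >>> >>> >            indexOf: Map[String, Int], dim: Int): LabeledPoint = {
> >> >> >>> >>> >   val idx = Array(indexOf("x:" + x), indexOf("y:" + y),
> >> >> >>> >>> >                   indexOf("xy:" + x + "," + y)).sorted
> >> >> >>> >>> >   // exactly 3 non-zero entries per record, as noted above
> >> >> >>> >>> >   LabeledPoint(label,
> >> >> >>> >>> >     Vectors.sparse(dim, idx, Array(1.0, 1.0, 1.0)))
> >> >> >>> >>> > }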
> >> >> >>> >>> >
> >> >> >>> >>> > Additional parameters include:
> >> >> >>> >>> >
> >> >> >>> >>> > spark.executor.memory = 14G
> >> >> >>> >>> > spark.default.parallelism = 1
> >> >> >>> >>> > spark.cores.max = 20
> >> >> >>> >>> > spark.storage.memoryFraction = 0.8 // No cache space required
> >> >> >>> >>> > (Trying to set spark.akka.frameSize to a larger number, say 20,
> >> >> >>> >>> > didn't help either.)
> >> >> >>> >>> >
> >> >> >>> >>> > The model training was initialized as:
> >> >> >>> >>> > new LogisticRegressionWithSGD(1, maxIterations, 0.0, 0.05)
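> >> >> >>> >>> >
> >> >> >>> >>> > Equivalently, via the companion object's train helper (a
> >> >> >>> >>> > sketch, assuming the Scala API; the four constructor arguments
> >> >> >>> >>> > map to stepSize = 1.0, numIterations, regParam = 0.0 and
> >> >> >>> >>> > miniBatchFraction = 0.05):
> >> >> >>> >>> >
> >> >> >>> >>> > import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
> >> >> >>> >>> > // train(input, numIterations, stepSize, miniBatchFraction);
> >> >> >>> >>> > // regParam keeps its 0.0 default in this overload
> >> >> >>> >>> > val model = LogisticRegressionWithSGD.train(
> >> >> >>> >>> >   labeledPoints, maxIterations, 1.0, 0.05)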
> >> >> >>> >>> >
> >> >> >>> >>> > However, after 4 iterations of gradient descent, the entire
> >> >> >>> >>> > execution appeared to stall inexplicably. The corresponding
> >> >> >>> >>> > executor details and details of the stalled stage (number 14)
> >> >> >>> >>> > are as follows:
> >> >> >>> >>> >
> >> >> >>> >>> > Metric                            Min    25th   Median  75th   Max
> >> >> >>> >>> > Result serialization time         12 ms  13 ms  14 ms   16 ms  18 ms
> >> >> >>> >>> > Duration                          4 s    4 s    5 s     5 s    5 s
> >> >> >>> >>> > Time spent fetching task results  0 ms   0 ms   0 ms    0 ms   0 ms
> >> >> >>> >>> > Scheduler delay                   6 s    6 s    6 s     6 s    12 s
> >> >> >>> >>> >
> >> >> >>> >>> >
> >> >> >>> >>> > Stage Id: 14 (aggregate at GradientDescent.scala:178)
> >> >> >>> >>> >
> >> >> >>> >>> > Task Index  Task ID  Status   Locality Level  Executor                     Launch Time          Duration  GC Time  Result Ser Time  Errors
> >> >> >>> >>> > 0           600      RUNNING  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  1.1 h
> >> >> >>> >>> > 1           601      RUNNING  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  1.1 h
> >> >> >>> >>> > 2           602      RUNNING  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  1.1 h
> >> >> >>> >>> > 3           603      RUNNING  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  1.1 h
> >> >> >>> >>> > 4           604      RUNNING  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  1.1 h
> >> >> >>> >>> > 5           605      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  4 s       2 s      12 ms
> >> >> >>> >>> > 6           606      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  4 s       1 s      14 ms
> >> >> >>> >>> > 7           607      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  4 s       2 s      12 ms
> >> >> >>> >>> > 8           608      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  5 s       1 s      15 ms
> >> >> >>> >>> > 9           609      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  5 s       1 s      14 ms
> >> >> >>> >>> > 10          610      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  5 s       1 s      15 ms
> >> >> >>> >>> > 11          611      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  4 s       1 s      13 ms
> >> >> >>> >>> > 12          612      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  5 s       1 s      18 ms
> >> >> >>> >>> > 13          613      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  5 s       1 s      13 ms
> >> >> >>> >>> > 14          614      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  4 s       1 s      14 ms
> >> >> >>> >>> > 15          615      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  4 s       1 s      12 ms
> >> >> >>> >>> > 16          616      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  5 s       1 s      15 ms
> >> >> >>> >>> > 17          617      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  5 s       1 s      18 ms
> >> >> >>> >>> > 18          618      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  5 s       1 s      16 ms
> >> >> >>> >>> > 19          619      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  4 s       1 s      18 ms
> >> >> >>> >>> >
> >> >> >>> >>> > Executor stats:
> >> >> >>> >>> >
> >> >> >>> >>> > RDD Blocks  Memory Used     Disk Used  Active Tasks  Failed Tasks  Complete Tasks  Total Tasks  Task Time  Shuffle Read  Shuffle Write
> >> >> >>> >>> > 0           0.0 B / 6.7 GB  0.0 B      2             0             307             309          23.2 m     0.0 B         0.0 B
> >> >> >>> >>> > 0           0.0 B / 6.7 GB  0.0 B      3             0             308             311          22.4 m     0.0 B         0.0 B
> >> >> >>> >>> >
> >> >> >>> >>> >
> >> >> >>> >>> > Executor jmap output:
> >> >> >>> >>> >
> >> >> >>> >>> > Server compiler detected.
> >> >> >>> >>> > JVM version is 24.55-b03
> >> >> >>> >>> >
> >> >> >>> >>> > using thread-local object allocation.
> >> >> >>> >>> > Parallel GC with 18 thread(s)
> >> >> >>> >>> >
> >> >> >>> >>> > Heap Configuration:
> >> >> >>> >>> >    MinHeapFreeRatio = 40
> >> >> >>> >>> >    MaxHeapFreeRatio = 70
> >> >> >>> >>> >    MaxHeapSize      = 10737418240 (10240.0MB)
> >> >> >>> >>> >    NewSize          = 1310720 (1.25MB)
> >> >> >>> >>> >    MaxNewSize       = 17592186044415 MB
> >> >> >>> >>> >    OldSize          = 5439488 (5.1875MB)
> >> >> >>> >>> >    NewRatio         = 2
> >> >> >>> >>> >    SurvivorRatio    = 8
> >> >> >>> >>> >    PermSize         = 21757952 (20.75MB)
> >> >> >>> >>> >    MaxPermSize      = 134217728 (128.0MB)
> >> >> >>> >>> >    G1HeapRegionSize = 0 (0.0MB)
> >> >> >>> >>> >
> >> >> >>> >>> > Heap Usage:
> >> >> >>> >>> > PS Young Generation
> >> >> >>> >>> > Eden Space:
> >> >> >>> >>> >    capacity = 2783969280 (2655.0MB)
> >> >> >>> >>> >    used     = 192583816 (183.66223907470703MB)
> >> >> >>> >>> >    free     = 2591385464 (2471.337760925293MB)
> >> >> >>> >>> >    6.917598458557704% used
> >> >> >>> >>> > From Space:
> >> >> >>> >>> >    capacity = 409993216 (391.0MB)
> >> >> >>> >>> >    used     = 1179808 (1.125152587890625MB)
> >> >> >>> >>> >    free     = 408813408 (389.8748474121094MB)
> >> >> >>> >>> >    0.2877628102022059% used
> >> >> >>> >>> > To Space:
> >> >> >>> >>> >    capacity = 385351680 (367.5MB)
> >> >> >>> >>> >    used     = 0 (0.0MB)
> >> >> >>> >>> >    free     = 385351680 (367.5MB)
> >> >> >>> >>> >    0.0% used
> >> >> >>> >>> > PS Old Generation
> >> >> >>> >>> >    capacity = 7158628352 (6827.0MB)
> >> >> >>> >>> >    used     = 4455093024 (4248.707794189453MB)
> >> >> >>> >>> >    free     = 2703535328 (2578.292205810547MB)
> >> >> >>> >>> >    62.2338918146983% used
> >> >> >>> >>> > PS Perm Generation
> >> >> >>> >>> >    capacity = 90701824 (86.5MB)
> >> >> >>> >>> >    used     = 45348832 (43.248016357421875MB)
> >> >> >>> >>> >    free     = 45352992 (43.251983642578125MB)
> >> >> >>> >>> >    49.99770677158598% used
> >> >> >>> >>> >
> >> >> >>> >>> > 8432 interned Strings occupying 714672 bytes.
> >> >> >>> >>> >
> >> >> >>> >>> >
> >> >> >>> >>> > Executor GC log snippet:
> >> >> >>> >>> >
> >> >> >>> >>> > 168.778: [GC [PSYoungGen: 2702831K->578545K(2916864K)]
> >> >> >>> >>> > 9302453K->7460857K(9907712K), 0.3193550 secs]
> >> >> >>> >>> > [Times: user=5.13 sys=0.39, real=0.32 secs]
> >> >> >>> >>> > 169.097: [Full GC [PSYoungGen: 578545K->0K(2916864K)]
> >> >> >>> >>> > [ParOldGen: 6882312K->1073297K(6990848K)]
> >> >> >>> >>> > 7460857K->1073297K(9907712K) [PSPermGen: 44248K->44201K(88576K)],
> >> >> >>> >>> > 4.5521090 secs] [Times: user=24.22 sys=0.18, real=4.55 secs]
> >> >> >>> >>> > 174.207: [GC [PSYoungGen: 2338304K->81315K(2544128K)]
> >> >> >>> >>> > 3411653K->1154665K(9534976K), 0.0966280 secs]
> >> >> >>> >>> > [Times: user=1.66 sys=0.00, real=0.09 secs]
> >> >> >>> >>> >
> >> >> >>> >>> > I tried to map partitions to cores on the nodes. Increasing
> >> >> >>> >>> > the number of partitions (say to 80 or 100) would result in
> >> >> >>> >>> > progress till the 6th iteration or so, but the next stage would
> >> >> >>> >>> > stall as before, without an apparent root cause in the logs.
> >> >> >>> >>> > With increased partitions, the last stage that completed had
> >> >> >>> >>> > the following task times:
> >> >> >>> >>> >
> >> >> >>> >>> > Metric                            Min    25th   Median  75th   Max
> >> >> >>> >>> > Result serialization time         11 ms  12 ms  13 ms   15 ms  0.4 s
> >> >> >>> >>> > Duration                          0.5 s  0.9 s  1 s     3 s    7 s
> >> >> >>> >>> > Time spent fetching task results  0 ms   0 ms   0 ms    0 ms   0 ms
> >> >> >>> >>> > Scheduler delay                   5 s    6 s    6 s     7 s    12 s
> >> >> >>> >>> >
> >> >> >>> >>> > My hypothesis is that as the coefficient array becomes less
> >> >> >>> >>> > sparse (with successive iterations), the cost of the aggregate
> >> >> >>> >>> > goes up to the point that it stalls, though I've failed to
> >> >> >>> >>> > explain why. Reducing the batch fraction to a very low number
> >> >> >>> >>> > like 0.01 (as sketched below) saw the iterations progress
> >> >> >>> >>> > further, but the model then failed to converge after a small
> >> >> >>> >>> > number of iterations.
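> >> >> >>> >>> >
> >> >> >>> >>> > A sketch of lowering the batch fraction (assuming the Scala
> >> >> >>> >>> > API: the no-arg constructor plus the optimizer's setters):
> >> >> >>> >>> >
> >> >> >>> >>> > val lr = new LogisticRegressionWithSGD()
> >> >> >>> >>> > lr.optimizer
> >> >> >>> >>> >   .setStepSize(1.0)
> >> >> >>> >>> >   .setNumIterations(maxIterations)
> >> >> >>> >>> >   .setMiniBatchFraction(0.01) // progresses further, but
> >> >> >>> >>> >                               // convergence suffers
> >> >> >>> >>> > val model = lr.run(labeledPoints)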
> >> >> >>> >>> >
> >> >> >>> >>> >
> >> >> >>> >>> > I also tried reducing the number of records by aggregating on
> >> >> >>> >>> > (x,y) as the key (i.e. training on aggregates instead of every
> >> >> >>> >>> > raw record; a sketch follows the trace below), but encountered
> >> >> >>> >>> > the following exception:
> >> >> >>> >>> >
> >> >> >>> >>> > Loss was due to java.lang.NullPointerException
> >> >> >>> >>> > java.lang.NullPointerException
> >> >> >>> >>> >         at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
> >> >> >>> >>> >         at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
> >> >> >>> >>> >         at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> >> >> >>> >>> >         at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:59)
> >> >> >>> >>> >         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
> >> >> >>> >>> >         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
> >> >> >>> >>> >         at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
> >> >> >>> >>> >         at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
> >> >> >>> >>> >         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> >> >> >>> >>> >         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> >> >> >>> >>> >         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> >> >> >>> >>> >         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
> >> >> >>> >>> >         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> >> >> >>> >>> >         at org.apache.spark.scheduler.Task.run(Task.scala:51)
> >> >> >>> >>> >         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
> >> >> >>> >>> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >> >> >>> >>> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >> >> >>> >>> >         at java.lang.Thread.run(Thread.java:745)
> >> >> >>> >>> >
> >> >> >>> >>> >
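> >> >> >>> >>> > For reference, the aggregation attempted was conceptually the
> >> >> >>> >>> > following (a Scala sketch of the intent; the actual code used
> >> >> >>> >>> > the Java mapToPair API, and the record fields are
> >> >> >>> >>> > illustrative):
> >> >> >>> >>> >
> >> >> >>> >>> > import org.apache.spark.SparkContext._  // pair RDD functions
> >> >> >>> >>> >
> >> >> >>> >>> > case class Raw(x: Long, y: Long, c1: Int, c2: Int)
> >> >> >>> >>> > // collapse raw records sharing the same (x,y) key
> >> >> >>> >>> > val aggregated = rawRecords
> >> >> >>> >>> >   .map(r => ((r.x, r.y), (r.c1, r.c2)))
> >> >> >>> >>> >   .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
> >> >> >>> >>> >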
> >> >> >>> >>> > I'd appreciate any insights/comments about what may be causing
> >> >> >>> >>> > the execution to stall.
> >> >> >>> >>> >
> >> >> >>> >>> > If logs/tables appear poorly indented in the email, here's a
> >> >> >>> >>> > gist with the relevant details:
> >> >> >>> >>> > https://gist.github.com/reachbach/a418ab2f01b639b624c1
> >> >> >>> >>> >
> >> >> >>> >>> > Thanks,
> >> >> >>> >>> > Bharath
> >> >> >>> >
> >> >> >>> >
> >> >> >>
> >> >> >>
> >> >> >
> >> >
> >> >
> >
> >
>

Re: Execution stalls in LogisticRegressionWithSGD

Posted by Xiangrui Meng <me...@gmail.com>.
Hi Bharath,

1) Did you sync the Spark jar and conf to the worker nodes after the build?
2) Since the dataset is not large, could you try local mode first,
using `spark-submit --driver-memory 12g --master local[*]`?
3) Try using fewer partitions, say 5 (roughly the sketch below).
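
A rough sketch of 3) (the variable names are illustrative, not from
your application):

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // collapse the input into a few partitions, cache, then train
    def trainWithFewPartitions(points: RDD[LabeledPoint], iters: Int) = {
      val few = points.repartition(5).cache()
      LogisticRegressionWithSGD.train(few, iters)
    }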

If the problem is still there, please attach the full master/worker log files.

Best,
Xiangrui

On Fri, Jul 4, 2014 at 12:16 AM, Bharath Ravi Kumar <re...@gmail.com> wrote:
> Xiangrui,
>
> Leaving the frameSize unspecified led to an error message (and failure)
> stating that the task size (~11M) was larger than the frame size. I hence
> set it to an arbitrarily large value (I realize 500 was unrealistic &
> unnecessary in this case). I've now set the size to 20M and repeated the
> runs. The earlier runs were on an uncached RDD. Caching the RDD (and
> setting spark.storage.memoryFraction=0.5) resulted in a marginal speed-up
> of execution, but the end result remained the same. The cached RDD size is
> as follows:
>
> RDD Name  Storage Level                      Cached Partitions  Fraction Cached  Size in Memory  Size in Tachyon  Size on Disk
> 1084      Memory Deserialized 1x Replicated  80                 100%             165.9 MB        0.0 B            0.0 B
>
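> For reference, the conf and caching for this run were roughly the
> following (a sketch; the loader helper and variable names are
> illustrative, not the exact application code):
>
> import org.apache.spark.{SparkConf, SparkContext}
>
> val conf = new SparkConf()
>   .set("spark.akka.frameSize", "20")           // MB, down from 500
>   .set("spark.storage.memoryFraction", "0.5")
> val sc = new SparkContext(conf)
> val points = buildLabeledPoints(sc)  // hypothetical loader
> points.cache()                       // ~165.9 MB once materialized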
>
>
> The corresponding master logs were:
>
> 14/07/04 06:29:34 INFO Master: Removing executor app-20140704062238-0033/1
> because it is EXITED
> 14/07/04 06:29:34 INFO Master: Launching executor app-20140704062238-0033/2
> on worker worker-20140630124441-slave1-40182
> 14/07/04 06:29:34 INFO Master: Removing executor app-20140704062238-0033/0
> because it is EXITED
> 14/07/04 06:29:34 INFO Master: Launching executor app-20140704062238-0033/3
> on worker worker-20140630102913-slave2-44735
> 14/07/04 06:29:37 INFO Master: Removing executor app-20140704062238-0033/2
> because it is EXITED
> 14/07/04 06:29:37 INFO Master: Launching executor app-20140704062238-0033/4
> on worker worker-20140630124441-slave1-40182
> 14/07/04 06:29:37 INFO Master: Removing executor app-20140704062238-0033/3
> because it is EXITED
> 14/07/04 06:29:37 INFO Master: Launching executor app-20140704062238-0033/5
> on worker worker-20140630102913-slave2-44735
> 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
> disassociated, removing it.
> 14/07/04 06:29:39 INFO Master: Removing app app-20140704062238-0033
> 14/07/04 06:29:39 INFO LocalActorRef: Message
> [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
> Actor[akka://sparkMaster/deadLetters] to
> Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.1.135%3A33061-123#1986674260]
> was not delivered. [39] dead letters encountered. This logging can be turned
> off or adjusted with configuration settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.
> 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
> disassociated, removing it.
> 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
> disassociated, removing it.
> 14/07/04 06:29:39 ERROR EndpointWriter: AssociationError
> [akka.tcp://sparkMaster@master:7077] -> [akka.tcp://spark@slave2:45172]:
> Error [Association failed with [akka.tcp://spark@slave2:45172]] [
> akka.remote.EndpointAssociationException: Association failed with
> [akka.tcp://spark@slave2:45172]
> Caused by:
> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> Connection refused: slave2/10.3.1.135:45172
> ]
> 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
> disassociated, removing it.
> 14/07/04 06:29:39 ERROR EndpointWriter: AssociationError
> [akka.tcp://sparkMaster@master:7077] -> [akka.tcp://spark@slave2:45172]:
> Error [Association failed with [akka.tcp://spark@slave2:45172]] [
> akka.remote.EndpointAssociationException: Association failed with
> [akka.tcp://spark@slave2:45172]
> Caused by:
> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> Connection refused: slave2/10.3.1.135:45172
> ]
> 14/07/04 06:29:39 ERROR EndpointWriter: AssociationError
> [akka.tcp://sparkMaster@master:7077] -> [akka.tcp://spark@slave2:45172]:
> Error [Association failed with [akka.tcp://spark@slave2:45172]] [
> akka.remote.EndpointAssociationException: Association failed with
> [akka.tcp://spark@slave2:45172]
> Caused by:
> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> Connection refused: slave2/10.3.1.135:45172
> ]
> 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
> disassociated, removing it.
> 14/07/04 06:29:40 WARN Master: Got status update for unknown executor
> app-20140704062238-0033/5
> 14/07/04 06:29:40 WARN Master: Got status update for unknown executor
> app-20140704062238-0033/4
>
>
> Coincidentally, after the initial executor failed, each following executor
> that was re-spawned failed with the following logs
> (e.g. the following is from
> slave1:~/spark-1.0.1-rc1/work/app-20140704062238-0033/2/stderr):
>
> log4j:WARN No appenders could be found for logger
> (org.apache.hadoop.conf.Configuration).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
> more info.
> 14/07/04 06:29:35 INFO SparkHadoopUtil: Using Spark's default log4j profile:
> org/apache/spark/log4j-defaults.properties
> 14/07/04 06:29:35 INFO SecurityManager: Changing view acls to: user1
> 14/07/04 06:29:35 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users with view permissions: Set(user1)
> 14/07/04 06:29:35 INFO Slf4jLogger: Slf4jLogger started
> 14/07/04 06:29:35 INFO Remoting: Starting remoting
> 14/07/04 06:29:36 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://sparkExecutor@slave1:54782]
> 14/07/04 06:29:36 INFO Remoting: Remoting now listens on addresses:
> [akka.tcp://sparkExecutor@slave1:54782]
> 14/07/04 06:29:36 INFO CoarseGrainedExecutorBackend: Connecting to driver:
> akka.tcp://spark@master:45172/user/CoarseGrainedScheduler
> 14/07/04 06:29:36 INFO WorkerWatcher: Connecting to worker
> akka.tcp://sparkWorker@slave1:40182/user/Worker
> 14/07/04 06:29:36 INFO WorkerWatcher: Successfully connected to
> akka.tcp://sparkWorker@slave1:40182/user/Worker
> 14/07/04 06:29:36 INFO CoarseGrainedExecutorBackend: Successfully registered
> with driver
> 14/07/04 06:29:36 INFO SecurityManager: Changing view acls to: user1
> 14/07/04 06:29:36 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users with view permissions: Set(user1)
> 14/07/04 06:29:36 INFO Slf4jLogger: Slf4jLogger started
> 14/07/04 06:29:36 INFO Remoting: Starting remoting
> 14/07/04 06:29:36 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://spark@slave1:39753]
> 14/07/04 06:29:36 INFO SparkEnv: Connecting to MapOutputTracker:
> akka.tcp://spark@master:45172/user/MapOutputTracker
> 14/07/04 06:29:36 INFO SparkEnv: Connecting to BlockManagerMaster:
> akka.tcp://spark@master:45172/user/BlockManagerMaster
> 14/07/04 06:29:36 INFO DiskBlockManager: Created local directory at
> /tmp/spark-local-20140704062936-6123
> 14/07/04 06:29:36 INFO MemoryStore: MemoryStore started with capacity 6.7
> GB.
> 14/07/04 06:29:36 INFO ConnectionManager: Bound socket to port 50960 with id
> = ConnectionManagerId(slave1,50960)
> 14/07/04 06:29:36 INFO BlockManagerMaster: Trying to register BlockManager
> 14/07/04 06:29:36 INFO BlockManagerMaster: Registered BlockManager
> 14/07/04 06:29:36 INFO HttpFileServer: HTTP File server directory is
> /tmp/spark-42c2782f-60f8-45a7-9e11-c789fc87fe2e
> 14/07/04 06:29:36 INFO HttpServer: Starting HTTP Server
> 14/07/04 06:29:36 ERROR CoarseGrainedExecutorBackend: Driver Disassociated
> [akka.tcp://sparkExecutor@slave1:54782] -> [akka.tcp://spark@master:45172]
> disassociated! Shutting down.
>
> In the case of the initial executor that did start successfully, the
> corresponding log messages (from
> spark-1.0.1-rc1/work/app-20140704062238-0033/1/stderr) on the executor were:
> log4j:WARN No appenders could be found for logger
> (org.apache.hadoop.conf.Configuration).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
> more info.
> 14/07/04 06:22:39 INFO SparkHadoopUtil: Using Spark's default log4j profile:
> org/apache/spark/log4j-defaults.properties
> 14/07/04 06:22:39 INFO SecurityManager: Changing view acls to: user1
> 14/07/04 06:22:39 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users with view permissions: Set(user1)
> 14/07/04 06:22:39 INFO Slf4jLogger: Slf4jLogger started
> 14/07/04 06:22:39 INFO Remoting: Starting remoting
> 14/07/04 06:22:39 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://sparkExecutor@slave1:50806]
> 14/07/04 06:22:39 INFO Remoting: Remoting now listens on addresses:
> [akka.tcp://sparkExecutor@slave1:50806]
> 14/07/04 06:22:39 INFO CoarseGrainedExecutorBackend: Connecting to driver:
> akka.tcp://spark@master:45172/user/CoarseGrainedScheduler
> 14/07/04 06:22:39 INFO WorkerWatcher: Connecting to worker
> akka.tcp://sparkWorker@slave1:40182/user/Worker
> 14/07/04 06:22:39 INFO WorkerWatcher: Successfully connected to
> akka.tcp://sparkWorker@slave1:40182/user/Worker
> 14/07/04 06:22:40 INFO CoarseGrainedExecutorBackend: Successfully registered
> with driver
> 14/07/04 06:22:40 INFO SecurityManager: Changing view acls to: user1
> 14/07/04 06:22:40 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users with view permissions: Set(user1)
> 14/07/04 06:22:40 INFO Slf4jLogger: Slf4jLogger started
> 14/07/04 06:22:40 INFO Remoting: Starting remoting
> 14/07/04 06:22:40 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://spark@slave1:38558]
> 14/07/04 06:22:40 INFO SparkEnv: Connecting to MapOutputTracker:
> akka.tcp://spark@master:45172/user/MapOutputTracker
> 14/07/04 06:22:40 INFO SparkEnv: Connecting to BlockManagerMaster:
> akka.tcp://spark@master:45172/user/BlockManagerMaster
> 14/07/04 06:22:40 INFO DiskBlockManager: Created local directory at
> /tmp/spark-local-20140704062240-6a65
> 14/07/04 06:22:40 INFO MemoryStore: MemoryStore started with capacity 6.7
> GB.
> 14/07/04 06:22:40 INFO ConnectionManager: Bound socket to port 46901 with id
> = ConnectionManagerId(slave1,46901)
> 14/07/04 06:22:40 INFO BlockManagerMaster: Trying to register BlockManager
> 14/07/04 06:22:40 INFO BlockManagerMaster: Registered BlockManager
> 14/07/04 06:22:40 INFO HttpFileServer: HTTP File server directory is
> /tmp/spark-9eba78f9-8ae9-477c-9338-7222ae6fe306
> 14/07/04 06:22:40 INFO HttpServer: Starting HTTP Server
> 14/07/04 06:22:42 INFO CoarseGrainedExecutorBackend: Got assigned task 0
> 14/07/04 06:22:42 INFO Executor: Running task ID 0
> 14/07/04 06:22:42 INFO CoarseGrainedExecutorBackend: Got assigned task 2
> 14/07/04 06:22:42 INFO Executor: Running task ID 2
> ...
>
>
>
> On Fri, Jul 4, 2014 at 5:52 AM, Xiangrui Meng <me...@gmail.com> wrote:
>>
>> The feature dimension is small. You don't need a big akka.frameSize.
>> The default one (10M) should be sufficient. Did you cache the data
>> before calling LRWithSGD? -Xiangrui
>>
>> On Thu, Jul 3, 2014 at 10:02 AM, Bharath Ravi Kumar <re...@gmail.com>
>> wrote:
>> > I tried another run after setting the driver memory to 8G (and
>> > spark.akka.frameSize = 500 on the executors and the driver). In addition,
>> > I also tried to reduce the amount of data that a single task processes by
>> > increasing the number of partitions (of the labeled points) to 120
>> > (instead of the 2 used earlier) and then setting max cores to 2. That
>> > made no difference: at the end of 120 tasks, the familiar error message
>> > appeared on a slave:
>> >
>> > <snipped earlier logs>
>> > 14/07/03 16:18:48 INFO CoarseGrainedExecutorBackend: Got assigned task
>> > 1436
>> > 14/07/03 16:18:48 INFO Executor: Running task ID 1436
>> > 14/07/03 16:18:53 INFO HadoopRDD: Input split:
>> > file:~//2014-05-24-02/part-r-00014:0+2215337
>> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
>> > file:~//2014-05-24-02/part-r-00014:2215337+2215338
>> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
>> > file:~//2014-05-24-02/part-r-00003:0+2196429
>> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
>> > file:~//2014-05-24-02/part-r-00003:2196429+2196430
>> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
>> > file:~//2014-05-24-02/part-r-00010:0+2186751
>> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
>> > file:~//2014-05-24-02/part-r-00010:2186751+2186751
>> > 14/07/03 16:18:54 INFO Executor: Serialized size of result for 1436 is
>> > 5958822
>> > 14/07/03 16:18:54 INFO Executor: Sending result for 1436 directly to
>> > driver
>> > 14/07/03 16:18:54 INFO Executor: Finished task ID 1436
>> > 14/07/03 16:18:54 INFO CoarseGrainedExecutorBackend: Got assigned task
>> > 1438
>> > 14/07/03 16:18:54 INFO Executor: Running task ID 1438
>> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
>> > file:~//2014-05-24-02/part-r-00004:0+2209615
>> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
>> > file:~//2014-05-24-02/part-r-00004:2209615+2209616
>> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
>> > file:~//2014-05-24-02/part-r-00011:0+2202240
>> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
>> > file:~//2014-05-24-02/part-r-00011:2202240+2202240
>> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
>> > file:~//2014-05-24-02/part-r-00009:0+2194423
>> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
>> > file:~//2014-05-24-02/part-r-00009:2194423+2194424
>> > 14/07/03 16:19:00 INFO Executor: Serialized size of result for 1438 is
>> > 5958822
>> > 14/07/03 16:19:00 INFO Executor: Sending result for 1438 directly to
>> > driver
>> > 14/07/03 16:19:00 INFO Executor: Finished task ID 1438
>> > 14/07/03 16:19:14 ERROR CoarseGrainedExecutorBackend: Driver
>> > Disassociated
>> > [akka.tcp://sparkExecutor@slave1:51099] ->
>> > [akka.tcp://spark@master:58272]
>> > disassociated! Shutting down.
>> >
>> >
>> > The corresponding master logs were:
>> >
>> > 14/07/03 16:02:14 INFO Master: Registering app LogRegExp
>> > 14/07/03 16:02:14 INFO Master: Registered app LogRegExp with ID
>> > app-20140703160214-0028
>> > 14/07/03 16:02:14 INFO Master: Launching executor
>> > app-20140703160214-0028/1
>> > on worker worker-20140630124441-slave1-40182
>> > 14/07/03 16:19:15 INFO Master: Removing executor
>> > app-20140703160214-0028/1
>> > because it is EXITED
>> > 14/07/03 16:19:15 INFO Master: Launching executor
>> > app-20140703160214-0028/2
>> > on worker worker-20140630124441-slave1-40182
>> > 14/07/03 16:19:15 INFO Master: Removing executor
>> > app-20140703160214-0028/0
>> > because it is EXITED
>> > 14/07/03 16:19:15 INFO Master: Launching executor
>> > app-20140703160214-0028/3
>> > on worker worker-20140630102913-slave2-44735
>> > 14/07/03 16:19:18 INFO Master: Removing executor
>> > app-20140703160214-0028/2
>> > because it is EXITED
>> > 14/07/03 16:19:18 INFO Master: Launching executor
>> > app-20140703160214-0028/4
>> > on worker worker-20140630124441-slave1-40182
>> > 14/07/03 16:19:18 INFO Master: Removing executor
>> > app-20140703160214-0028/3
>> > because it is EXITED
>> > 14/07/03 16:19:18 INFO Master: Launching executor
>> > app-20140703160214-0028/5
>> > on worker worker-20140630102913-slave2-44735
>> > 14/07/03 16:19:20 INFO Master: akka.tcp://spark@master:58272 got
>> > disassociated, removing it.
>> > 14/07/03 16:19:20 INFO Master: Removing app app-20140703160214-0028
>> > 14/07/03 16:19:20 INFO Master: akka.tcp://spark@master:58272 got
>> > disassociated, removing it.
>> >
>> >
>> > Throughout the execution, I confirmed in the UI that driver memory used
>> > was 0.0 B / 6.9 GB and each executor's memory showed 0.0 B / 12.1 GB,
>> > even while the aggregate was executing. On a related note, I noticed in
>> > the executors tab that just before the entire job terminated, the
>> > executors on slave1 and slave2 and the driver "disappeared" momentarily
>> > from the active executors list. The replacement executors on slave1 and
>> > slave2 were re-spawned a couple of times and appeared on the list again
>> > before they too died and the job failed.
>> > So it appears that no matter what the task input/result size is, the
>> > execution fails at the end of the stage corresponding to
>> > GradientDescent.aggregate (while the preceding count() in
>> > GradientDescent goes through fine). A sketch of this run's configuration
>> > follows. Let me know if you need any additional information.
>> >
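>> > Roughly, the configuration for this run was (a sketch; variable names
>> > are illustrative, and driver memory was set separately, e.g. via
>> > spark-submit --driver-memory 8g):
>> >
>> > import org.apache.spark.SparkConf
>> >
>> > val conf = new SparkConf()
>> >   .set("spark.akka.frameSize", "500")
>> >   .set("spark.cores.max", "2")
>> > // 120 partitions of the labeled points instead of the earlier 2
>> > val partitioned = labeledPoints.repartition(120)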
>> >
>> > On Thu, Jul 3, 2014 at 12:27 PM, Xiangrui Meng <me...@gmail.com> wrote:
>> >>
>> >> Could you check the driver memory in the executor tab of the Spark UI
>> >> when the job is running? If it is too small, please set
>> >> --driver-memory with spark-submit, e.g. 10g. Could you also attach the
>> >> master log under spark/logs as well? -Xiangrui
>> >>
>> >> On Wed, Jul 2, 2014 at 9:34 AM, Bharath Ravi Kumar
>> >> <re...@gmail.com>
>> >> wrote:
>> >> > Hi Xiangrui,
>> >> >
>> >> > The issue with aggergating/counting over large feature vectors (as
>> >> > part
>> >> > of
>> >> > LogisticRegressionWithSGD) continues to exist, but now in another
>> >> > form:
>> >> > while the execution doesn't freeze (due to SPARK-1112), it now fails
>> >> > at
>> >> > the
>> >> > second or third gradient descent iteration consistently with an error
>> >> > level
>> >> > log message, but no stacktrace. I'm running against 1.0.1-rc1, and
>> >> > have
>> >> > tried setting spark.akka.frameSize as high as 500. When the execution
>> >> > fails,
>> >> > each of the two executors log the following message (corresponding to
>> >> > aggregate at GradientDescent.scala:178) :
>> >> >
>> >> > 14/07/02 14:09:09 INFO
>> >> > BlockFetcherIterator$BasicBlockFetcherIterator:
>> >> > maxBytesInFlight: 50331648, targetRequestSize: 10066329
>> >> > 14/07/02 14:09:09 INFO
>> >> > BlockFetcherIterator$BasicBlockFetcherIterator:
>> >> > Getting 2 non-empty blocks out of 2 blocks
>> >> > 14/07/02 14:09:09 INFO
>> >> > BlockFetcherIterator$BasicBlockFetcherIterator:
>> >> > Started 1 remote fetches in 0 ms
>> >> > 14/07/02 14:09:11 INFO Executor: Serialized size of result for 737 is
>> >> > 5959086
>> >> > 14/07/02 14:09:11 INFO Executor: Sending result for 737 directly to
>> >> > driver
>> >> > 14/07/02 14:09:11 INFO Executor: Finished task ID 737
>> >> > 14/07/02 14:09:18 ERROR CoarseGrainedExecutorBackend: Driver
>> >> > Disassociated
>> >> > [akka.tcp://sparkExecutor@(slave1,slave2):51941] ->
>> >> > [akka.tcp://spark@master:59487] disassociated! Shutting down.
>> >> >
>> >> >
>> >> > There is no separate stacktrace on the driver side.
>> >> >
>> >> > Each input record is of the form p1, p2, (p1,p2) where p1, p2 &
>> >> > (p1,p2)
>> >> > are
>> >> > categorical features with large cardinality, and X is the double
>> >> > label
>> >> > with
>> >> > a continuous value. The categorical variables are converted to binary
>> >> > variables which results in a feature vector of size 741092 (composed
>> >> > of
>> >> > all
>> >> > unique categories across p1, p2 and (p1,p2)). Thus, the labeled point
>> >> > for
>> >> > input record is a sparse vector of size 741092 with only 3 variables
>> >> > set
>> >> > in
>> >> > the record. The total number of records is 683233 after aggregating
>> >> > the
>> >> > input data on (p1, p2). When attempting to train on the unaggregated
>> >> > records
>> >> > (1337907 in number spread across 455 files), the execution fails at
>> >> > count,
>> >> > GradientDescent.scala:161 with the following log
>> >> >
>> >> >
>> >> > (Snipped lines corresponding to other input files)
>> >> > 14/07/02 16:02:03 INFO HadoopRDD: Input split:
>> >> > file:~/part-r-00012:2834590+2834590
>> >> > 14/07/02 16:02:03 INFO HadoopRDD: Input split:
>> >> > file:~/part-r-00005:0+2845559
>> >> > 14/07/02 16:02:03 INFO HadoopRDD: Input split:
>> >> > file:~/part-r-00005:2845559+2845560
>> >> > 14/07/02 16:02:03 INFO Executor: Serialized size of result for 726 is
>> >> > 615
>> >> > 14/07/02 16:02:03 INFO Executor: Sending result for 726 directly to
>> >> > driver
>> >> > 14/07/02 16:02:03 INFO Executor: Finished task ID 726
>> >> > 14/07/02 16:02:12 ERROR CoarseGrainedExecutorBackend: Driver
>> >> > Disassociated
>> >> > [akka.tcp://sparkExecutor@slave1:48423] ->
>> >> > [akka.tcp://spark@master:55792]
>> >> > disassociated! Shutting down.
>> >> >
>> >> > A count() attempted on the input RDD before beginning training has
>> >> > the
>> >> > following metrics:
>> >> >
>> >> >
>> >> > Metric                            Min     25th    Median  75th    Max
>> >> > Result serialization time         0 ms    0 ms    0 ms    0 ms    0 ms
>> >> > Duration                          33 s    33 s    35 s    35 s    35 s
>> >> > Time spent fetching task results  0 ms    0 ms    0 ms    0 ms    0 ms
>> >> > Scheduler delay                   0.1 s   0.1 s   0.3 s   0.3 s   0.3 s
>> >> >
>> >> > Aggregated Metrics by Executor
>> >> >
>> >> > ID  Address              Task Time  Total  Failed  Succeeded  Shuffle Read  Shuffle Write  Shuf Spill (Mem)  Shuf Spill (Disk)
>> >> > 0   CANNOT FIND ADDRESS  34 s       1      0       1          0.0 B         0.0 B          0.0 B             0.0 B
>> >> > 1   CANNOT FIND ADDRESS  36 s       1      0       1          0.0 B         0.0 B          0.0 B             0.0 B
>> >> >
>> >> > Tasks
>> >> >
>> >> > Task Index  Task ID  Status   Locality Level  Executor  Launch Time          Duration  GC Time  Result Ser Time  Errors
>> >> > 0           726      SUCCESS  PROCESS_LOCAL   slave1    2014/07/02 16:01:28  35 s      0.1 s
>> >> > 1           727      SUCCESS  PROCESS_LOCAL   slave2    2014/07/02 16:01:28  33 s      99 ms
>> >> >
>> >> > Any pointers / diagnosis please?
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > On Thu, Jun 19, 2014 at 10:03 AM, Bharath Ravi Kumar
>> >> > <re...@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> Thanks. I'll await the fix to re-run my test.
>> >> >>
>> >> >>
>> >> >> On Thu, Jun 19, 2014 at 8:28 AM, Xiangrui Meng <me...@gmail.com>
>> >> >> wrote:
>> >> >>>
>> >> >>> Hi Bharath,
>> >> >>>
>> >> >>> This is related to SPARK-1112, which we already found the root
>> >> >>> cause.
>> >> >>> I will let you know when this is fixed.
>> >> >>>
>> >> >>> Best,
>> >> >>> Xiangrui
>> >> >>>
>> >> >>> On Tue, Jun 17, 2014 at 7:37 PM, Bharath Ravi Kumar
>> >> >>> <re...@gmail.com>
>> >> >>> wrote:
>> >> >>> > Couple more points:
>> >> >>> > 1) The inexplicable stalling of execution with large feature sets
>> >> >>> > appears
>> >> >>> > similar to that reported with the news-20 dataset:
>> >> >>> >
>> >> >>> >
>> >> >>> >
>> >> >>> > http://mail-archives.apache.org/mod_mbox/spark-user/201406.mbox/%3C53A03542.1010605@gmail.com%3E
>> >> >>> >
>> >> >>> > 2) The NPE hit when calling mapToPair to convert an RDD<Long,
>> >> >>> > Long, Integer, Integer> into a JavaPairRDD<Tuple2<Long,Long>,
>> >> >>> > Tuple2<Integer,Integer>> is unrelated to mllib.
>> >> >>> >
>> >> >>> > Thanks,
>> >> >>> > Bharath
>> >> >>> >
>> >> >>> >
>> >> >>> >
>> >> >>> > On Wed, Jun 18, 2014 at 7:14 AM, Bharath Ravi Kumar
>> >> >>> > <re...@gmail.com>
>> >> >>> > wrote:
>> >> >>> >>
>> >> >>> >> Hi  Xiangrui ,
>> >> >>> >>
>> >> >>> >> I'm using 1.0.0.
>> >> >>> >>
>> >> >>> >> Thanks,
>> >> >>> >> Bharath
>> >> >>> >>
>> >> >>> >> On 18-Jun-2014 1:43 am, "Xiangrui Meng" <me...@gmail.com>
>> >> >>> >> wrote:
>> >> >>> >>>
>> >> >>> >>> Hi Bharath,
>> >> >>> >>>
>> >> >>> >>> Thanks for posting the details! Which Spark version are you
>> >> >>> >>> using?
>> >> >>> >>>
>> >> >>> >>> Best,
>> >> >>> >>> Xiangrui
>> >> >>> >>>
>> >> >>> >>> On Tue, Jun 17, 2014 at 6:48 AM, Bharath Ravi Kumar
>> >> >>> >>> <re...@gmail.com>
>> >> >>> >>> wrote:
>> >> >>> >>> > Hi,
>> >> >>> >>> >
>> >> >>> >>> > (Apologies for the long mail, but it's necessary to provide
>> >> >>> >>> > sufficient
>> >> >>> >>> > details considering the number of issues faced.)
>> >> >>> >>> >
>> >> >>> >>> > I'm running into issues testing LogisticRegressionWithSGD on a
>> >> >>> >>> > two node cluster
>> >> >>> >>> > (each node with 24 cores and 16G available to slaves out of
>> >> >>> >>> > 24G
>> >> >>> >>> > on
>> >> >>> >>> > the
>> >> >>> >>> > system). Here's a description of the application:
>> >> >>> >>> >
>> >> >>> >>> > The model is being trained based on categorical features x,
>> >> >>> >>> > y,
>> >> >>> >>> > and
>> >> >>> >>> > (x,y).
>> >> >>> >>> > The categorical features are mapped to binary features by
>> >> >>> >>> > converting
>> >> >>> >>> > each
>> >> >>> >>> > distinct value in the category enum into a binary feature by
>> >> >>> >>> > itself
>> >> >>> >>> > (i.e.
>> >> >>> >>> > presence of that value in a record implies corresponding
>> >> >>> >>> > feature
>> >> >>> >>> > =
>> >> >>> >>> > 1,
>> >> >>> >>> > else
>> >> >>> >>> > feature = 0. So, there'd be as many distinct features as enum
>> >> >>> >>> > values).
>> >> >>> >>> > The
>> >> >>> >>> > training vector is laid out as
>> >> >>> >>> > [x1,x2...xn,y1,y2....yn,(x1,y1),(x2,y2)...(xn,yn)]. Each
>> >> >>> >>> > record
>> >> >>> >>> > in
>> >> >>> >>> > the
>> >> >>> >>> > training data has only one combination (Xk,Yk) and a label
>> >> >>> >>> > appearing in
>> >> >>> >>> > the
>> >> >>> >>> > record. Thus, the corresponding labeledpoint sparse vector
>> >> >>> >>> > would
>> >> >>> >>> > only
>> >> >>> >>> > have 3
>> >> >>> >>> > values Xk, Yk, (Xk,Yk) set for a record. The total length of
>> >> >>> >>> > the
>> >> >>> >>> > vector
>> >> >>> >>> > (though sparse) would be nearly 614000.  The number of records
>> >> >>> >>> > is
>> >> >>> >>> > about
>> >> >>> >>> > 1.33
>> >> >>> >>> > million. The records have been coalesced into 20 partitions
>> >> >>> >>> > across
>> >> >>> >>> > two
>> >> >>> >>> > nodes. The input data has not been cached.
>> >> >>> >>> > (NOTE: I do realize the records & features may seem large for
>> >> >>> >>> > a
>> >> >>> >>> > two
>> >> >>> >>> > node
>> >> >>> >>> > setup, but given the memory & cpu, and the fact that I'm
>> >> >>> >>> > willing
>> >> >>> >>> > to
>> >> >>> >>> > give up
>> >> >>> >>> > some turnaround time, I don't see why tasks should
>> >> >>> >>> > inexplicably
>> >> >>> >>> > fail)
>> >> >>> >>> >
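An illustrative sketch of this encoding (not from the original mail; the
helper name toLabeledPoint and the index arguments are hypothetical, and
only the three active features of a record are stored):

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.LabeledPoint

  val numFeatures = 614000  // approximate total one-hot dimension, as above

  // xIdx, yIdx, xyIdx: positions of this record's x, y and (x,y) categories
  // in the global feature index, assumed to be precomputed elsewhere.
  def toLabeledPoint(label: Double, xIdx: Int, yIdx: Int, xyIdx: Int): LabeledPoint =
    LabeledPoint(label,
      Vectors.sparse(numFeatures,
        Array(xIdx, yIdx, xyIdx).sorted,  // indices of the 3 active features
        Array(1.0, 1.0, 1.0)))            // one-hot values
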
>> >> >>> >>> > Additional parameters include:
>> >> >>> >>> >
>> >> >>> >>> > spark.executor.memory = 14G
>> >> >>> >>> > spark.default.parallelism = 1
>> >> >>> >>> > spark.cores.max=20
>> >> >>> >>> > spark.storage.memoryFraction=0.8 //No cache space required
>> >> >>> >>> > (Trying to set spark.akka.frameSize to a larger number, say,
>> >> >>> >>> > 20
>> >> >>> >>> > didn't
>> >> >>> >>> > help
>> >> >>> >>> > either)
>> >> >>> >>> >
>> >> >>> >>> > The model training was initialized as: new
>> >> >>> >>> > LogisticRegressionWithSGD(1,
>> >> >>> >>> > maxIterations, 0.0, 0.05)
>> >> >>> >>> >
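For reference, the same initialization through the public entry point would
look roughly like the sketch below (trainingData is a hypothetical name for
the RDD[LabeledPoint] built as above; this overload corresponds to the 0.0
regParam used above):

  import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

  val model = LogisticRegressionWithSGD.train(
    trainingData,
    maxIterations,  // numIterations
    1.0,            // stepSize
    0.05)           // miniBatchFraction
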
>> >> >>> >>> > However, after 4 iterations of gradient descent, the entire
>> >> >>> >>> > execution
>> >> >>> >>> > appeared to stall inexplicably. The corresponding executor
>> >> >>> >>> > details
>> >> >>> >>> > and
>> >> >>> >>> > details of the stalled stage (number 14) are as follows:
>> >> >>> >>> >
>> >> >>> >>> > Metric                            Min     25th    Median  75th    Max
>> >> >>> >>> > Result serialization time         12 ms   13 ms   14 ms   16 ms   18 ms
>> >> >>> >>> > Duration                          4 s     4 s     5 s     5 s     5 s
>> >> >>> >>> > Time spent fetching task results  0 ms    0 ms    0 ms    0 ms    0 ms
>> >> >>> >>> > Scheduler delay                   6 s     6 s     6 s     6 s     12 s
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> > Stage Id
>> >> >>> >>> > 14 aggregate at GradientDescent.scala:178
>> >> >>> >>> >
>> >> >>> >>> > Task Index  Task ID  Status   Locality Level  Executor                     Launch Time          Duration  GC Time  Result Ser Time  Errors
>> >> >>> >>> > 0           600      RUNNING  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  1.1 h
>> >> >>> >>> > 1           601      RUNNING  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  1.1 h
>> >> >>> >>> > 2           602      RUNNING  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  1.1 h
>> >> >>> >>> > 3           603      RUNNING  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  1.1 h
>> >> >>> >>> > 4           604      RUNNING  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  1.1 h
>> >> >>> >>> > 5           605      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  4 s       2 s      12 ms
>> >> >>> >>> > 6           606      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  4 s       1 s      14 ms
>> >> >>> >>> > 7           607      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  4 s       2 s      12 ms
>> >> >>> >>> > 8           608      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  5 s       1 s      15 ms
>> >> >>> >>> > 9           609      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  5 s       1 s      14 ms
>> >> >>> >>> > 10          610      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  5 s       1 s      15 ms
>> >> >>> >>> > 11          611      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  4 s       1 s      13 ms
>> >> >>> >>> > 12          612      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  5 s       1 s      18 ms
>> >> >>> >>> > 13          613      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  5 s       1 s      13 ms
>> >> >>> >>> > 14          614      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  4 s       1 s      14 ms
>> >> >>> >>> > 15          615      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  4 s       1 s      12 ms
>> >> >>> >>> > 16          616      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  5 s       1 s      15 ms
>> >> >>> >>> > 17          617      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  5 s       1 s      18 ms
>> >> >>> >>> > 18          618      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  5 s       1 s      16 ms
>> >> >>> >>> > 19          619      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  4 s       1 s      18 ms
>> >> >>> >>> >
>> >> >>> >>> > Executor stats:
>> >> >>> >>> >
>> >> >>> >>> > RDD Blocks  Memory Used     Disk Used  Active Tasks  Failed Tasks  Complete Tasks  Total Tasks  Task Time  Shuffle Read  Shuffle Write
>> >> >>> >>> > 0           0.0 B / 6.7 GB  0.0 B      2             0             307             309          23.2 m     0.0 B         0.0 B
>> >> >>> >>> > 0           0.0 B / 6.7 GB  0.0 B      3             0             308             311          22.4 m     0.0 B         0.0 B
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> > Executor jmap output:
>> >> >>> >>> >
>> >> >>> >>> > Server compiler detected.
>> >> >>> >>> > JVM version is 24.55-b03
>> >> >>> >>> >
>> >> >>> >>> > using thread-local object allocation.
>> >> >>> >>> > Parallel GC with 18 thread(s)
>> >> >>> >>> >
>> >> >>> >>> > Heap Configuration:
>> >> >>> >>> >    MinHeapFreeRatio = 40
>> >> >>> >>> >    MaxHeapFreeRatio = 70
>> >> >>> >>> >    MaxHeapSize      = 10737418240 (10240.0MB)
>> >> >>> >>> >    NewSize          = 1310720 (1.25MB)
>> >> >>> >>> >    MaxNewSize       = 17592186044415 MB
>> >> >>> >>> >    OldSize          = 5439488 (5.1875MB)
>> >> >>> >>> >    NewRatio         = 2
>> >> >>> >>> >    SurvivorRatio    = 8
>> >> >>> >>> >    PermSize         = 21757952 (20.75MB)
>> >> >>> >>> >    MaxPermSize      = 134217728 (128.0MB)
>> >> >>> >>> >    G1HeapRegionSize = 0 (0.0MB)
>> >> >>> >>> >
>> >> >>> >>> > Heap Usage:
>> >> >>> >>> > PS Young Generation
>> >> >>> >>> > Eden Space:
>> >> >>> >>> >    capacity = 2783969280 (2655.0MB)
>> >> >>> >>> >    used     = 192583816 (183.66223907470703MB)
>> >> >>> >>> >    free     = 2591385464 (2471.337760925293MB)
>> >> >>> >>> >    6.917598458557704% used
>> >> >>> >>> > From Space:
>> >> >>> >>> >    capacity = 409993216 (391.0MB)
>> >> >>> >>> >    used     = 1179808 (1.125152587890625MB)
>> >> >>> >>> >    free     = 408813408 (389.8748474121094MB)
>> >> >>> >>> >    0.2877628102022059% used
>> >> >>> >>> > To Space:
>> >> >>> >>> >    capacity = 385351680 (367.5MB)
>> >> >>> >>> >    used     = 0 (0.0MB)
>> >> >>> >>> >    free     = 385351680 (367.5MB)
>> >> >>> >>> >    0.0% used
>> >> >>> >>> > PS Old Generation
>> >> >>> >>> >    capacity = 7158628352 (6827.0MB)
>> >> >>> >>> >    used     = 4455093024 (4248.707794189453MB)
>> >> >>> >>> >    free     = 2703535328 (2578.292205810547MB)
>> >> >>> >>> >    62.2338918146983% used
>> >> >>> >>> > PS Perm Generation
>> >> >>> >>> >    capacity = 90701824 (86.5MB)
>> >> >>> >>> >    used     = 45348832 (43.248016357421875MB)
>> >> >>> >>> >    free     = 45352992 (43.251983642578125MB)
>> >> >>> >>> >    49.99770677158598% used
>> >> >>> >>> >
>> >> >>> >>> > 8432 interned Strings occupying 714672 bytes.
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> > Executor GC log snippet:
>> >> >>> >>> >
>> >> >>> >>> > 168.778: [GC [PSYoungGen: 2702831K->578545K(2916864K)]
>> >> >>> >>> > 9302453K->7460857K(9907712K), 0.3193550 secs] [Times:
>> >> >>> >>> > user=5.13
>> >> >>> >>> > sys=0.39,
>> >> >>> >>> > real=0.32 secs]
>> >> >>> >>> > 169.097: [Full GC [PSYoungGen: 578545K->0K(2916864K)]
>> >> >>> >>> > [ParOldGen:
>> >> >>> >>> > 6882312K->1073297K(6990848K)] 7460857K->1073297K(9907712K)
>> >> >>> >>> > [PSPermGen:
>> >> >>> >>> > 44248K->44201K(88576K)], 4.5521090 secs] [Times: user=24.22
>> >> >>> >>> > sys=0.18,
>> >> >>> >>> > real=4.55 secs]
>> >> >>> >>> > 174.207: [GC [PSYoungGen: 2338304K->81315K(2544128K)]
>> >> >>> >>> > 3411653K->1154665K(9534976K), 0.0966280 secs] [Times:
>> >> >>> >>> > user=1.66
>> >> >>> >>> > sys=0.00,
>> >> >>> >>> > real=0.09 secs]
>> >> >>> >>> >
>> >> >>> >>> > I tried to map partitions to cores on the nodes. Increasing
>> >> >>> >>> > the
>> >> >>> >>> > number
>> >> >>> >>> > of
>> >> >>> >>> > partitions (say to 80 or 100) would result in progress till
>> >> >>> >>> > the
>> >> >>> >>> > 6th
>> >> >>> >>> > iteration or so, but the next stage would stall as before with
>> >> >>> >>> > no apparent root cause in the logs. With increased partitions,
>> >> >>> >>> > the last stage that
>> >> >>> >>> > completed
>> >> >>> >>> > had
>> >> >>> >>> > the following task times:
>> >> >>> >>> >
>> >> >>> >>> > Metric                            Min     25th    Median  75th    Max
>> >> >>> >>> > Result serialization time         11 ms   12 ms   13 ms   15 ms   0.4 s
>> >> >>> >>> > Duration                          0.5 s   0.9 s   1 s     3 s     7 s
>> >> >>> >>> > Time spent fetching task results  0 ms    0 ms    0 ms    0 ms    0 ms
>> >> >>> >>> > Scheduler delay                   5 s     6 s     6 s     7 s     12 s
>> >> >>> >>> >
>> >> >>> >>> > My hypothesis is that as the coefficient array becomes less
>> >> >>> >>> > sparse
>> >> >>> >>> > (with
>> >> >>> >>> > successive iterations), the cost of the aggregate goes up to
>> >> >>> >>> > the
>> >> >>> >>> > point
>> >> >>> >>> > that
>> >> >>> >>> > it stalls (which I failed to explain). Reducing the batch
>> >> >>> >>> > fraction
>> >> >>> >>> > to a
>> >> >>> >>> > very
>> >> >>> >>> > low number like 0.01 saw the iterations progress further, but
>> >> >>> >>> > the
>> >> >>> >>> > model
>> >> >>> >>> > failed to converge in that case after a small number of
>> >> >>> >>> > iterations.
>> >> >>> >>> >
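A simplified sketch of the per-iteration aggregate at
GradientDescent.scala:178 makes the cost model behind this hypothesis
concrete (an illustration, not the actual MLlib source; addGradient is a
hypothetical helper): each task folds its sampled points into a dense
gradient vector of length numFeatures, so every task result is on the order
of numFeatures * 8 bytes regardless of input sparsity. For the later
741092-feature runs that is roughly 5.9 MB, consistent with the "Serialized
size of result ... 5959086" executor log lines.

  import breeze.linalg.{DenseVector => BDV}

  // sample: this iteration's mini-batch; weights: the current coefficients.
  val (gradientSum, lossSum) =
    sample.aggregate((BDV.zeros[Double](numFeatures), 0.0))(
      seqOp = (acc, point) => {
        // add this point's gradient into the dense running sum, in place
        val loss = addGradient(point, weights, acc._1)
        (acc._1, acc._2 + loss)
      },
      combOp = (a, b) => (a._1 += b._1, a._2 + b._2)
    )

  // Note: the result size is independent of the mini-batch fraction, which
  // would explain why shrinking the fraction only delays the failure point.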
>> >> >>> >>> >
>> >> >>> >>> > I also tried reducing the number of records by aggregating on
>> >> >>> >>> > (x,y)
>> >> >>> >>> > as
>> >> >>> >>> > the
>> >> >>> >>> > key (i.e. using aggregations instead of training on every raw
>> >> >>> >>> > record),
>> >> >>> >>> > but
>> >> >>> >>> > encountered the following exception:
>> >> >>> >>> >
>> >> >>> >>> > Loss was due to java.lang.NullPointerException
>> >> >>> >>> > java.lang.NullPointerException
>> >> >>> >>> >         at
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> > org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
>> >> >>> >>> >         at
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> > org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
>> >> >>> >>> >         at
>> >> >>> >>> > scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> >> >>> >>> >         at
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> > org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:59)
>> >> >>> >>> >         at
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> > org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
>> >> >>> >>> >         at
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> > org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
>> >> >>> >>> >         at
>> >> >>> >>> > org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
>> >> >>> >>> >         at
>> >> >>> >>> > org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
>> >> >>> >>> >         at
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> >> >>> >>> >         at
>> >> >>> >>> >
>> >> >>> >>> > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>> >> >>> >>> >         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>> >> >>> >>> >         at
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
>> >> >>> >>> >         at
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>> >> >>> >>> >         at org.apache.spark.scheduler.Task.run(Task.scala:51)
>> >> >>> >>> >         at
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>> >> >>> >>> >         at
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >> >>> >>> >         at
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >> >>> >>> >         at java.lang.Thread.run(Thread.java:745)
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> > I'd appreciate any insights/comments about what may be
>> >> >>> >>> > causing
>> >> >>> >>> > the
>> >> >>> >>> > execution
>> >> >>> >>> > to stall.
>> >> >>> >>> >
>> >> >>> >>> > If logs/tables appear poorly indented in the email, here's a
>> >> >>> >>> > gist
>> >> >>> >>> > with
>> >> >>> >>> > relevant details:
>> >> >>> >>> > https://gist.github.com/reachbach/a418ab2f01b639b624c1
>> >> >>> >>> >
>> >> >>> >>> > Thanks,
>> >> >>> >>> > Bharath
>> >> >>> >
>> >> >>> >
>> >> >>
>> >> >>
>> >> >
>> >
>> >
>
>

Re: Execution stalls in LogisticRegressionWithSGD

Posted by Bharath Ravi Kumar <re...@gmail.com>.
Xiangrui,

Leaving the frameSize unspecified led to an error message (and failure)
stating that the task size (~11M) was larger than the frame size. I hence
set it to an arbitrarily large value (I realize 500 was unrealistic and
unnecessary in this case). I've now set the size to 20M and repeated the
runs. The earlier
runs were on an uncached RDD. Caching the RDD (and setting
spark.storage.memoryFraction=0.5) resulted in marginal speed up of
execution, but the end result remained the same. The cached RDD size is as
follows:

RDD Name  Storage Level                      Cached Partitions  Fraction Cached  Size in Memory  Size in Tachyon  Size on Disk
1084      Memory Deserialized 1x Replicated  80                 100%             165.9 MB        0.0 B            0.0 B
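
In code, the setup for this run is roughly the following sketch (sc, the
input path and parsePoint are hypothetical placeholders for the actual job):

  val conf = new org.apache.spark.SparkConf()
    .setAppName("LogRegExp")
    .set("spark.akka.frameSize", "20")           // MB
    .set("spark.storage.memoryFraction", "0.5")
  val sc = new org.apache.spark.SparkContext(conf)

  val points = sc.textFile("file:///data/input").map(parsePoint).cache()
  points.count()  // materialize the cached RDD before training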


The corresponding master logs were:

14/07/04 06:29:34 INFO Master: Removing executor app-20140704062238-0033/1
because it is EXITED
14/07/04 06:29:34 INFO Master: Launching executor app-20140704062238-0033/2
on worker worker-20140630124441-slave1-40182
14/07/04 06:29:34 INFO Master: Removing executor app-20140704062238-0033/0
because it is EXITED
14/07/04 06:29:34 INFO Master: Launching executor app-20140704062238-0033/3
on worker worker-20140630102913-slave2-44735
14/07/04 06:29:37 INFO Master: Removing executor app-20140704062238-0033/2
because it is EXITED
14/07/04 06:29:37 INFO Master: Launching executor app-20140704062238-0033/4
on worker worker-20140630124441-slave1-40182
14/07/04 06:29:37 INFO Master: Removing executor app-20140704062238-0033/3
because it is EXITED
14/07/04 06:29:37 INFO Master: Launching executor app-20140704062238-0033/5
on worker worker-20140630102913-slave2-44735
14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
disassociated, removing it.
14/07/04 06:29:39 INFO Master: Removing app app-20140704062238-0033
14/07/04 06:29:39 INFO LocalActorRef: Message
[akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
Actor[akka://sparkMaster/deadLetters] to
Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.1.135%3A33061-123#1986674260]
was not delivered. [39] dead letters encountered. This logging can be
turned off or adjusted with configuration settings 'akka.log-dead-letters'
and 'akka.log-dead-letters-during-shutdown'.
14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
disassociated, removing it.
14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
disassociated, removing it.
14/07/04 06:29:39 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkMaster@master:7077] -> [akka.tcp://spark@slave2:45172]:
Error [Association failed with [akka.tcp://spark@slave2:45172]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://spark@slave2:45172]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: slave2/10.3.1.135:45172
]
14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
disassociated, removing it.
14/07/04 06:29:39 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkMaster@master:7077] -> [akka.tcp://spark@slave2:45172]:
Error [Association failed with [akka.tcp://spark@slave2:45172]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://spark@slave2:45172]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: slave2/10.3.1.135:45172
]
14/07/04 06:29:39 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkMaster@master:7077] -> [akka.tcp://spark@slave2:45172]:
Error [Association failed with [akka.tcp://spark@slave2:45172]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://spark@slave2:45172]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: slave2/10.3.1.135:45172
]
14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
disassociated, removing it.
14/07/04 06:29:40 WARN Master: Got status update for unknown executor
app-20140704062238-0033/5
14/07/04 06:29:40 WARN Master: Got status update for unknown executor
app-20140704062238-0033/4


Incidentally, after the initial executor failed, each subsequently
re-spawned executor also failed with the following logs (e.g. the following
is from slave1:~/spark-1.0.1-rc1/work/app-20140704062238-0033/2/stderr):

log4j:WARN No appenders could be found for logger
(org.apache.hadoop.conf.Configuration).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
more info.
14/07/04 06:29:35 INFO SparkHadoopUtil: Using Spark's default log4j
profile: org/apache/spark/log4j-defaults.properties
14/07/04 06:29:35 INFO SecurityManager: Changing view acls to: user1
14/07/04 06:29:35 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(user1)
14/07/04 06:29:35 INFO Slf4jLogger: Slf4jLogger started
14/07/04 06:29:35 INFO Remoting: Starting remoting
14/07/04 06:29:36 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sparkExecutor@slave1:54782]
14/07/04 06:29:36 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://sparkExecutor@slave1:54782]
14/07/04 06:29:36 INFO CoarseGrainedExecutorBackend: Connecting to driver:
akka.tcp://spark@master:45172/user/CoarseGrainedScheduler
14/07/04 06:29:36 INFO WorkerWatcher: Connecting to worker
akka.tcp://sparkWorker@slave1:40182/user/Worker
14/07/04 06:29:36 INFO WorkerWatcher: Successfully connected to
akka.tcp://sparkWorker@slave1:40182/user/Worker
14/07/04 06:29:36 INFO CoarseGrainedExecutorBackend: Successfully
registered with driver
14/07/04 06:29:36 INFO SecurityManager: Changing view acls to: user1
14/07/04 06:29:36 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(user1)
14/07/04 06:29:36 INFO Slf4jLogger: Slf4jLogger started
14/07/04 06:29:36 INFO Remoting: Starting remoting
14/07/04 06:29:36 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://spark@slave1:39753]
14/07/04 06:29:36 INFO SparkEnv: Connecting to MapOutputTracker:
akka.tcp://spark@master:45172/user/MapOutputTracker
14/07/04 06:29:36 INFO SparkEnv: Connecting to BlockManagerMaster:
akka.tcp://spark@master:45172/user/BlockManagerMaster
14/07/04 06:29:36 INFO DiskBlockManager: Created local directory at
/tmp/spark-local-20140704062936-6123
14/07/04 06:29:36 INFO MemoryStore: MemoryStore started with capacity 6.7
GB.
14/07/04 06:29:36 INFO ConnectionManager: Bound socket to port 50960 with
id = ConnectionManagerId(slave1,50960)
14/07/04 06:29:36 INFO BlockManagerMaster: Trying to register BlockManager
14/07/04 06:29:36 INFO BlockManagerMaster: Registered BlockManager
14/07/04 06:29:36 INFO HttpFileServer: HTTP File server directory is
/tmp/spark-42c2782f-60f8-45a7-9e11-c789fc87fe2e
14/07/04 06:29:36 INFO HttpServer: Starting HTTP Server
14/07/04 06:29:36 ERROR CoarseGrainedExecutorBackend: Driver Disassociated
[akka.tcp://sparkExecutor@slave1:54782] -> [akka.tcp://spark@master:45172]
disassociated! Shutting down.

For the initial executor that did start successfully, the corresponding log
messages (from spark-1.0.1-rc1/work/app-20140704062238-0033/1/stderr) were:
log4j:WARN No appenders could be found for logger
(org.apache.hadoop.conf.Configuration).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
more info.
14/07/04 06:22:39 INFO SparkHadoopUtil: Using Spark's default log4j
profile: org/apache/spark/log4j-defaults.properties
14/07/04 06:22:39 INFO SecurityManager: Changing view acls to: user1
14/07/04 06:22:39 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(user1)
14/07/04 06:22:39 INFO Slf4jLogger: Slf4jLogger started
14/07/04 06:22:39 INFO Remoting: Starting remoting
14/07/04 06:22:39 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sparkExecutor@slave1:50806]
14/07/04 06:22:39 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://sparkExecutor@slave1:50806]
14/07/04 06:22:39 INFO CoarseGrainedExecutorBackend: Connecting to driver:
akka.tcp://spark@master:45172/user/CoarseGrainedScheduler
14/07/04 06:22:39 INFO WorkerWatcher: Connecting to worker
akka.tcp://sparkWorker@slave1:40182/user/Worker
14/07/04 06:22:39 INFO WorkerWatcher: Successfully connected to
akka.tcp://sparkWorker@slave1:40182/user/Worker
14/07/04 06:22:40 INFO CoarseGrainedExecutorBackend: Successfully
registered with driver
14/07/04 06:22:40 INFO SecurityManager: Changing view acls to: user1
14/07/04 06:22:40 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(user1)
14/07/04 06:22:40 INFO Slf4jLogger: Slf4jLogger started
14/07/04 06:22:40 INFO Remoting: Starting remoting
14/07/04 06:22:40 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://spark@slave1:38558]
14/07/04 06:22:40 INFO SparkEnv: Connecting to MapOutputTracker:
akka.tcp://spark@master:45172/user/MapOutputTracker
14/07/04 06:22:40 INFO SparkEnv: Connecting to BlockManagerMaster:
akka.tcp://spark@master:45172/user/BlockManagerMaster
14/07/04 06:22:40 INFO DiskBlockManager: Created local directory at
/tmp/spark-local-20140704062240-6a65
14/07/04 06:22:40 INFO MemoryStore: MemoryStore started with capacity 6.7
GB.
14/07/04 06:22:40 INFO ConnectionManager: Bound socket to port 46901 with
id = ConnectionManagerId(slave1,46901)
14/07/04 06:22:40 INFO BlockManagerMaster: Trying to register BlockManager
14/07/04 06:22:40 INFO BlockManagerMaster: Registered BlockManager
14/07/04 06:22:40 INFO HttpFileServer: HTTP File server directory is
/tmp/spark-9eba78f9-8ae9-477c-9338-7222ae6fe306
14/07/04 06:22:40 INFO HttpServer: Starting HTTP Server
14/07/04 06:22:42 INFO CoarseGrainedExecutorBackend: Got assigned task 0
14/07/04 06:22:42 INFO Executor: Running task ID 0
14/07/04 06:22:42 INFO CoarseGrainedExecutorBackend: Got assigned task 2
14/07/04 06:22:42 INFO Executor: Running task ID 2
...



On Fri, Jul 4, 2014 at 5:52 AM, Xiangrui Meng <me...@gmail.com> wrote:

> The feature dimension is small. You don't need a big akka.frameSize.
> The default one (10M) should be sufficient. Did you cache the data
> before calling LRWithSGD? -Xiangrui
>
> On Thu, Jul 3, 2014 at 10:02 AM, Bharath Ravi Kumar <re...@gmail.com>
> wrote:
> > I tried another run after setting the driver memory to 8G (and
> > spark.akka.frameSize = 500 on the executors and the driver). In
> addition, I
> > also tried to reduce the amount of data that a single task processes, by
> > increasing the number of partitions (of the labeled points) to 120
> (instead
> > of 2 used earlier), and then setting max cores to 2. That made no
> difference
> > since, at the end of 120 tasks, the familiar error message appeared on a
> > slave:
> >
> > <snipped earlier logs>
> > 14/07/03 16:18:48 INFO CoarseGrainedExecutorBackend: Got assigned task
> 1436
> > 14/07/03 16:18:48 INFO Executor: Running task ID 1436
> > 14/07/03 16:18:53 INFO HadoopRDD: Input split:
> > file:~//2014-05-24-02/part-r-00014:0+2215337
> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
> > file:~//2014-05-24-02/part-r-00014:2215337+2215338
> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
> > file:~//2014-05-24-02/part-r-00003:0+2196429
> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
> > file:~//2014-05-24-02/part-r-00003:2196429+2196430
> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
> > file:~//2014-05-24-02/part-r-00010:0+2186751
> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
> > file:~//2014-05-24-02/part-r-00010:2186751+2186751
> > 14/07/03 16:18:54 INFO Executor: Serialized size of result for 1436 is
> > 5958822
> > 14/07/03 16:18:54 INFO Executor: Sending result for 1436 directly to
> driver
> > 14/07/03 16:18:54 INFO Executor: Finished task ID 1436
> > 14/07/03 16:18:54 INFO CoarseGrainedExecutorBackend: Got assigned task
> 1438
> > 14/07/03 16:18:54 INFO Executor: Running task ID 1438
> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
> > file:~//2014-05-24-02/part-r-00004:0+2209615
> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
> > file:~//2014-05-24-02/part-r-00004:2209615+2209616
> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
> > file:~//2014-05-24-02/part-r-00011:0+2202240
> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
> > file:~//2014-05-24-02/part-r-00011:2202240+2202240
> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
> > file:~//2014-05-24-02/part-r-00009:0+2194423
> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
> > file:~//2014-05-24-02/part-r-00009:2194423+2194424
> > 14/07/03 16:19:00 INFO Executor: Serialized size of result for 1438 is
> > 5958822
> > 14/07/03 16:19:00 INFO Executor: Sending result for 1438 directly to
> driver
> > 14/07/03 16:19:00 INFO Executor: Finished task ID 1438
> > 14/07/03 16:19:14 ERROR CoarseGrainedExecutorBackend: Driver
> Disassociated
> > [akka.tcp://sparkExecutor@slave1:51099] -> [akka.tcp://spark@master
> :58272]
> > disassociated! Shutting down.
> >
> >
> > The corresponding master logs were:
> >
> > 4/07/03 16:02:14 INFO Master: Registering app LogRegExp
> > 14/07/03 16:02:14 INFO Master: Registered app LogRegExp with ID
> > app-20140703160214-0028
> > 14/07/03 16:02:14 INFO Master: Launching executor
> app-20140703160214-0028/1
> > on worker worker-20140630124441-slave1-40182
> > 14/07/03 16:19:15 INFO Master: Removing executor
> app-20140703160214-0028/1
> > because it is EXITED
> > 14/07/03 16:19:15 INFO Master: Launching executor
> app-20140703160214-0028/2
> > on worker worker-20140630124441-slave1-40182
> > 14/07/03 16:19:15 INFO Master: Removing executor
> app-20140703160214-0028/0
> > because it is EXITED
> > 14/07/03 16:19:15 INFO Master: Launching executor
> app-20140703160214-0028/3
> > on worker worker-20140630102913-slave2-44735
> > 14/07/03 16:19:18 INFO Master: Removing executor
> app-20140703160214-0028/2
> > because it is EXITED
> > 14/07/03 16:19:18 INFO Master: Launching executor
> app-20140703160214-0028/4
> > on worker worker-20140630124441-slave1-40182
> > 14/07/03 16:19:18 INFO Master: Removing executor
> app-20140703160214-0028/3
> > because it is EXITED
> > 14/07/03 16:19:18 INFO Master: Launching executor
> app-20140703160214-0028/5
> > on worker worker-20140630102913-slave2-44735
> > 14/07/03 16:19:20 INFO Master: akka.tcp://spark@master:58272 got
> > disassociated, removing it.
> > 14/07/03 16:19:20 INFO Master: Removing app app-20140703160214-0028
> > 14/07/03 16:19:20 INFO Master: akka.tcp://spark@master:58272 got
> > disassociated, removing it.
> >
> >
> > Throughout the execution, I confirmed in the UI that driver memory used
> was
> > 0.0 B / 6.9 GB and each executor's memory showed 0.0 B / 12.1 GB even
> when
> > aggregate was being executed. On a related note, I noticed in the
> executors
> > tab that just before the entire job terminated, executors on slave1,
> slave2
> > and the driver "disappeared" momentarily from the active executors list.
> The
> > replacement  executors on slave1 and slave2 were re-spawned a couple of
> > times and appeared on the executors list again before they too died and
> the
> > job failed.
> > So it appears that no matter what the task input-result size, the
> execution
> > fails at the end of the stage corresponding to GradientDescent.aggregate
> > (and the preceding count() in GradientDescent goes through fine). Let me
> > know if you need any additional information.
> >
> >
> > On Thu, Jul 3, 2014 at 12:27 PM, Xiangrui Meng <me...@gmail.com> wrote:
> >>
> >> Could you check the driver memory in the executor tab of the Spark UI
> >> when the job is running? If it is too small, please set
> >> --driver-memory with spark-submit, e.g. 10g. Could you also attach the
> >> master log under spark/logs as well? -Xiangrui
> >>
> >> On Wed, Jul 2, 2014 at 9:34 AM, Bharath Ravi Kumar <reachbach@gmail.com
> >
> >> wrote:
> >> > Hi Xiangrui,
> >> >
> >> > The issue with aggregating/counting over large feature vectors (as
> part
> >> > of
> >> > LogisticRegressionWithSGD) continues to exist, but now in another
> form:
> >> > while the execution doesn't freeze (due to SPARK-1112), it now fails
> at
> >> > the
> >> > second or third gradient descent iteration consistently with an error
> >> > level
> >> > log message, but no stacktrace. I'm running against 1.0.1-rc1, and
> have
> >> > tried setting spark.akka.frameSize as high as 500. When the execution
> >> > fails,
> >> > each of the two executors log the following message (corresponding to
> >> > aggregate at GradientDescent.scala:178) :
> >> >
> >> > 14/07/02 14:09:09 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
> >> > maxBytesInFlight: 50331648, targetRequestSize: 10066329
> >> > 14/07/02 14:09:09 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
> >> > Getting 2 non-empty blocks out of 2 blocks
> >> > 14/07/02 14:09:09 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
> >> > Started 1 remote fetches in 0 ms
> >> > 14/07/02 14:09:11 INFO Executor: Serialized size of result for 737 is
> >> > 5959086
> >> > 14/07/02 14:09:11 INFO Executor: Sending result for 737 directly to
> >> > driver
> >> > 14/07/02 14:09:11 INFO Executor: Finished task ID 737
> >> > 14/07/02 14:09:18 ERROR CoarseGrainedExecutorBackend: Driver
> >> > Disassociated
> >> > [akka.tcp://sparkExecutor@(slave1,slave2):51941] ->
> >> > [akka.tcp://spark@master:59487] disassociated! Shutting down.
> >> >
> >> >
> >> > There is no separate stacktrace on the driver side.
> >> >
> >> > Each input record is of the form p1, p2, (p1,p2), X where p1, p2 &
> >> > (p1,p2) are categorical features with large cardinality, and X is the
> >> > label, a double with a continuous value. The categorical variables are
> >> > converted to binary
> >> > variables which results in a feature vector of size 741092 (composed
> of
> >> > all
> >> > unique categories across p1, p2 and (p1,p2)). Thus, the labeled point
> >> > for each
> >> > input record is a sparse vector of size 741092 with only 3 variables
> set
> >> > in
> >> > the record. The total number of records is 683233 after aggregating
> the
> >> > input data on (p1, p2). When attempting to train on the unaggregated
> >> > records
> >> > (1337907 in number spread across 455 files), the execution fails at
> >> > count,
> >> > GradientDescent.scala:161 with the following log
> >> >
> >> >
> >> > (Snipped lines corresponding to other input files)
> >> > 14/07/02 16:02:03 INFO HadoopRDD: Input split:
> >> > file:~/part-r-00012:2834590+2834590
> >> > 14/07/02 16:02:03 INFO HadoopRDD: Input split:
> >> > file:~/part-r-00005:0+2845559
> >> > 14/07/02 16:02:03 INFO HadoopRDD: Input split:
> >> > file:~/part-r-00005:2845559+2845560
> >> > 14/07/02 16:02:03 INFO Executor: Serialized size of result for 726 is
> >> > 615
> >> > 14/07/02 16:02:03 INFO Executor: Sending result for 726 directly to
> >> > driver
> >> > 14/07/02 16:02:03 INFO Executor: Finished task ID 726
> >> > 14/07/02 16:02:12 ERROR CoarseGrainedExecutorBackend: Driver
> >> > Disassociated
> >> > [akka.tcp://sparkExecutor@slave1:48423] ->
> >> > [akka.tcp://spark@master:55792]
> >> > disassociated! Shutting down.
> >> >
> >> > A count() attempted on the input RDD before beginning training has the
> >> > following metrics:
> >> >
> >> >
> >> > Metric                            Min     25th    Median  75th    Max
> >> > Result serialization time         0 ms    0 ms    0 ms    0 ms    0 ms
> >> > Duration                          33 s    33 s    35 s    35 s    35 s
> >> > Time spent fetching task results  0 ms    0 ms    0 ms    0 ms    0 ms
> >> > Scheduler delay                   0.1 s   0.1 s   0.3 s   0.3 s   0.3 s
> >> >
> >> > Aggregated Metrics by Executor
> >> >
> >> > ID  Address              Task Time  Total  Failed  Succeeded  Shuffle Read  Shuffle Write  Shuf Spill (Mem)  Shuf Spill (Disk)
> >> > 0   CANNOT FIND ADDRESS  34 s       1      0       1          0.0 B         0.0 B          0.0 B             0.0 B
> >> > 1   CANNOT FIND ADDRESS  36 s       1      0       1          0.0 B         0.0 B          0.0 B             0.0 B
> >> >
> >> > Tasks
> >> >
> >> > Task Index  Task ID  Status   Locality Level  Executor  Launch Time          Duration  GC Time  Result Ser Time  Errors
> >> > 0           726      SUCCESS  PROCESS_LOCAL   slave1    2014/07/02 16:01:28  35 s      0.1 s
> >> > 1           727      SUCCESS  PROCESS_LOCAL   slave2    2014/07/02 16:01:28  33 s      99 ms
> >> >
> >> > Any pointers / diagnosis please?
> >> >
> >> >
> >> >
> >> >
> >> > On Thu, Jun 19, 2014 at 10:03 AM, Bharath Ravi Kumar
> >> > <re...@gmail.com>
> >> > wrote:
> >> >>
> >> >> Thanks. I'll await the fix to re-run my test.
> >> >>
> >> >>
> >> >> On Thu, Jun 19, 2014 at 8:28 AM, Xiangrui Meng <me...@gmail.com>
> >> >> wrote:
> >> >>>
> >> >>> Hi Bharath,
> >> >>>
> >> >>> This is related to SPARK-1112, which we already found the root
> cause.
> >> >>> I will let you know when this is fixed.
> >> >>>
> >> >>> Best,
> >> >>> Xiangrui
> >> >>>
> >> >>> On Tue, Jun 17, 2014 at 7:37 PM, Bharath Ravi Kumar
> >> >>> <re...@gmail.com>
> >> >>> wrote:
> >> >>> > Couple more points:
> >> >>> > 1) The inexplicable stalling of execution with large feature sets
> >> >>> > appears
> >> >>> > similar to that reported with the news-20 dataset:
> >> >>> >
> >> >>> >
> >> >>> >
> http://mail-archives.apache.org/mod_mbox/spark-user/201406.mbox/%3C53A03542.1010605@gmail.com%3E
> >> >>> >
> >> >>> > 2) The NPE hit when calling mapToPair to convert an RDD<Long,
> >> >>> > Long, Integer, Integer> into a JavaPairRDD<Tuple2<Long,Long>,
> >> >>> > Tuple2<Integer,Integer>> is unrelated to mllib.
> >> >>> >
> >> >>> > Thanks,
> >> >>> > Bharath
> >> >>> >
> >> >>> >
> >> >>> >
> >> >>> > On Wed, Jun 18, 2014 at 7:14 AM, Bharath Ravi Kumar
> >> >>> > <re...@gmail.com>
> >> >>> > wrote:
> >> >>> >>
> >> >>> >> Hi  Xiangrui ,
> >> >>> >>
> >> >>> >> I'm using 1.0.0.
> >> >>> >>
> >> >>> >> Thanks,
> >> >>> >> Bharath
> >> >>> >>
> >> >>> >> On 18-Jun-2014 1:43 am, "Xiangrui Meng" <me...@gmail.com>
> wrote:
> >> >>> >>>
> >> >>> >>> Hi Bharath,
> >> >>> >>>
> >> >>> >>> Thanks for posting the details! Which Spark version are you
> using?
> >> >>> >>>
> >> >>> >>> Best,
> >> >>> >>> Xiangrui
> >> >>> >>>
> >> >>> >>> On Tue, Jun 17, 2014 at 6:48 AM, Bharath Ravi Kumar
> >> >>> >>> <re...@gmail.com>
> >> >>> >>> wrote:
> >> >>> >>> > Hi,
> >> >>> >>> >
> >> >>> >>> > (Apologies for the long mail, but it's necessary to provide
> >> >>> >>> > sufficient
> >> >>> >>> > details considering the number of issues faced.)
> >> >>> >>> >
> >> >>> >>> > I'm running into issues testing LogisticRegressionWithSGD on a
> >> >>> >>> > two node cluster
> >> >>> >>> > (each node with 24 cores and 16G available to slaves out of
> 24G
> >> >>> >>> > on
> >> >>> >>> > the
> >> >>> >>> > system). Here's a description of the application:
> >> >>> >>> >
> >> >>> >>> > The model is being trained based on categorical features x, y,
> >> >>> >>> > and
> >> >>> >>> > (x,y).
> >> >>> >>> > The categorical features are mapped to binary features by
> >> >>> >>> > converting
> >> >>> >>> > each
> >> >>> >>> > distinct value in the category enum into a binary feature by
> >> >>> >>> > itself
> >> >>> >>> > (i.e.
> >> >>> >>> > presence of that value in a record implies corresponding
> feature
> >> >>> >>> > =
> >> >>> >>> > 1,
> >> >>> >>> > else
> >> >>> >>> > feature = 0. So, there'd be as many distinct features as enum
> >> >>> >>> > values).
> >> >>> >>> > The
> >> >>> >>> > training vector is laid out as
> >> >>> >>> > [x1,x2...xn,y1,y2....yn,(x1,y1),(x2,y2)...(xn,yn)]. Each
> record
> >> >>> >>> > in
> >> >>> >>> > the
> >> >>> >>> > training data has only one combination (Xk,Yk) and a label
> >> >>> >>> > appearing in
> >> >>> >>> > the
> >> >>> >>> > record. Thus, the corresponding labeledpoint sparse vector
> would
> >> >>> >>> > only
> >> >>> >>> > have 3
> >> >>> >>> > values Xk, Yk, (Xk,Yk) set for a record. The total length of
> the
> >> >>> >>> > vector
> >> >>> >>> > (though sparse) would be nearly 614000.  The number of records
> is
> >> >>> >>> > about
> >> >>> >>> > 1.33
> >> >>> >>> > million. The records have been coalesced into 20 partitions
> >> >>> >>> > across
> >> >>> >>> > two
> >> >>> >>> > nodes. The input data has not been cached.
> >> >>> >>> > (NOTE: I do realize the records & features may seem large for
> a
> >> >>> >>> > two
> >> >>> >>> > node
> >> >>> >>> > setup, but given the memory & cpu, and the fact that I'm
> willing
> >> >>> >>> > to
> >> >>> >>> > give up
> >> >>> >>> > some turnaround time, I don't see why tasks should
> inexplicably
> >> >>> >>> > fail)
> >> >>> >>> >
> >> >>> >>> > Additional parameters include:
> >> >>> >>> >
> >> >>> >>> > spark.executor.memory = 14G
> >> >>> >>> > spark.default.parallelism = 1
> >> >>> >>> > spark.cores.max=20
> >> >>> >>> > spark.storage.memoryFraction=0.8 //No cache space required
> >> >>> >>> > (Trying to set spark.akka.frameSize to a larger number, say,
> 20
> >> >>> >>> > didn't
> >> >>> >>> > help
> >> >>> >>> > either)
> >> >>> >>> >
> >> >>> >>> > The model training was initialized as: new
> >> >>> >>> > LogisticRegressionWithSGD(1,
> >> >>> >>> > maxIterations, 0.0, 0.05)
> >> >>> >>> >
> >> >>> >>> > However, after 4 iterations of gradient descent, the entire
> >> >>> >>> > execution
> >> >>> >>> > appeared to stall inexplicably. The corresponding executor
> >> >>> >>> > details
> >> >>> >>> > and
> >> >>> >>> > details of the stalled stage (number 14) are as follows:
> >> >>> >>> >
> >> >>> >>> > Metric                            Min     25th    Median  75th    Max
> >> >>> >>> > Result serialization time         12 ms   13 ms   14 ms   16 ms   18 ms
> >> >>> >>> > Duration                          4 s     4 s     5 s     5 s     5 s
> >> >>> >>> > Time spent fetching task results  0 ms    0 ms    0 ms    0 ms    0 ms
> >> >>> >>> > Scheduler delay                   6 s     6 s     6 s     6 s     12 s
> >> >>> >>> >
> >> >>> >>> >
> >> >>> >>> > Stage Id
> >> >>> >>> > 14 aggregate at GradientDescent.scala:178
> >> >>> >>> >
> >> >>> >>> > Task Index  Task ID  Status   Locality Level  Executor                     Launch Time          Duration  GC Time  Result Ser Time  Errors
> >> >>> >>> > 0           600      RUNNING  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  1.1 h
> >> >>> >>> > 1           601      RUNNING  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  1.1 h
> >> >>> >>> > 2           602      RUNNING  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  1.1 h
> >> >>> >>> > 3           603      RUNNING  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  1.1 h
> >> >>> >>> > 4           604      RUNNING  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  1.1 h
> >> >>> >>> > 5           605      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  4 s       2 s      12 ms
> >> >>> >>> > 6           606      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  4 s       1 s      14 ms
> >> >>> >>> > 7           607      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  4 s       2 s      12 ms
> >> >>> >>> > 8           608      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  5 s       1 s      15 ms
> >> >>> >>> > 9           609      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  5 s       1 s      14 ms
> >> >>> >>> > 10          610      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  5 s       1 s      15 ms
> >> >>> >>> > 11          611      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  4 s       1 s      13 ms
> >> >>> >>> > 12          612      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  5 s       1 s      18 ms
> >> >>> >>> > 13          613      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  5 s       1 s      13 ms
> >> >>> >>> > 14          614      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  4 s       1 s      14 ms
> >> >>> >>> > 15          615      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  4 s       1 s      12 ms
> >> >>> >>> > 16          616      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  5 s       1 s      15 ms
> >> >>> >>> > 17          617      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  5 s       1 s      18 ms
> >> >>> >>> > 18          618      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  5 s       1 s      16 ms
> >> >>> >>> > 19          619      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  4 s       1 s      18 ms
> >> >>> >>> >
> >> >>> >>> > Executor stats:
> >> >>> >>> >
> >> >>> >>> > RDD Blocks  Memory Used     Disk Used  Active Tasks  Failed Tasks  Complete Tasks  Total Tasks  Task Time  Shuffle Read  Shuffle Write
> >> >>> >>> > 0           0.0 B / 6.7 GB  0.0 B      2             0             307             309          23.2 m     0.0 B         0.0 B
> >> >>> >>> > 0           0.0 B / 6.7 GB  0.0 B      3             0             308             311          22.4 m     0.0 B         0.0 B
> >> >>> >>> >
> >> >>> >>> >
> >> >>> >>> > Executor jmap output:
> >> >>> >>> >
> >> >>> >>> > Server compiler detected.
> >> >>> >>> > JVM version is 24.55-b03
> >> >>> >>> >
> >> >>> >>> > using thread-local object allocation.
> >> >>> >>> > Parallel GC with 18 thread(s)
> >> >>> >>> >
> >> >>> >>> > Heap Configuration:
> >> >>> >>> >    MinHeapFreeRatio = 40
> >> >>> >>> >    MaxHeapFreeRatio = 70
> >> >>> >>> >    MaxHeapSize      = 10737418240 (10240.0MB)
> >> >>> >>> >    NewSize          = 1310720 (1.25MB)
> >> >>> >>> >    MaxNewSize       = 17592186044415 MB
> >> >>> >>> >    OldSize          = 5439488 (5.1875MB)
> >> >>> >>> >    NewRatio         = 2
> >> >>> >>> >    SurvivorRatio    = 8
> >> >>> >>> >    PermSize         = 21757952 (20.75MB)
> >> >>> >>> >    MaxPermSize      = 134217728 (128.0MB)
> >> >>> >>> >    G1HeapRegionSize = 0 (0.0MB)
> >> >>> >>> >
> >> >>> >>> > Heap Usage:
> >> >>> >>> > PS Young Generation
> >> >>> >>> > Eden Space:
> >> >>> >>> >    capacity = 2783969280 (2655.0MB)
> >> >>> >>> >    used     = 192583816 (183.66223907470703MB)
> >> >>> >>> >    free     = 2591385464 (2471.337760925293MB)
> >> >>> >>> >    6.917598458557704% used
> >> >>> >>> > From Space:
> >> >>> >>> >    capacity = 409993216 (391.0MB)
> >> >>> >>> >    used     = 1179808 (1.125152587890625MB)
> >> >>> >>> >    free     = 408813408 (389.8748474121094MB)
> >> >>> >>> >    0.2877628102022059% used
> >> >>> >>> > To Space:
> >> >>> >>> >    capacity = 385351680 (367.5MB)
> >> >>> >>> >    used     = 0 (0.0MB)
> >> >>> >>> >    free     = 385351680 (367.5MB)
> >> >>> >>> >    0.0% used
> >> >>> >>> > PS Old Generation
> >> >>> >>> >    capacity = 7158628352 (6827.0MB)
> >> >>> >>> >    used     = 4455093024 (4248.707794189453MB)
> >> >>> >>> >    free     = 2703535328 (2578.292205810547MB)
> >> >>> >>> >    62.2338918146983% used
> >> >>> >>> > PS Perm Generation
> >> >>> >>> >    capacity = 90701824 (86.5MB)
> >> >>> >>> >    used     = 45348832 (43.248016357421875MB)
> >> >>> >>> >    free     = 45352992 (43.251983642578125MB)
> >> >>> >>> >    49.99770677158598% used
> >> >>> >>> >
> >> >>> >>> > 8432 interned Strings occupying 714672 bytes.
> >> >>> >>> >
> >> >>> >>> >
> >> >>> >>> > Executor GC log snippet:
> >> >>> >>> >
> >> >>> >>> > 168.778: [GC [PSYoungGen: 2702831K->578545K(2916864K)]
> >> >>> >>> > 9302453K->7460857K(9907712K), 0.3193550 secs] [Times:
> user=5.13
> >> >>> >>> > sys=0.39,
> >> >>> >>> > real=0.32 secs]
> >> >>> >>> > 169.097: [Full GC [PSYoungGen: 578545K->0K(2916864K)]
> >> >>> >>> > [ParOldGen:
> >> >>> >>> > 6882312K->1073297K(6990848K)] 7460857K->1073297K(9907712K)
> >> >>> >>> > [PSPermGen:
> >> >>> >>> > 44248K->44201K(88576K)], 4.5521090 secs] [Times: user=24.22
> >> >>> >>> > sys=0.18,
> >> >>> >>> > real=4.55 secs]
> >> >>> >>> > 174.207: [GC [PSYoungGen: 2338304K->81315K(2544128K)]
> >> >>> >>> > 3411653K->1154665K(9534976K), 0.0966280 secs] [Times:
> user=1.66
> >> >>> >>> > sys=0.00,
> >> >>> >>> > real=0.09 secs]
> >> >>> >>> >
> >> >>> >>> > I tried to map partitions to cores on the nodes. Increasing
> the
> >> >>> >>> > number
> >> >>> >>> > of
> >> >>> >>> > partitions (say to 80 or 100) would result in progress till
> the
> >> >>> >>> > 6th
> >> >>> >>> > iteration or so, but the next stage would stall as before with
> >> >>> >>> > no apparent root cause in the logs. With increased partitions,
> >> >>> >>> > the last stage that
> >> >>> >>> > completed
> >> >>> >>> > had
> >> >>> >>> > the following task times:
> >> >>> >>> >
> >> >>> >>> > Metric                            Min     25th    Median  75th    Max
> >> >>> >>> > Result serialization time         11 ms   12 ms   13 ms   15 ms   0.4 s
> >> >>> >>> > Duration                          0.5 s   0.9 s   1 s     3 s     7 s
> >> >>> >>> > Time spent fetching task results  0 ms    0 ms    0 ms    0 ms    0 ms
> >> >>> >>> > Scheduler delay                   5 s     6 s     6 s     7 s     12 s
> >> >>> >>> >
> >> >>> >>> > My hypothesis is that as the coefficient array becomes less
> >> >>> >>> > sparse
> >> >>> >>> > (with
> >> >>> >>> > successive iterations), the cost of the aggregate goes up to
> the
> >> >>> >>> > point
> >> >>> >>> > that
> >> >>> >>> > it stalls (which I failed to explain). Reducing the batch
> >> >>> >>> > fraction
> >> >>> >>> > to a
> >> >>> >>> > very
> >> >>> >>> > low number like 0.01 saw the iterations progress further, but
> >> >>> >>> > the
> >> >>> >>> > model
> >> >>> >>> > failed to converge in that case after a small number of
> >> >>> >>> > iterations.
> >> >>> >>> >
> >> >>> >>> >
> >> >>> >>> > I also tried reducing the number of records by aggregating on
> >> >>> >>> > (x,y) as the key (i.e. using aggregations instead of training on
> >> >>> >>> > every raw record), but encountered the following exception:
> >> >>> >>> >
> >> >>> >>> > Loss was due to java.lang.NullPointerException
> >> >>> >>> > java.lang.NullPointerException
> >> >>> >>> >         at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
> >> >>> >>> >         at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
> >> >>> >>> >         at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> >> >>> >>> >         at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:59)
> >> >>> >>> >         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
> >> >>> >>> >         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
> >> >>> >>> >         at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
> >> >>> >>> >         at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
> >> >>> >>> >         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> >> >>> >>> >         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> >> >>> >>> >         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> >> >>> >>> >         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
> >> >>> >>> >         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> >> >>> >>> >         at org.apache.spark.scheduler.Task.run(Task.scala:51)
> >> >>> >>> >         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
> >> >>> >>> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >> >>> >>> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >> >>> >>> >         at java.lang.Thread.run(Thread.java:745)
> >> >>> >>> >
> >> >>> >>> >
> >> >>> >>> > I'd appreciate any insights/comments about what may be causing
> >> >>> >>> > the execution to stall.
> >> >>> >>> >
> >> >>> >>> > If logs/tables appear poorly indented in the email, here's a gist
> >> >>> >>> > with relevant details:
> >> >>> >>> > https://gist.github.com/reachbach/a418ab2f01b639b624c1
> >> >>> >>> >
> >> >>> >>> > Thanks,
> >> >>> >>> > Bharath
> >> >>> >
> >> >>> >
> >> >>
> >> >>
> >> >
> >
> >
>

Re: Execution stalls in LogisticRegressionWithSGD

Posted by Xiangrui Meng <me...@gmail.com>.
The feature dimension is small. You don't need a big akka.frameSize.
The default one (10M) should be sufficient. Did you cache the data
before calling LRWithSGD? -Xiangrui
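For reference, a minimal sketch of the caching being suggested, written in
Scala against the 1.0-era MLlib API and assuming an existing SparkContext
sc (the input path and line format are illustrative assumptions, not taken
from this thread):

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Assumed line format: "label,idx1,idx2,idx3", with the three indices
    // already mapped into the 741092-dimensional binary feature space.
    val numFeatures = 741092
    val points = sc.textFile("/path/to/training").map { line =>
      val fields = line.split(',')
      val indices = fields.tail.map(_.toInt).sorted // sparse indices must be ascending
      LabeledPoint(fields(0).toDouble,
        Vectors.sparse(numFeatures, indices, Array.fill(indices.length)(1.0)))
    }.cache()      // keep the parsed points in memory across SGD iterations

    points.count() // materialize the cache once, before training begins
    val model = LogisticRegressionWithSGD.train(points, 100)

Without caching, each gradient descent iteration re-reads and re-parses the
input files, which multiplies the per-iteration cost.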

On Thu, Jul 3, 2014 at 10:02 AM, Bharath Ravi Kumar <re...@gmail.com> wrote:
> I tried another run after setting the driver memory to 8G (and
> spark.akka.frameSize = 500 on the executors and the driver). In addition, I
> also tried to reduce the amount of data that a single task processes, by
> increasing the number of partitions (of the labeled points) to 120 (instead
> of 2 used earlier), and then setting max cores to 2. That made no difference;
> at the end of the 120 tasks, the familiar error message appeared on a
> slave:
>
> <snipped earlier logs>
> 14/07/03 16:18:48 INFO CoarseGrainedExecutorBackend: Got assigned task 1436
> 14/07/03 16:18:48 INFO Executor: Running task ID 1436
> 14/07/03 16:18:53 INFO HadoopRDD: Input split:
> file:~//2014-05-24-02/part-r-00014:0+2215337
> 14/07/03 16:18:54 INFO HadoopRDD: Input split:
> file:~//2014-05-24-02/part-r-00014:2215337+2215338
> 14/07/03 16:18:54 INFO HadoopRDD: Input split:
> file:~//2014-05-24-02/part-r-00003:0+2196429
> 14/07/03 16:18:54 INFO HadoopRDD: Input split:
> file:~//2014-05-24-02/part-r-00003:2196429+2196430
> 14/07/03 16:18:54 INFO HadoopRDD: Input split:
> file:~//2014-05-24-02/part-r-00010:0+2186751
> 14/07/03 16:18:54 INFO HadoopRDD: Input split:
> file:~//2014-05-24-02/part-r-00010:2186751+2186751
> 14/07/03 16:18:54 INFO Executor: Serialized size of result for 1436 is
> 5958822
> 14/07/03 16:18:54 INFO Executor: Sending result for 1436 directly to driver
> 14/07/03 16:18:54 INFO Executor: Finished task ID 1436
> 14/07/03 16:18:54 INFO CoarseGrainedExecutorBackend: Got assigned task 1438
> 14/07/03 16:18:54 INFO Executor: Running task ID 1438
> 14/07/03 16:19:00 INFO HadoopRDD: Input split:
> file:~//2014-05-24-02/part-r-00004:0+2209615
> 14/07/03 16:19:00 INFO HadoopRDD: Input split:
> file:~//2014-05-24-02/part-r-00004:2209615+2209616
> 14/07/03 16:19:00 INFO HadoopRDD: Input split:
> file:~//2014-05-24-02/part-r-00011:0+2202240
> 14/07/03 16:19:00 INFO HadoopRDD: Input split:
> file:~//2014-05-24-02/part-r-00011:2202240+2202240
> 14/07/03 16:19:00 INFO HadoopRDD: Input split:
> file:~//2014-05-24-02/part-r-00009:0+2194423
> 14/07/03 16:19:00 INFO HadoopRDD: Input split:
> file:~//2014-05-24-02/part-r-00009:2194423+2194424
> 14/07/03 16:19:00 INFO Executor: Serialized size of result for 1438 is
> 5958822
> 14/07/03 16:19:00 INFO Executor: Sending result for 1438 directly to driver
> 14/07/03 16:19:00 INFO Executor: Finished task ID 1438
> 14/07/03 16:19:14 ERROR CoarseGrainedExecutorBackend: Driver Disassociated
> [akka.tcp://sparkExecutor@slave1:51099] -> [akka.tcp://spark@master:58272]
> disassociated! Shutting down.
>
>
> The corresponding master logs were:
>
> 14/07/03 16:02:14 INFO Master: Registering app LogRegExp
> 14/07/03 16:02:14 INFO Master: Registered app LogRegExp with ID
> app-20140703160214-0028
> 14/07/03 16:02:14 INFO Master: Launching executor app-20140703160214-0028/1
> on worker worker-20140630124441-slave1-40182
> 14/07/03 16:19:15 INFO Master: Removing executor app-20140703160214-0028/1
> because it is EXITED
> 14/07/03 16:19:15 INFO Master: Launching executor app-20140703160214-0028/2
> on worker worker-20140630124441-slave1-40182
> 14/07/03 16:19:15 INFO Master: Removing executor app-20140703160214-0028/0
> because it is EXITED
> 14/07/03 16:19:15 INFO Master: Launching executor app-20140703160214-0028/3
> on worker worker-20140630102913-slave2-44735
> 14/07/03 16:19:18 INFO Master: Removing executor app-20140703160214-0028/2
> because it is EXITED
> 14/07/03 16:19:18 INFO Master: Launching executor app-20140703160214-0028/4
> on worker worker-20140630124441-slave1-40182
> 14/07/03 16:19:18 INFO Master: Removing executor app-20140703160214-0028/3
> because it is EXITED
> 14/07/03 16:19:18 INFO Master: Launching executor app-20140703160214-0028/5
> on worker worker-20140630102913-slave2-44735
> 14/07/03 16:19:20 INFO Master: akka.tcp://spark@master:58272 got
> disassociated, removing it.
> 14/07/03 16:19:20 INFO Master: Removing app app-20140703160214-0028
> 14/07/03 16:19:20 INFO Master: akka.tcp://spark@master:58272 got
> disassociated, removing it.
>
>
> Throughout the execution, I confirmed in the UI that driver memory used was
> 0.0 B / 6.9 GB and each executor's memory showed 0.0 B / 12.1 GB even when
> aggregate was being executed. On a related note, I noticed in the executors
> tab that just before the entire job terminated, executors on slave1, slave2
> and the driver "disappeared" momentarily from the active executors list. The
> replacement executors on slave1 and slave2 were re-spawned a couple of
> times and appeared on the executors list again before they too died and the
> job failed.
> So it appears that no matter what the task input/result sizes are, the execution
> fails at the end of the stage corresponding to GradientDescent.aggregate
> (and the preceding count() in GradientDescent goes through fine). Let me
> know if you need any additional information.
>
>
> On Thu, Jul 3, 2014 at 12:27 PM, Xiangrui Meng <me...@gmail.com> wrote:
>>
>> Could you check the driver memory in the executor tab of the Spark UI
>> when the job is running? If it is too small, please set
>> --driver-memory with spark-submit, e.g. 10g. Could you also attach the
>> master log under spark/logs as well? -Xiangrui
>>
>> On Wed, Jul 2, 2014 at 9:34 AM, Bharath Ravi Kumar <re...@gmail.com>
>> wrote:
>> > Hi Xiangrui,
>> >
>> > The issue with aggregating/counting over large feature vectors (as part
>> > of
>> > LogisticRegressionWithSGD) continues to exist, but now in another form:
>> > while the execution doesn't freeze (due to SPARK-1112), it now fails at
>> > the
>> > second or third gradient descent iteration consistently with an error
>> > level
>> > log message, but no stacktrace. I'm running against 1.0.1-rc1, and have
>> > tried setting spark.akka.frameSize as high as 500. When the execution
>> > fails,
>> > each of the two executors log the following message (corresponding to
>> > aggregate at GradientDescent.scala:178) :
>> >
>> > 14/07/02 14:09:09 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
>> > maxBytesInFlight: 50331648, targetRequestSize: 10066329
>> > 14/07/02 14:09:09 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
>> > Getting 2 non-empty blocks out of 2 blocks
>> > 14/07/02 14:09:09 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
>> > Started 1 remote fetches in 0 ms
>> > 14/07/02 14:09:11 INFO Executor: Serialized size of result for 737 is
>> > 5959086
>> > 14/07/02 14:09:11 INFO Executor: Sending result for 737 directly to
>> > driver
>> > 14/07/02 14:09:11 INFO Executor: Finished task ID 737
>> > 14/07/02 14:09:18 ERROR CoarseGrainedExecutorBackend: Driver
>> > Disassociated
>> > [akka.tcp://sparkExecutor@(slave1,slave2):51941] ->
>> > [akka.tcp://spark@master:59487] disassociated! Shutting down.
>> >
>> >
>> > There is no separate stacktrace on the driver side.
>> >
>> > Each input record is of the form p1, p2, (p1,p2), X where p1, p2 &
>> > (p1,p2) are categorical features with large cardinality, and X is the
>> > double label with a continuous value. The categorical variables are
>> > converted to binary variables, which results in a feature vector of
>> > size 741092 (composed of all unique categories across p1, p2 and
>> > (p1,p2)). Thus, the labeled point for each input record is a sparse
>> > vector of size 741092 with only 3 variables set in the record. The
>> > total number of records is 683233 after aggregating the
>> > input data on (p1, p2). When attempting to train on the unaggregated
>> > records
>> > (1337907 in number spread across 455 files), the execution fails at
>> > count,
>> > GradientDescent.scala:161 with the following log
>> >
>> >
>> > (Snipped lines corresponding to other input files)
>> > 14/07/02 16:02:03 INFO HadoopRDD: Input split:
>> > file:~/part-r-00012:2834590+2834590
>> > 14/07/02 16:02:03 INFO HadoopRDD: Input split:
>> > file:~/part-r-00005:0+2845559
>> > 14/07/02 16:02:03 INFO HadoopRDD: Input split:
>> > file:~/part-r-00005:2845559+2845560
>> > 14/07/02 16:02:03 INFO Executor: Serialized size of result for 726 is
>> > 615
>> > 14/07/02 16:02:03 INFO Executor: Sending result for 726 directly to
>> > driver
>> > 14/07/02 16:02:03 INFO Executor: Finished task ID 726
>> > 14/07/02 16:02:12 ERROR CoarseGrainedExecutorBackend: Driver
>> > Disassociated
>> > [akka.tcp://sparkExecutor@slave1:48423] ->
>> > [akka.tcp://spark@master:55792]
>> > disassociated! Shutting down.
>> >
>> > A count() attempted on the input RDD before beginning training has the
>> > following metrics:
>> >
>> >
>> > Metric            Min        25th    Median    75th     Max
>> >
>> > Result
>> > serialization
>> > time            0 ms    0 ms    0 ms    0 ms    0 ms
>> >
>> > Duration        33 s    33 s    35 s    35 s    35 s
>> >
>> > Time spent
>> > fetching task
>> > results            0 ms    0 ms    0 ms    0 ms    0 ms
>> >
>> > Scheduler
>> > delay            0.1 s    0.1 s    0.3 s    0.3 s    0.3 s
>> >
>> > Aggregated Metrics by Executor
>> >
>> > ID     Address Task             Time Total Failed Succeeded Shuffle Read
>> > Shuffle Write     Shuf Spill (Mem)     Shuf Spill (Disk)
>> > 0     CANNOT FIND ADDRESS     34 s     1     0         1         0.0 B
>> > 0.0 B             0.0 B                 0.0 B
>> > 1     CANNOT FIND ADDRESS     36 s     1     0         1         0.0 B
>> > 0.0 B             0.0 B                 0.0 B
>> >
>> > Tasks
>> >
>> > Task Index    Task ID    Status    Locality Level    Executor    Launch
>> > Time
>> > Duration    GC Time    Result Ser Time    Errors
>> > 0     726     SUCCESS         PROCESS_LOCAL     slave1
>> > 2014/07/02
>> > 16:01:28 35 s         0.1 s
>> > 1     727     SUCCESS         PROCESS_LOCAL     slave2
>> > 2014/07/02
>> > 16:01:28 33 s         99 ms
>> >
>> > Any pointers / diagnosis please?
>> >
>> >
>> >
>> >
>> > On Thu, Jun 19, 2014 at 10:03 AM, Bharath Ravi Kumar
>> > <re...@gmail.com>
>> > wrote:
>> >>
>> >> Thanks. I'll await the fix to re-run my test.
>> >>
>> >>
>> >> On Thu, Jun 19, 2014 at 8:28 AM, Xiangrui Meng <me...@gmail.com>
>> >> wrote:
>> >>>
>> >>> Hi Bharath,
>> >>>
>> >>> This is related to SPARK-1112, for which we already found the root cause.
>> >>> I will let you know when this is fixed.
>> >>>
>> >>> Best,
>> >>> Xiangrui
>> >>>
>> >>> On Tue, Jun 17, 2014 at 7:37 PM, Bharath Ravi Kumar
>> >>> <re...@gmail.com>
>> >>> wrote:
>> >>> > Couple more points:
>> >>> > 1) The inexplicable stalling of execution with large feature sets
>> >>> > appears
>> >>> > similar to that reported with the news-20 dataset:
>> >>> >
>> >>> >
>> >>> > http://mail-archives.apache.org/mod_mbox/spark-user/201406.mbox/%3C53A03542.1010605@gmail.com%3E
>> >>> >
>> >>> > 2) The NPE when trying to call mapToPair to convert an RDD<Long, Long,
>> >>> > Integer, Integer> into a JavaPairRDD<Tuple2<Long,Long>,
>> >>> > Tuple2<Integer,Integer>> is unrelated to mllib.
>> >>> >
>> >>> > Thanks,
>> >>> > Bharath
>> >>> >
>> >>> >
>> >>> >
>> >>> > On Wed, Jun 18, 2014 at 7:14 AM, Bharath Ravi Kumar
>> >>> > <re...@gmail.com>
>> >>> > wrote:
>> >>> >>
>> >>> >> Hi  Xiangrui ,
>> >>> >>
>> >>> >> I'm using 1.0.0.
>> >>> >>
>> >>> >> Thanks,
>> >>> >> Bharath
>> >>> >>
>> >>> >> On 18-Jun-2014 1:43 am, "Xiangrui Meng" <me...@gmail.com> wrote:
>> >>> >>>
>> >>> >>> Hi Bharath,
>> >>> >>>
>> >>> >>> Thanks for posting the details! Which Spark version are you using?
>> >>> >>>
>> >>> >>> Best,
>> >>> >>> Xiangrui
>> >>> >>>
>> >>> >>> On Tue, Jun 17, 2014 at 6:48 AM, Bharath Ravi Kumar
>> >>> >>> <re...@gmail.com>
>> >>> >>> wrote:
>> >>> >>> > Hi,
>> >>> >>> >
>> >>> >>> > (Apologies for the long mail, but it's necessary to provide
>> >>> >>> > sufficient
>> >>> >>> > details considering the number of issues faced.)
>> >>> >>> >
>> >>> >>> > I'm running into issues testing LogisticRegressionWithSGD on a
>> >>> >>> > two node cluster (each node with 24 cores and 16G available to
>> >>> >>> > slaves out of 24G on the system). Here's a description of the
>> >>> >>> > application:
>> >>> >>> >
>> >>> >>> > The model is being trained based on categorical features x, y,
>> >>> >>> > and
>> >>> >>> > (x,y).
>> >>> >>> > The categorical features are mapped to binary features by
>> >>> >>> > converting
>> >>> >>> > each
>> >>> >>> > distinct value in the category enum into a binary feature by
>> >>> >>> > itself
>> >>> >>> > (i.e
>> >>> >>> > presence of that value in a record implies corresponding feature
>> >>> >>> > =
>> >>> >>> > 1,
>> >>> >>> > else
>> >>> >>> > feature = 0. So, there'd be as many distinct features as enum
>> >>> >>> > values) .
>> >>> >>> > The
>> >>> >>> > training vector is laid out as
>> >>> >>> > [x1,x2...xn,y1,y2....yn,(x1,y1),(x2,y2)...(xn,yn)]. Each record
>> >>> >>> > in
>> >>> >>> > the
>> >>> >>> > training data has only one combination (Xk,Yk) and a label
>> >>> >>> > appearing in
>> >>> >>> > the
>> >>> >>> > record. Thus, the corresponding labeledpoint sparse vector would
>> >>> >>> > only
>> >>> >>> > have 3
>> >>> >>> > values Xk, Yk, (Xk,Yk) set for a record. The total length of the
>> >>> >>> > vector (though sparse) would be nearly 614000. The number of
>> >>> >>> > records is about 1.33 million. The records have been coalesced
>> >>> >>> > into 20 partitions across two nodes. The input data has not been
>> >>> >>> > cached.
>> >>> >>> > (NOTE: I do realize the records & features may seem large for a
>> >>> >>> > two
>> >>> >>> > node
>> >>> >>> > setup, but given the memory & cpu, and the fact that I'm willing
>> >>> >>> > to
>> >>> >>> > give up
>> >>> >>> > some turnaround time, I don't see why tasks should inexplicably
>> >>> >>> > fail)
>> >>> >>> >
>> >>> >>> > Additional parameters include:
>> >>> >>> >
>> >>> >>> > spark.executor.memory = 14G
>> >>> >>> > spark.default.parallelism = 1
>> >>> >>> > spark.cores.max=20
>> >>> >>> > spark.storage.memoryFraction=0.8 //No cache space required
>> >>> >>> > (Trying to set spark.akka.frameSize to a larger number, say, 20
>> >>> >>> > didn't
>> >>> >>> > help
>> >>> >>> > either)
>> >>> >>> >
>> >>> >>> > The model training was initialized as : new
>> >>> >>> > LogisticRegressionWithSGD(1,
>> >>> >>> > maxIterations, 0.0, 0.05)
>> >>> >>> >
>> >>> >>> > However, after 4 iterations of gradient descent, the entire
>> >>> >>> > execution
>> >>> >>> > appeared to stall inexplicably. The corresponding executor
>> >>> >>> > details
>> >>> >>> > and
>> >>> >>> > details of the stalled stage (number 14) are as follows:
>> >>> >>> >
>> >>> >>> > Metric                       Min      25th     Median   75th     Max
>> >>> >>> > Result serialization time    12 ms    13 ms    14 ms    16 ms    18 ms
>> >>> >>> > Duration                     4 s      4 s      5 s      5 s      5 s
>> >>> >>> > Time spent fetching task     0 ms     0 ms     0 ms     0 ms     0 ms
>> >>> >>> > results
>> >>> >>> > Scheduler delay              6 s      6 s      6 s      6 s      12 s
>> >>> >>> >
>> >>> >>> >
>> >>> >>> > Stage Id
>> >>> >>> > 14 aggregate at GradientDescent.scala:178
>> >>> >>> >
>> >>> >>> > Task Index    Task ID    Status    Locality Level     Executor
>> >>> >>> > Launch Time                Duration    GC     Result Ser Time
>> >>> >>> > Errors
>> >>> >>> >
>> >>> >>> > Time
>> >>> >>> >
>> >>> >>> > 0     600     RUNNING     PROCESS_LOCAL
>> >>> >>> > serious.dataone.foo.bar.com
>> >>> >>> > 2014/06/17 10:32:27     1.1 h
>> >>> >>> > 1     601     RUNNING     PROCESS_LOCAL
>> >>> >>> > casual.dataone.foo.bar.com
>> >>> >>> > 2014/06/17 10:32:27         1.1 h
>> >>> >>> > 2     602     RUNNING     PROCESS_LOCAL
>> >>> >>> > serious.dataone.foo.bar.com
>> >>> >>> > 2014/06/17 10:32:27     1.1 h
>> >>> >>> > 3     603     RUNNING     PROCESS_LOCAL
>> >>> >>> > casual.dataone.foo.bar.com
>> >>> >>> > 2014/06/17 10:32:27         1.1 h
>> >>> >>> > 4     604     RUNNING     PROCESS_LOCAL
>> >>> >>> > serious.dataone.foo.bar.com
>> >>> >>> > 2014/06/17 10:32:27     1.1 h
>> >>> >>> > 5     605     SUCCESS     PROCESS_LOCAL
>> >>> >>> > casual.dataone.foo.bar.com
>> >>> >>> > 2014/06/17 10:32:27 4 s     2 s     12 ms
>> >>> >>> > 6     606     SUCCESS     PROCESS_LOCAL
>> >>> >>> > serious.dataone.foo.bar.com
>> >>> >>> > 2014/06/17 10:32:27     4 s     1 s     14 ms
>> >>> >>> > 7     607     SUCCESS     PROCESS_LOCAL
>> >>> >>> > casual.dataone.foo.bar.com
>> >>> >>> > 2014/06/17 10:32:27 4 s     2 s     12 ms
>> >>> >>> > 8     608     SUCCESS     PROCESS_LOCAL
>> >>> >>> > serious.dataone.foo.bar.com
>> >>> >>> > 2014/06/17 10:32:27     5 s     1 s     15 ms
>> >>> >>> > 9     609     SUCCESS     PROCESS_LOCAL
>> >>> >>> > casual.dataone.foo.bar.com
>> >>> >>> > 2014/06/17 10:32:27 5 s     1 s     14 ms
>> >>> >>> > 10     610     SUCCESS     PROCESS_LOCAL
>> >>> >>> > serious.dataone.foo.bar.com
>> >>> >>> > 2014/06/17 10:32:27     5 s     1 s     15 ms
>> >>> >>> > 11     611     SUCCESS     PROCESS_LOCAL
>> >>> >>> > casual.dataone.foo.bar.com
>> >>> >>> > 2014/06/17 10:32:27 4 s     1 s     13 ms
>> >>> >>> > 12     612     SUCCESS     PROCESS_LOCAL
>> >>> >>> > serious.dataone.foo.bar.com
>> >>> >>> > 2014/06/17 10:32:27     5 s     1 s     18 ms
>> >>> >>> > 13     613     SUCCESS     PROCESS_LOCAL
>> >>> >>> > casual.dataone.foo.bar.com
>> >>> >>> > 2014/06/17 10:32:27 5 s     1 s     13 ms
>> >>> >>> > 14     614     SUCCESS     PROCESS_LOCAL
>> >>> >>> > serious.dataone.foo.bar.com
>> >>> >>> > 2014/06/17 10:32:27     4 s     1 s     14 ms
>> >>> >>> > 15     615     SUCCESS     PROCESS_LOCAL
>> >>> >>> > casual.dataone.foo.bar.com
>> >>> >>> > 2014/06/17 10:32:27 4 s     1 s     12 ms
>> >>> >>> > 16     616     SUCCESS     PROCESS_LOCAL
>> >>> >>> > serious.dataone.foo.bar.com
>> >>> >>> > 2014/06/17 10:32:27     5 s     1 s     15 ms
>> >>> >>> > 17     617     SUCCESS     PROCESS_LOCAL
>> >>> >>> > casual.dataone.foo.bar.com
>> >>> >>> > 2014/06/17 10:32:27 5 s     1 s     18 ms
>> >>> >>> > 18     618     SUCCESS     PROCESS_LOCAL
>> >>> >>> > serious.dataone.foo.bar.com
>> >>> >>> > 2014/06/17 10:32:27     5 s     1 s     16 ms
>> >>> >>> > 19     619     SUCCESS     PROCESS_LOCAL
>> >>> >>> > casual.dataone.foo.bar.com
>> >>> >>> > 2014/06/17 10:32:27 4 s     1 s     18 ms
>> >>> >>> >
>> >>> >>> > Executor stats:
>> >>> >>> >
>> >>> >>> > RDD Blocks    Memory Used    Disk Used    Active Tasks    Failed
>> >>> >>> > Tasks
>> >>> >>> > Complete Tasks    Total Tasks    Task Time    Shuffle Read
>> >>> >>> > Shuffle
>> >>> >>> > Write
>> >>> >>> > 0     0.0 B / 6.7 GB         0.0 B         2                 0
>> >>> >>> > 307         309         23.2 m         0.0 B             0.0 B
>> >>> >>> > 0     0.0 B / 6.7 GB         0.0 B         3                 0
>> >>> >>> > 308         311         22.4 m         0.0 B             0.0 B
>> >>> >>> >
>> >>> >>> >
>> >>> >>> > Executor jmap output:
>> >>> >>> >
>> >>> >>> > Server compiler detected.
>> >>> >>> > JVM version is 24.55-b03
>> >>> >>> >
>> >>> >>> > using thread-local object allocation.
>> >>> >>> > Parallel GC with 18 thread(s)
>> >>> >>> >
>> >>> >>> > Heap Configuration:
>> >>> >>> >    MinHeapFreeRatio = 40
>> >>> >>> >    MaxHeapFreeRatio = 70
>> >>> >>> >    MaxHeapSize      = 10737418240 (10240.0MB)
>> >>> >>> >    NewSize          = 1310720 (1.25MB)
>> >>> >>> >    MaxNewSize       = 17592186044415 MB
>> >>> >>> >    OldSize          = 5439488 (5.1875MB)
>> >>> >>> >    NewRatio         = 2
>> >>> >>> >    SurvivorRatio    = 8
>> >>> >>> >    PermSize         = 21757952 (20.75MB)
>> >>> >>> >    MaxPermSize      = 134217728 (128.0MB)
>> >>> >>> >    G1HeapRegionSize = 0 (0.0MB)
>> >>> >>> >
>> >>> >>> > Heap Usage:
>> >>> >>> > PS Young Generation
>> >>> >>> > Eden Space:
>> >>> >>> >    capacity = 2783969280 (2655.0MB)
>> >>> >>> >    used     = 192583816 (183.66223907470703MB)
>> >>> >>> >    free     = 2591385464 (2471.337760925293MB)
>> >>> >>> >    6.917598458557704% used
>> >>> >>> > From Space:
>> >>> >>> >    capacity = 409993216 (391.0MB)
>> >>> >>> >    used     = 1179808 (1.125152587890625MB)
>> >>> >>> >    free     = 408813408 (389.8748474121094MB)
>> >>> >>> >    0.2877628102022059% used
>> >>> >>> > To Space:
>> >>> >>> >    capacity = 385351680 (367.5MB)
>> >>> >>> >    used     = 0 (0.0MB)
>> >>> >>> >    free     = 385351680 (367.5MB)
>> >>> >>> >    0.0% used
>> >>> >>> > PS Old Generation
>> >>> >>> >    capacity = 7158628352 (6827.0MB)
>> >>> >>> >    used     = 4455093024 (4248.707794189453MB)
>> >>> >>> >    free     = 2703535328 (2578.292205810547MB)
>> >>> >>> >    62.2338918146983% used
>> >>> >>> > PS Perm Generation
>> >>> >>> >    capacity = 90701824 (86.5MB)
>> >>> >>> >    used     = 45348832 (43.248016357421875MB)
>> >>> >>> >    free     = 45352992 (43.251983642578125MB)
>> >>> >>> >    49.99770677158598% used
>> >>> >>> >
>> >>> >>> > 8432 interned Strings occupying 714672 bytes.
>> >>> >>> >
>> >>> >>> >
>> >>> >>> > Executor GC log snippet:
>> >>> >>> >
>> >>> >>> > 168.778: [GC [PSYoungGen: 2702831K->578545K(2916864K)]
>> >>> >>> > 9302453K->7460857K(9907712K), 0.3193550 secs] [Times: user=5.13
>> >>> >>> > sys=0.39,
>> >>> >>> > real=0.32 secs]
>> >>> >>> > 169.097: [Full GC [PSYoungGen: 578545K->0K(2916864K)]
>> >>> >>> > [ParOldGen:
>> >>> >>> > 6882312K->1073297K(6990848K)] 7460857K->1073297K(9907712K)
>> >>> >>> > [PSPermGen:
>> >>> >>> > 44248K->44201K(88576K)], 4.5521090 secs] [Times: user=24.22
>> >>> >>> > sys=0.18,
>> >>> >>> > real=4.55 secs]
>> >>> >>> > 174.207: [GC [PSYoungGen: 2338304K->81315K(2544128K)]
>> >>> >>> > 3411653K->1154665K(9534976K), 0.0966280 secs] [Times: user=1.66
>> >>> >>> > sys=0.00,
>> >>> >>> > real=0.09 secs]
>> >>> >>> >
>> >>> >>> > I tried to map partitions to cores on the nodes. Increasing the
>> >>> >>> > number of partitions (say to 80 or 100) would result in progress
>> >>> >>> > till the 6th iteration or so, but the next stage would stall as
>> >>> >>> > before with no apparent root cause in the logs. With increased
>> >>> >>> > partitions, the last stage that completed had the following task
>> >>> >>> > times:
>> >>> >>> >
>> >>> >>> > Metric                       Min      25th     Median   75th     Max
>> >>> >>> > Result serialization time    11 ms    12 ms    13 ms    15 ms    0.4 s
>> >>> >>> > Duration                     0.5 s    0.9 s    1 s      3 s      7 s
>> >>> >>> > Time spent fetching          0 ms     0 ms     0 ms     0 ms     0 ms
>> >>> >>> > task results
>> >>> >>> > Scheduler delay              5 s      6 s      6 s      7 s      12 s
>> >>> >>> >
>> >>> >>> > My hypothesis is that as the coefficient array becomes less
>> >>> >>> > sparse (with successive iterations), the cost of the aggregate
>> >>> >>> > goes up to the point that it stalls (which I was unable to
>> >>> >>> > explain). Reducing the batch fraction to a very low number like
>> >>> >>> > 0.01 allowed the iterations to progress further, but the model
>> >>> >>> > then failed to converge after a small number of iterations.
>> >>> >>> >
>> >>> >>> >
>> >>> >>> > I also tried reducing the number of records by aggregating on
>> >>> >>> > (x,y) as the key (i.e. using aggregations instead of training on
>> >>> >>> > every raw record), but encountered the following exception:
>> >>> >>> >
>> >>> >>> > Loss was due to java.lang.NullPointerException
>> >>> >>> > java.lang.NullPointerException
>> >>> >>> >         at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
>> >>> >>> >         at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
>> >>> >>> >         at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> >>> >>> >         at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:59)
>> >>> >>> >         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
>> >>> >>> >         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
>> >>> >>> >         at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
>> >>> >>> >         at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
>> >>> >>> >         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> >>> >>> >         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>> >>> >>> >         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>> >>> >>> >         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
>> >>> >>> >         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>> >>> >>> >         at org.apache.spark.scheduler.Task.run(Task.scala:51)
>> >>> >>> >         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>> >>> >>> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >>> >>> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >>> >>> >         at java.lang.Thread.run(Thread.java:745)
>> >>> >>> >
>> >>> >>> >
>> >>> >>> > I'd appreciate any insights/comments about what may be causing
>> >>> >>> > the execution to stall.
>> >>> >>> >
>> >>> >>> > If logs/tables appear poorly indented in the email, here's a gist
>> >>> >>> > with relevant details:
>> >>> >>> > https://gist.github.com/reachbach/a418ab2f01b639b624c1
>> >>> >>> >
>> >>> >>> > Thanks,
>> >>> >>> > Bharath
>> >>> >
>> >>> >
>> >>
>> >>
>> >
>
>

Re: Execution stalls in LogisticRegressionWithSGD

Posted by Bharath Ravi Kumar <re...@gmail.com>.
I tried another run after setting the driver memory to 8G (and
spark.akka.frameSize = 500 on the executors and the driver). In addition, I
also tried to reduce the amount of data that a single task processes, by
increasing the number of partitions (of the labeled points) to 120 (instead
of 2 used earlier), and then setting max cores to 2. That made no
difference; at the end of the 120 tasks, the familiar error message
appeared on a slave:

<snipped earlier logs>
14/07/03 16:18:48 INFO CoarseGrainedExecutorBackend: Got assigned task 1436
14/07/03 16:18:48 INFO Executor: Running task ID 1436
14/07/03 16:18:53 INFO HadoopRDD: Input split:
file:~//2014-05-24-02/part-r-00014:0+2215337
14/07/03 16:18:54 INFO HadoopRDD: Input split:
file:~//2014-05-24-02/part-r-00014:2215337+2215338
14/07/03 16:18:54 INFO HadoopRDD: Input split:
file:~//2014-05-24-02/part-r-00003:0+2196429
14/07/03 16:18:54 INFO HadoopRDD: Input split:
file:~//2014-05-24-02/part-r-00003:2196429+2196430
14/07/03 16:18:54 INFO HadoopRDD: Input split:
file:~//2014-05-24-02/part-r-00010:0+2186751
14/07/03 16:18:54 INFO HadoopRDD: Input split:
file:~//2014-05-24-02/part-r-00010:2186751+2186751
14/07/03 16:18:54 INFO Executor: Serialized size of result for 1436 is
5958822
14/07/03 16:18:54 INFO Executor: Sending result for 1436 directly to driver
14/07/03 16:18:54 INFO Executor: Finished task ID 1436
14/07/03 16:18:54 INFO CoarseGrainedExecutorBackend: Got assigned task 1438
14/07/03 16:18:54 INFO Executor: Running task ID 1438
14/07/03 16:19:00 INFO HadoopRDD: Input split:
file:~//2014-05-24-02/part-r-00004:0+2209615
14/07/03 16:19:00 INFO HadoopRDD: Input split:
file:~//2014-05-24-02/part-r-00004:2209615+2209616
14/07/03 16:19:00 INFO HadoopRDD: Input split:
file:~//2014-05-24-02/part-r-00011:0+2202240
14/07/03 16:19:00 INFO HadoopRDD: Input split:
file:~//2014-05-24-02/part-r-00011:2202240+2202240
14/07/03 16:19:00 INFO HadoopRDD: Input split:
file:~//2014-05-24-02/part-r-00009:0+2194423
14/07/03 16:19:00 INFO HadoopRDD: Input split:
file:~//2014-05-24-02/part-r-00009:2194423+2194424
14/07/03 16:19:00 INFO Executor: Serialized size of result for 1438 is
5958822
14/07/03 16:19:00 INFO Executor: Sending result for 1438 directly to driver
14/07/03 16:19:00 INFO Executor: Finished task ID 1438
14/07/03 16:19:14 ERROR CoarseGrainedExecutorBackend: Driver Disassociated
[akka.tcp://sparkExecutor@slave1:51099] -> [akka.tcp://spark@master:58272]
disassociated! Shutting down.


The corresponding master logs were:

14/07/03 16:02:14 INFO Master: Registering app LogRegExp
14/07/03 16:02:14 INFO Master: Registered app LogRegExp with ID
app-20140703160214-0028
14/07/03 16:02:14 INFO Master: Launching executor app-20140703160214-0028/1
on worker worker-20140630124441-slave1-40182
14/07/03 16:19:15 INFO Master: Removing executor app-20140703160214-0028/1
because it is EXITED
14/07/03 16:19:15 INFO Master: Launching executor app-20140703160214-0028/2
on worker worker-20140630124441-slave1-40182
14/07/03 16:19:15 INFO Master: Removing executor app-20140703160214-0028/0
because it is EXITED
14/07/03 16:19:15 INFO Master: Launching executor app-20140703160214-0028/3
on worker worker-20140630102913-slave2-44735
14/07/03 16:19:18 INFO Master: Removing executor app-20140703160214-0028/2
because it is EXITED
14/07/03 16:19:18 INFO Master: Launching executor app-20140703160214-0028/4
on worker worker-20140630124441-slave1-40182
14/07/03 16:19:18 INFO Master: Removing executor app-20140703160214-0028/3
because it is EXITED
14/07/03 16:19:18 INFO Master: Launching executor app-20140703160214-0028/5
on worker worker-20140630102913-slave2-44735
14/07/03 16:19:20 INFO Master: akka.tcp://spark@master:58272 got
disassociated, removing it.
14/07/03 16:19:20 INFO Master: Removing app app-20140703160214-0028
14/07/03 16:19:20 INFO Master: akka.tcp://spark@master:58272 got
disassociated, removing it.


Throughout the execution, I confirmed in the UI that driver memory used was
0.0 B / 6.9 GB and each executor's memory showed 0.0 B / 12.1 GB even when
aggregate was being executed. On a related note, I noticed in the executors
tab that just before the entire job terminated, executors on slave1, slave2
and the driver "disappeared" momentarily from the active executors list.
The replacement executors on slave1 and slave2 were re-spawned a couple of
times and appeared on the executors list again before they too died and the
job failed.
So it appears that no matter what the task input/result sizes are, the execution
fails at the end of the stage corresponding to GradientDescent.aggregate
(and the preceding count() in GradientDescent goes through fine). Let me
know if you need any additional information.
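For reference, a sketch of the configuration described above, using the
values from this run (driver memory was set separately, e.g. as
spark-submit --driver-memory 8g; the labeledPoints variable is an
illustrative placeholder for the parsed training RDD):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("LogRegExp")
      .set("spark.akka.frameSize", "500") // raised from the default
      .set("spark.cores.max", "2")        // cap on concurrently used cores
    val sc = new SparkContext(conf)

    // 120 partitions instead of the earlier 2, so each task handles less data
    val repartitioned = labeledPoints.repartition(120)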


On Thu, Jul 3, 2014 at 12:27 PM, Xiangrui Meng <me...@gmail.com> wrote:

> Could you check the driver memory in the executor tab of the Spark UI
> when the job is running? If it is too small, please set
> --driver-memory with spark-submit, e.g. 10g. Could you also attach the
> master log under spark/logs as well? -Xiangrui
>
> On Wed, Jul 2, 2014 at 9:34 AM, Bharath Ravi Kumar <re...@gmail.com>
> wrote:
> > Hi Xiangrui,
> >
> > The issue with aggregating/counting over large feature vectors (as part
> of
> > LogisticRegressionWithSGD) continues to exist, but now in another form:
> > while the execution doesn't freeze (due to SPARK-1112), it now fails at
> the
> > second or third gradient descent iteration consistently with an error
> level
> > log message, but no stacktrace. I'm running against 1.0.1-rc1, and have
> > tried setting spark.akka.frameSize as high as 500. When the execution
> fails,
> > each of the two executors log the following message (corresponding to
> > aggregate at GradientDescent.scala:178) :
> >
> > 14/07/02 14:09:09 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
> > maxBytesInFlight: 50331648, targetRequestSize: 10066329
> > 14/07/02 14:09:09 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
> > Getting 2 non-empty blocks out of 2 blocks
> > 14/07/02 14:09:09 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
> > Started 1 remote fetches in 0 ms
> > 14/07/02 14:09:11 INFO Executor: Serialized size of result for 737 is
> > 5959086
> > 14/07/02 14:09:11 INFO Executor: Sending result for 737 directly to
> driver
> > 14/07/02 14:09:11 INFO Executor: Finished task ID 737
> > 14/07/02 14:09:18 ERROR CoarseGrainedExecutorBackend: Driver
> Disassociated
> > [akka.tcp://sparkExecutor@(slave1,slave2):51941] ->
> > [akka.tcp://spark@master:59487] disassociated! Shutting down.
> >
> >
> > There is no separate stacktrace on the driver side.
> >
> > Each input record is of the form p1, p2, (p1,p2), X where p1, p2 &
> > (p1,p2) are categorical features with large cardinality, and X is the
> > double label with a continuous value. The categorical variables are
> > converted to binary variables, which results in a feature vector of size
> > 741092 (composed of all unique categories across p1, p2 and (p1,p2)).
> > Thus, the labeled point for each input record is a sparse vector of size
> > 741092 with only 3 variables set in the record. The total number of
> > records is 683233 after aggregating the
> > input data on (p1, p2). When attempting to train on the unaggregated
> records
> > (1337907 in number spread across 455 files), the execution fails at
> count,
> > GradientDescent.scala:161 with the following log
> >
> >
> > (Snipped lines corresponding to other input files)
> > 14/07/02 16:02:03 INFO HadoopRDD: Input split:
> > file:~/part-r-00012:2834590+2834590
> > 14/07/02 16:02:03 INFO HadoopRDD: Input split:
> file:~/part-r-00005:0+2845559
> > 14/07/02 16:02:03 INFO HadoopRDD: Input split:
> > file:~/part-r-00005:2845559+2845560
> > 14/07/02 16:02:03 INFO Executor: Serialized size of result for 726 is 615
> > 14/07/02 16:02:03 INFO Executor: Sending result for 726 directly to
> driver
> > 14/07/02 16:02:03 INFO Executor: Finished task ID 726
> > 14/07/02 16:02:12 ERROR CoarseGrainedExecutorBackend: Driver
> Disassociated
> > [akka.tcp://sparkExecutor@slave1:48423] -> [akka.tcp://spark@master
> :55792]
> > disassociated! Shutting down.
> >
> > A count() attempted on the input RDD before beginning training has the
> > following metrics:
> >
> >
> > Metric            Min        25th    Median    75th     Max
> >
> > Result
> > serialization
> > time            0 ms    0 ms    0 ms    0 ms    0 ms
> >
> > Duration        33 s    33 s    35 s    35 s    35 s
> >
> > Time spent
> > fetching task
> > results            0 ms    0 ms    0 ms    0 ms    0 ms
> >
> > Scheduler
> > delay            0.1 s    0.1 s    0.3 s    0.3 s    0.3 s
> >
> > Aggregated Metrics by Executor
> >
> > ID     Address Task             Time Total Failed Succeeded Shuffle Read
> > Shuffle Write     Shuf Spill (Mem)     Shuf Spill (Disk)
> > 0     CANNOT FIND ADDRESS     34 s     1     0         1         0.0 B
> > 0.0 B             0.0 B                 0.0 B
> > 1     CANNOT FIND ADDRESS     36 s     1     0         1         0.0 B
> > 0.0 B             0.0 B                 0.0 B
> >
> > Tasks
> >
> > Task Index    Task ID    Status    Locality Level    Executor    Launch
> Time
> > Duration    GC Time    Result Ser Time    Errors
> > 0     726     SUCCESS         PROCESS_LOCAL     slave1         2014/07/02
> > 16:01:28 35 s         0.1 s
> > 1     727     SUCCESS         PROCESS_LOCAL     slave2         2014/07/02
> > 16:01:28 33 s         99 ms
> >
> > Any pointers / diagnosis please?
> >
> >
> >
> >
> > On Thu, Jun 19, 2014 at 10:03 AM, Bharath Ravi Kumar <
> reachbach@gmail.com>
> > wrote:
> >>
> >> Thanks. I'll await the fix to re-run my test.
> >>
> >>
> >> On Thu, Jun 19, 2014 at 8:28 AM, Xiangrui Meng <me...@gmail.com>
> wrote:
> >>>
> >>> Hi Bharath,
> >>>
> >>> This is related to SPARK-1112, for which we already found the root cause.
> >>> I will let you know when this is fixed.
> >>>
> >>> Best,
> >>> Xiangrui
> >>>
> >>> On Tue, Jun 17, 2014 at 7:37 PM, Bharath Ravi Kumar <
> reachbach@gmail.com>
> >>> wrote:
> >>> > Couple more points:
> >>> > 1) The inexplicable stalling of execution with large feature sets
> >>> > appears
> >>> > similar to that reported with the news-20 dataset:
> >>> >
> >>> >
> http://mail-archives.apache.org/mod_mbox/spark-user/201406.mbox/%3C53A03542.1010605@gmail.com%3E
> >>> >
> >>> > 2) The NPE when trying to call mapToPair to convert an RDD<Long, Long,
> >>> > Integer, Integer> into a JavaPairRDD<Tuple2<Long,Long>,
> >>> > Tuple2<Integer,Integer>> is unrelated to mllib.
> >>> >
> >>> > Thanks,
> >>> > Bharath
> >>> >
> >>> >
> >>> >
> >>> > On Wed, Jun 18, 2014 at 7:14 AM, Bharath Ravi Kumar
> >>> > <re...@gmail.com>
> >>> > wrote:
> >>> >>
> >>> >> Hi  Xiangrui ,
> >>> >>
> >>> >> I'm using 1.0.0.
> >>> >>
> >>> >> Thanks,
> >>> >> Bharath
> >>> >>
> >>> >> On 18-Jun-2014 1:43 am, "Xiangrui Meng" <me...@gmail.com> wrote:
> >>> >>>
> >>> >>> Hi Bharath,
> >>> >>>
> >>> >>> Thanks for posting the details! Which Spark version are you using?
> >>> >>>
> >>> >>> Best,
> >>> >>> Xiangrui
> >>> >>>
> >>> >>> On Tue, Jun 17, 2014 at 6:48 AM, Bharath Ravi Kumar
> >>> >>> <re...@gmail.com>
> >>> >>> wrote:
> >>> >>> > Hi,
> >>> >>> >
> >>> >>> > (Apologies for the long mail, but it's necessary to provide
> >>> >>> > sufficient
> >>> >>> > details considering the number of issues faced.)
> >>> >>> >
> >>> >>> > I'm running into issues testing LogisticRegressionWithSGD on a
> >>> >>> > two node cluster (each node with 24 cores and 16G available to
> >>> >>> > slaves out of 24G on the system). Here's a description of the
> >>> >>> > application:
> >>> >>> >
> >>> >>> > The model is being trained based on categorical features x, y,
> and
> >>> >>> > (x,y).
> >>> >>> > The categorical features are mapped to binary features by
> >>> >>> > converting
> >>> >>> > each
> >>> >>> > distinct value in the category enum into a binary feature by
> itself
> >>> >>> > (i.e
> >>> >>> > presence of that value in a record implies corresponding feature
> =
> >>> >>> > 1,
> >>> >>> > else
> >>> >>> > feature = 0. So, there'd be as many distinct features as enum
> >>> >>> > values) .
> >>> >>> > The
> >>> >>> > training vector is laid out as
> >>> >>> > [x1,x2...xn,y1,y2....yn,(x1,y1),(x2,y2)...(xn,yn)]. Each record
> in
> >>> >>> > the
> >>> >>> > training data has only one combination (Xk,Yk) and a label
> >>> >>> > appearing in
> >>> >>> > the
> >>> >>> > record. Thus, the corresponding labeledpoint sparse vector would
> >>> >>> > only
> >>> >>> > have 3
> >>> >>> > values Xk, Yk, (Xk,Yk) set for a record. The total length of the
> >>> >>> > vector (though sparse) would be nearly 614000. The number of
> >>> >>> > records is about 1.33 million. The records have been coalesced
> >>> >>> > into 20 partitions across two nodes. The input data has not been
> >>> >>> > cached.
> >>> >>> > (NOTE: I do realize the records & features may seem large for a
> two
> >>> >>> > node
> >>> >>> > setup, but given the memory & cpu, and the fact that I'm willing
> to
> >>> >>> > give up
> >>> >>> > some turnaround time, I don't see why tasks should inexplicably
> >>> >>> > fail)
> >>> >>> >
> >>> >>> > Additional parameters include:
> >>> >>> >
> >>> >>> > spark.executor.memory = 14G
> >>> >>> > spark.default.parallelism = 1
> >>> >>> > spark.cores.max=20
> >>> >>> > spark.storage.memoryFraction=0.8 //No cache space required
> >>> >>> > (Trying to set spark.akka.frameSize to a larger number, say, 20
> >>> >>> > didn't
> >>> >>> > help
> >>> >>> > either)
> >>> >>> >
> >>> >>> > The model training was initialized as : new
> >>> >>> > LogisticRegressionWithSGD(1,
> >>> >>> > maxIterations, 0.0, 0.05)
> >>> >>> >
> >>> >>> > However, after 4 iterations of gradient descent, the entire
> >>> >>> > execution
> >>> >>> > appeared to stall inexplicably. The corresponding executor
> details
> >>> >>> > and
> >>> >>> > details of the stalled stage (number 14) are as follows:
> >>> >>> >
> >>> >>> > Metric                       Min      25th     Median   75th     Max
> >>> >>> > Result serialization time    12 ms    13 ms    14 ms    16 ms    18 ms
> >>> >>> > Duration                     4 s      4 s      5 s      5 s      5 s
> >>> >>> > Time spent fetching task     0 ms     0 ms     0 ms     0 ms     0 ms
> >>> >>> > results
> >>> >>> > Scheduler delay              6 s      6 s      6 s      6 s      12 s
> >>> >>> >
> >>> >>> >
> >>> >>> > Stage Id
> >>> >>> > 14 aggregate at GradientDescent.scala:178
> >>> >>> >
> >>> >>> > Task Index    Task ID    Status    Locality Level     Executor
> >>> >>> > Launch Time                Duration    GC     Result Ser Time
> >>> >>> > Errors
> >>> >>> >
> >>> >>> > Time
> >>> >>> >
> >>> >>> > 0     600     RUNNING     PROCESS_LOCAL
> >>> >>> > serious.dataone.foo.bar.com
> >>> >>> > 2014/06/17 10:32:27     1.1 h
> >>> >>> > 1     601     RUNNING     PROCESS_LOCAL
> >>> >>> > casual.dataone.foo.bar.com
> >>> >>> > 2014/06/17 10:32:27         1.1 h
> >>> >>> > 2     602     RUNNING     PROCESS_LOCAL
> >>> >>> > serious.dataone.foo.bar.com
> >>> >>> > 2014/06/17 10:32:27     1.1 h
> >>> >>> > 3     603     RUNNING     PROCESS_LOCAL
> >>> >>> > casual.dataone.foo.bar.com
> >>> >>> > 2014/06/17 10:32:27         1.1 h
> >>> >>> > 4     604     RUNNING     PROCESS_LOCAL
> >>> >>> > serious.dataone.foo.bar.com
> >>> >>> > 2014/06/17 10:32:27     1.1 h
> >>> >>> > 5     605     SUCCESS     PROCESS_LOCAL
> >>> >>> > casual.dataone.foo.bar.com
> >>> >>> > 2014/06/17 10:32:27 4 s     2 s     12 ms
> >>> >>> > 6     606     SUCCESS     PROCESS_LOCAL
> >>> >>> > serious.dataone.foo.bar.com
> >>> >>> > 2014/06/17 10:32:27     4 s     1 s     14 ms
> >>> >>> > 7     607     SUCCESS     PROCESS_LOCAL
> >>> >>> > casual.dataone.foo.bar.com
> >>> >>> > 2014/06/17 10:32:27 4 s     2 s     12 ms
> >>> >>> > 8     608     SUCCESS     PROCESS_LOCAL
> >>> >>> > serious.dataone.foo.bar.com
> >>> >>> > 2014/06/17 10:32:27     5 s     1 s     15 ms
> >>> >>> > 9     609     SUCCESS     PROCESS_LOCAL
> >>> >>> > casual.dataone.foo.bar.com
> >>> >>> > 2014/06/17 10:32:27 5 s     1 s     14 ms
> >>> >>> > 10     610     SUCCESS     PROCESS_LOCAL
> >>> >>> > serious.dataone.foo.bar.com
> >>> >>> > 2014/06/17 10:32:27     5 s     1 s     15 ms
> >>> >>> > 11     611     SUCCESS     PROCESS_LOCAL
> >>> >>> > casual.dataone.foo.bar.com
> >>> >>> > 2014/06/17 10:32:27 4 s     1 s     13 ms
> >>> >>> > 12     612     SUCCESS     PROCESS_LOCAL
> >>> >>> > serious.dataone.foo.bar.com
> >>> >>> > 2014/06/17 10:32:27     5 s     1 s     18 ms
> >>> >>> > 13     613     SUCCESS     PROCESS_LOCAL
> >>> >>> > casual.dataone.foo.bar.com
> >>> >>> > 2014/06/17 10:32:27 5 s     1 s     13 ms
> >>> >>> > 14     614     SUCCESS     PROCESS_LOCAL
> >>> >>> > serious.dataone.foo.bar.com
> >>> >>> > 2014/06/17 10:32:27     4 s     1 s     14 ms
> >>> >>> > 15     615     SUCCESS     PROCESS_LOCAL
> >>> >>> > casual.dataone.foo.bar.com
> >>> >>> > 2014/06/17 10:32:27 4 s     1 s     12 ms
> >>> >>> > 16     616     SUCCESS     PROCESS_LOCAL
> >>> >>> > serious.dataone.foo.bar.com
> >>> >>> > 2014/06/17 10:32:27     5 s     1 s     15 ms
> >>> >>> > 17     617     SUCCESS     PROCESS_LOCAL
> >>> >>> > casual.dataone.foo.bar.com
> >>> >>> > 2014/06/17 10:32:27 5 s     1 s     18 ms
> >>> >>> > 18     618     SUCCESS     PROCESS_LOCAL
> >>> >>> > serious.dataone.foo.bar.com
> >>> >>> > 2014/06/17 10:32:27     5 s     1 s     16 ms
> >>> >>> > 19     619     SUCCESS     PROCESS_LOCAL
> >>> >>> > casual.dataone.foo.bar.com
> >>> >>> > 2014/06/17 10:32:27 4 s     1 s     18 ms
> >>> >>> >
> >>> >>> > Executor stats:
> >>> >>> >
> >>> >>> > RDD Blocks    Memory Used    Disk Used    Active Tasks    Failed
> >>> >>> > Tasks
> >>> >>> > Complete Tasks    Total Tasks    Task Time    Shuffle Read
> >>> >>> > Shuffle
> >>> >>> > Write
> >>> >>> > 0     0.0 B / 6.7 GB         0.0 B         2                 0
> >>> >>> > 307         309         23.2 m         0.0 B             0.0 B
> >>> >>> > 0     0.0 B / 6.7 GB         0.0 B         3                 0
> >>> >>> > 308         311         22.4 m         0.0 B             0.0 B
> >>> >>> >
> >>> >>> >
> >>> >>> > Executor jmap output:
> >>> >>> >
> >>> >>> > Server compiler detected.
> >>> >>> > JVM version is 24.55-b03
> >>> >>> >
> >>> >>> > using thread-local object allocation.
> >>> >>> > Parallel GC with 18 thread(s)
> >>> >>> >
> >>> >>> > Heap Configuration:
> >>> >>> >    MinHeapFreeRatio = 40
> >>> >>> >    MaxHeapFreeRatio = 70
> >>> >>> >    MaxHeapSize      = 10737418240 (10240.0MB)
> >>> >>> >    NewSize          = 1310720 (1.25MB)
> >>> >>> >    MaxNewSize       = 17592186044415 MB
> >>> >>> >    OldSize          = 5439488 (5.1875MB)
> >>> >>> >    NewRatio         = 2
> >>> >>> >    SurvivorRatio    = 8
> >>> >>> >    PermSize         = 21757952 (20.75MB)
> >>> >>> >    MaxPermSize      = 134217728 (128.0MB)
> >>> >>> >    G1HeapRegionSize = 0 (0.0MB)
> >>> >>> >
> >>> >>> > Heap Usage:
> >>> >>> > PS Young Generation
> >>> >>> > Eden Space:
> >>> >>> >    capacity = 2783969280 (2655.0MB)
> >>> >>> >    used     = 192583816 (183.66223907470703MB)
> >>> >>> >    free     = 2591385464 (2471.337760925293MB)
> >>> >>> >    6.917598458557704% used
> >>> >>> > From Space:
> >>> >>> >    capacity = 409993216 (391.0MB)
> >>> >>> >    used     = 1179808 (1.125152587890625MB)
> >>> >>> >    free     = 408813408 (389.8748474121094MB)
> >>> >>> >    0.2877628102022059% used
> >>> >>> > To Space:
> >>> >>> >    capacity = 385351680 (367.5MB)
> >>> >>> >    used     = 0 (0.0MB)
> >>> >>> >    free     = 385351680 (367.5MB)
> >>> >>> >    0.0% used
> >>> >>> > PS Old Generation
> >>> >>> >    capacity = 7158628352 (6827.0MB)
> >>> >>> >    used     = 4455093024 (4248.707794189453MB)
> >>> >>> >    free     = 2703535328 (2578.292205810547MB)
> >>> >>> >    62.2338918146983% used
> >>> >>> > PS Perm Generation
> >>> >>> >    capacity = 90701824 (86.5MB)
> >>> >>> >    used     = 45348832 (43.248016357421875MB)
> >>> >>> >    free     = 45352992 (43.251983642578125MB)
> >>> >>> >    49.99770677158598% used
> >>> >>> >
> >>> >>> > 8432 interned Strings occupying 714672 bytes.
> >>> >>> >
> >>> >>> >
> >>> >>> > Executor GC log snippet:
> >>> >>> >
> >>> >>> > 168.778: [GC [PSYoungGen: 2702831K->578545K(2916864K)]
> >>> >>> > 9302453K->7460857K(9907712K), 0.3193550 secs] [Times: user=5.13
> >>> >>> > sys=0.39,
> >>> >>> > real=0.32 secs]
> >>> >>> > 169.097: [Full GC [PSYoungGen: 578545K->0K(2916864K)] [ParOldGen:
> >>> >>> > 6882312K->1073297K(6990848K)] 7460857K->1073297K(9907712K)
> >>> >>> > [PSPermGen:
> >>> >>> > 44248K->44201K(88576K)], 4.5521090 secs] [Times: user=24.22
> >>> >>> > sys=0.18,
> >>> >>> > real=4.55 secs]
> >>> >>> > 174.207: [GC [PSYoungGen: 2338304K->81315K(2544128K)]
> >>> >>> > 3411653K->1154665K(9534976K), 0.0966280 secs] [Times: user=1.66
> >>> >>> > sys=0.00,
> >>> >>> > real=0.09 secs]
> >>> >>> >
> >>> >>> > I tried to map partitions to cores on the nodes. Increasing the
> >>> >>> > number of partitions (say to 80 or 100) would result in progress
> >>> >>> > till the 6th iteration or so, but the next stage would stall as
> >>> >>> > before with no apparent root cause in the logs. With increased
> >>> >>> > partitions, the last stage that completed had the following task
> >>> >>> > times:
> >>> >>> >
> >>> >>> > Metric                       Min      25th     Median   75th     Max
> >>> >>> > Result serialization time    11 ms    12 ms    13 ms    15 ms    0.4 s
> >>> >>> > Duration                     0.5 s    0.9 s    1 s      3 s      7 s
> >>> >>> > Time spent fetching          0 ms     0 ms     0 ms     0 ms     0 ms
> >>> >>> > task results
> >>> >>> > Scheduler delay              5 s      6 s      6 s      7 s      12 s
> >>> >>> >
> >>> >>> > My hypothesis is that as the coefficient array becomes less
> >>> >>> > sparse (with successive iterations), the cost of the aggregate
> >>> >>> > goes up to the point that it stalls (which I was unable to
> >>> >>> > explain). Reducing the batch fraction to a very low number like
> >>> >>> > 0.01 allowed the iterations to progress further, but the model
> >>> >>> > then failed to converge after a small number of iterations.
> >>> >>> >
> >>> >>> >
> >>> >>> > I also tried reducing the number of records by aggregating on
> >>> >>> > (x,y) as the key (i.e. using aggregations instead of training on
> >>> >>> > every raw record), but encountered the following exception:
> >>> >>> >
> >>> >>> > Loss was due to java.lang.NullPointerException
> >>> >>> > java.lang.NullPointerException
> >>> >>> >         at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
> >>> >>> >         at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
> >>> >>> >         at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> >>> >>> >         at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:59)
> >>> >>> >         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
> >>> >>> >         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
> >>> >>> >         at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
> >>> >>> >         at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
> >>> >>> >         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> >>> >>> >         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> >>> >>> >         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> >>> >>> >         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
> >>> >>> >         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> >>> >>> >         at org.apache.spark.scheduler.Task.run(Task.scala:51)
> >>> >>> >         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
> >>> >>> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >>> >>> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >>> >>> >         at java.lang.Thread.run(Thread.java:745)
> >>> >>> >
> >>> >>> >
> >>> >>> > I'd appreciate any insights/comments about what may be causing
> >>> >>> > the execution to stall.
> >>> >>> >
> >>> >>> > If logs/tables appear poorly indented in the email, here's a gist
> >>> >>> > with relevant details:
> >>> >>> > https://gist.github.com/reachbach/a418ab2f01b639b624c1
> >>> >>> >
> >>> >>> > Thanks,
> >>> >>> > Bharath
> >>> >
> >>> >
> >>
> >>
> >
>

Re: Execution stalls in LogisticRegressionWithSGD

Posted by Xiangrui Meng <me...@gmail.com>.
Could you check the driver memory in the executor tab of the Spark UI
when the job is running? If it is too small, please set
--driver-memory with spark-submit, e.g. 10g. Could you also attach the
master log under spark/logs as well? -Xiangrui
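For example, an invocation along these lines (the class and jar names are
placeholders):

    spark-submit --class com.example.LogRegExp --master spark://master:7077 \
      --driver-memory 10g logregexp.jar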

On Wed, Jul 2, 2014 at 9:34 AM, Bharath Ravi Kumar <re...@gmail.com> wrote:
> Hi Xiangrui,
>
> The issue with aggregating/counting over large feature vectors (as part of
> LogisticRegressionWithSGD) continues to exist, but now in another form:
> while the execution doesn't freeze (due to SPARK-1112), it now fails at the
> second or third gradient descent iteration consistently with an error level
> log message, but no stacktrace. I'm running against 1.0.1-rc1, and have
> tried setting spark.akka.frameSize as high as 500. When the execution fails,
> each of the two executors log the following message (corresponding to
> aggregate at GradientDescent.scala:178) :
>
> 14/07/02 14:09:09 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
> maxBytesInFlight: 50331648, targetRequestSize: 10066329
> 14/07/02 14:09:09 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
> Getting 2 non-empty blocks out of 2 blocks
> 14/07/02 14:09:09 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
> Started 1 remote fetches in 0 ms
> 14/07/02 14:09:11 INFO Executor: Serialized size of result for 737 is
> 5959086
> 14/07/02 14:09:11 INFO Executor: Sending result for 737 directly to driver
> 14/07/02 14:09:11 INFO Executor: Finished task ID 737
> 14/07/02 14:09:18 ERROR CoarseGrainedExecutorBackend: Driver Disassociated
> [akka.tcp://sparkExecutor@(slave1,slave2):51941] ->
> [akka.tcp://spark@master:59487] disassociated! Shutting down.
>
>
> There is no separate stacktrace on the driver side.
>
> Each input record is of the form (p1, p2, (p1,p2), X), where p1, p2 & (p1,p2)
> are categorical features with large cardinality, and X is the label, a
> continuous double value. The categorical variables are converted to binary
> variables which results in a feature vector of size 741092 (composed of all
> unique categories across p1, p2 and (p1,p2)). Thus, the labeled point for
> input record is a sparse vector of size 741092 with only 3 variables set in
> the record. The total number of records is 683233 after aggregating the
> input data on (p1, p2). When attempting to train on the unaggregated records
> (1337907 in number, spread across 455 files), the execution fails at the
> count at GradientDescent.scala:161 with the following log:
>
>
> (Snipped lines corresponding to other input files)
> 14/07/02 16:02:03 INFO HadoopRDD: Input split:
> file:~/part-r-00012:2834590+2834590
> 14/07/02 16:02:03 INFO HadoopRDD: Input split: file:~/part-r-00005:0+2845559
> 14/07/02 16:02:03 INFO HadoopRDD: Input split:
> file:~/part-r-00005:2845559+2845560
> 14/07/02 16:02:03 INFO Executor: Serialized size of result for 726 is 615
> 14/07/02 16:02:03 INFO Executor: Sending result for 726 directly to driver
> 14/07/02 16:02:03 INFO Executor: Finished task ID 726
> 14/07/02 16:02:12 ERROR CoarseGrainedExecutorBackend: Driver Disassociated
> [akka.tcp://sparkExecutor@slave1:48423] -> [akka.tcp://spark@master:55792]
> disassociated! Shutting down.
>
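
To make the encoding described above concrete, here is a minimal Scala
sketch (the index arguments and helper name are hypothetical; they stand
in for whatever mapping assigns each category value a column):

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.LabeledPoint

  // i1, i2, i12: precomputed column indices (assumed) for this record's
  // p1 value, p2 value and (p1,p2) combination in the 741092-wide space.
  def toLabeledPoint(label: Double, i1: Int, i2: Int, i12: Int): LabeledPoint =
    LabeledPoint(label,
      // sorting the indices is safe here because all values are 1.0
      Vectors.sparse(741092, Array(i1, i2, i12).sorted, Array(1.0, 1.0, 1.0)))

Note that while the input vectors are sparse, the gradient each task
returns is dense: 741092 doubles * 8 bytes ~ 5.93 MB, which lines up
with the "Serialized size of result for 737 is 5959086" log line above.
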
> A count() attempted on the input RDD before beginning training has the
> following metrics:
>
>
> Metric                       Min     25th    Median  75th    Max
>
> Result serialization time    0 ms    0 ms    0 ms    0 ms    0 ms
> Duration                     33 s    33 s    35 s    35 s    35 s
> Time spent fetching
> task results                 0 ms    0 ms    0 ms    0 ms    0 ms
> Scheduler delay              0.1 s   0.1 s   0.3 s   0.3 s   0.3 s
>
> Aggregated Metrics by Executor
>
> ID  Address              Task Time  Total  Failed  Succeeded  Shuffle Read  Shuffle Write  Shuf Spill (Mem)  Shuf Spill (Disk)
> 0   CANNOT FIND ADDRESS  34 s       1      0       1          0.0 B         0.0 B          0.0 B             0.0 B
> 1   CANNOT FIND ADDRESS  36 s       1      0       1          0.0 B         0.0 B          0.0 B             0.0 B
>
> Tasks
>
> Task Index  Task ID  Status   Locality Level  Executor  Launch Time          Duration  GC Time  Result Ser Time  Errors
> 0           726      SUCCESS  PROCESS_LOCAL   slave1    2014/07/02 16:01:28  35 s      0.1 s
> 1           727      SUCCESS  PROCESS_LOCAL   slave2    2014/07/02 16:01:28  33 s      99 ms
>
> Any pointers / diagnosis please?
>
>
>
>
> On Thu, Jun 19, 2014 at 10:03 AM, Bharath Ravi Kumar <re...@gmail.com>
> wrote:
>>
>> Thanks. I'll await the fix to re-run my test.
>>
>>
>> On Thu, Jun 19, 2014 at 8:28 AM, Xiangrui Meng <me...@gmail.com> wrote:
>>>
>>> Hi Bharath,
>>>
>>> This is related to SPARK-1112, which we already found the root cause.
>>> I will let you know when this is fixed.
>>>
>>> Best,
>>> Xiangrui
>>>
>>> On Tue, Jun 17, 2014 at 7:37 PM, Bharath Ravi Kumar <re...@gmail.com>
>>> wrote:
>>> > Couple more points:
>>> > 1)The inexplicable stalling of execution with large feature sets
>>> > appears
>>> > similar to that reported with the news-20 dataset:
>>> >
>>> > http://mail-archives.apache.org/mod_mbox/spark-user/201406.mbox/%3C53A03542.1010605@gmail.com%3E
>>> >
> >>> > 2) The NPE when calling mapToPair to convert an RDD of (Long, Long,
> >>> > Integer, Integer) records into a JavaPairRDD<Tuple2<Long,Long>,
> >>> > Tuple2<Integer,Integer>> is unrelated to mllib.
>>> >
>>> > Thanks,
>>> > Bharath
>>> >
>>> >
>>> >
>>> > On Wed, Jun 18, 2014 at 7:14 AM, Bharath Ravi Kumar
>>> > <re...@gmail.com>
>>> > wrote:
>>> >>
> >>> >> Hi Xiangrui,
>>> >>
>>> >> I'm using 1.0.0.
>>> >>
>>> >> Thanks,
>>> >> Bharath
>>> >>
>>> >> On 18-Jun-2014 1:43 am, "Xiangrui Meng" <me...@gmail.com> wrote:
>>> >>>
>>> >>> Hi Bharath,
>>> >>>
>>> >>> Thanks for posting the details! Which Spark version are you using?
>>> >>>
>>> >>> Best,
>>> >>> Xiangrui
>>> >>>
>>> >>> On Tue, Jun 17, 2014 at 6:48 AM, Bharath Ravi Kumar
>>> >>> <re...@gmail.com>
>>> >>> wrote:
>>> >>> > Hi,
>>> >>> >
>>> >>> > (Apologies for the long mail, but it's necessary to provide
>>> >>> > sufficient
>>> >>> > details considering the number of issues faced.)
>>> >>> >
> >>> >>> > I'm running into issues testing LogisticRegressionWithSGD on a
> >>> >>> > two node cluster (each node with 24 cores and 16G available to
> >>> >>> > slaves out of 24G on the system). Here's a description of the
> >>> >>> > application:
>>> >>> >
> >>> >>> > The model is being trained based on categorical features x, y,
> >>> >>> > and (x,y). The categorical features are mapped to binary features
> >>> >>> > by converting each distinct value in the category enum into a
> >>> >>> > binary feature by itself (i.e. presence of that value in a record
> >>> >>> > implies corresponding feature = 1, else feature = 0. So, there'd
> >>> >>> > be as many distinct features as enum values). The training vector
> >>> >>> > is laid out as
> >>> >>> > [x1,x2...xn,y1,y2....yn,(x1,y1),(x2,y2)...(xn,yn)]. Each record
> >>> >>> > in the training data has only one combination (Xk,Yk) and a label
> >>> >>> > appearing in the record. Thus, the corresponding labeled point
> >>> >>> > sparse vector would only have 3 values Xk, Yk, (Xk,Yk) set for a
> >>> >>> > record. The total length of the vector (though sparse) would be
> >>> >>> > nearly 614000. The number of records is about 1.33 million. The
> >>> >>> > records have been coalesced into 20 partitions across two nodes.
> >>> >>> > The input data has not been cached.
> >>> >>> > (NOTE: I do realize the records & features may seem large for a
> >>> >>> > two node setup, but given the memory & cpu, and the fact that I'm
> >>> >>> > willing to give up some turnaround time, I don't see why tasks
> >>> >>> > should inexplicably fail)
>>> >>> >
>>> >>> > Additional parameters include:
>>> >>> >
>>> >>> > spark.executor.memory = 14G
>>> >>> > spark.default.parallelism = 1
>>> >>> > spark.cores.max=20
>>> >>> > spark.storage.memoryFraction=0.8 //No cache space required
> >>> >>> > (Trying to set spark.akka.frameSize to a larger number, say, 20
> >>> >>> > didn't help either)
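
As a hedged sketch, these settings could be applied programmatically as
below (they may equally come from spark-defaults.conf; the app name and
context setup are placeholders, not from this thread):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("lr-training")                   // placeholder name
    .set("spark.executor.memory", "14g")
    .set("spark.default.parallelism", "1")
    .set("spark.cores.max", "20")
    .set("spark.storage.memoryFraction", "0.8")  // no cache space required
  val sc = new SparkContext(conf)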
>>> >>> >
> >>> >>> > The model training was initialized as: new
>>> >>> > LogisticRegressionWithSGD(1,
>>> >>> > maxIterations, 0.0, 0.05)
>>> >>> >
>>> >>> > However, after 4 iterations of gradient descent, the entire
>>> >>> > execution
>>> >>> > appeared to stall inexplicably. The corresponding executor details
>>> >>> > and
>>> >>> > details of the stalled stage (number 14) are as follows:
>>> >>> >
> >>> >>> > Metric                       Min    25th   Median  75th   Max
> >>> >>> > Result serialization time    12 ms  13 ms  14 ms   16 ms  18 ms
> >>> >>> > Duration                     4 s    4 s    5 s     5 s    5 s
> >>> >>> > Time spent fetching
> >>> >>> > task results                 0 ms   0 ms   0 ms    0 ms   0 ms
> >>> >>> > Scheduler delay              6 s    6 s    6 s     6 s    12 s
>>> >>> >
>>> >>> >
>>> >>> > Stage Id
>>> >>> > 14 aggregate at GradientDescent.scala:178
>>> >>> >
> >>> >>> > Task Index  Task ID  Status   Locality Level  Executor                     Launch Time          Duration  GC Time  Result Ser Time  Errors
> >>> >>> > 0           600      RUNNING  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  1.1 h
> >>> >>> > 1           601      RUNNING  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  1.1 h
> >>> >>> > 2           602      RUNNING  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  1.1 h
> >>> >>> > 3           603      RUNNING  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  1.1 h
> >>> >>> > 4           604      RUNNING  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  1.1 h
> >>> >>> > 5           605      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  4 s       2 s      12 ms
> >>> >>> > 6           606      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  4 s       1 s      14 ms
> >>> >>> > 7           607      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  4 s       2 s      12 ms
> >>> >>> > 8           608      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  5 s       1 s      15 ms
> >>> >>> > 9           609      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  5 s       1 s      14 ms
> >>> >>> > 10          610      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  5 s       1 s      15 ms
> >>> >>> > 11          611      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  4 s       1 s      13 ms
> >>> >>> > 12          612      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  5 s       1 s      18 ms
> >>> >>> > 13          613      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  5 s       1 s      13 ms
> >>> >>> > 14          614      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  4 s       1 s      14 ms
> >>> >>> > 15          615      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  4 s       1 s      12 ms
> >>> >>> > 16          616      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  5 s       1 s      15 ms
> >>> >>> > 17          617      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  5 s       1 s      18 ms
> >>> >>> > 18          618      SUCCESS  PROCESS_LOCAL   serious.dataone.foo.bar.com  2014/06/17 10:32:27  5 s       1 s      16 ms
> >>> >>> > 19          619      SUCCESS  PROCESS_LOCAL   casual.dataone.foo.bar.com   2014/06/17 10:32:27  4 s       1 s      18 ms
>>> >>> >
>>> >>> > Executor stats:
>>> >>> >
> >>> >>> > RDD Blocks  Memory Used     Disk Used  Active Tasks  Failed Tasks  Complete Tasks  Total Tasks  Task Time  Shuffle Read  Shuffle Write
> >>> >>> > 0           0.0 B / 6.7 GB  0.0 B      2             0             307             309          23.2 m     0.0 B         0.0 B
> >>> >>> > 0           0.0 B / 6.7 GB  0.0 B      3             0             308             311          22.4 m     0.0 B         0.0 B
>>> >>> >
>>> >>> >
>>> >>> > Executor jmap output:
>>> >>> >
>>> >>> > Server compiler detected.
>>> >>> > JVM version is 24.55-b03
>>> >>> >
>>> >>> > using thread-local object allocation.
>>> >>> > Parallel GC with 18 thread(s)
>>> >>> >
>>> >>> > Heap Configuration:
>>> >>> >    MinHeapFreeRatio = 40
>>> >>> >    MaxHeapFreeRatio = 70
>>> >>> >    MaxHeapSize      = 10737418240 (10240.0MB)
>>> >>> >    NewSize          = 1310720 (1.25MB)
>>> >>> >    MaxNewSize       = 17592186044415 MB
>>> >>> >    OldSize          = 5439488 (5.1875MB)
>>> >>> >    NewRatio         = 2
>>> >>> >    SurvivorRatio    = 8
>>> >>> >    PermSize         = 21757952 (20.75MB)
>>> >>> >    MaxPermSize      = 134217728 (128.0MB)
>>> >>> >    G1HeapRegionSize = 0 (0.0MB)
>>> >>> >
>>> >>> > Heap Usage:
>>> >>> > PS Young Generation
>>> >>> > Eden Space:
>>> >>> >    capacity = 2783969280 (2655.0MB)
>>> >>> >    used     = 192583816 (183.66223907470703MB)
>>> >>> >    free     = 2591385464 (2471.337760925293MB)
>>> >>> >    6.917598458557704% used
>>> >>> > From Space:
>>> >>> >    capacity = 409993216 (391.0MB)
>>> >>> >    used     = 1179808 (1.125152587890625MB)
>>> >>> >    free     = 408813408 (389.8748474121094MB)
>>> >>> >    0.2877628102022059% used
>>> >>> > To Space:
>>> >>> >    capacity = 385351680 (367.5MB)
>>> >>> >    used     = 0 (0.0MB)
>>> >>> >    free     = 385351680 (367.5MB)
>>> >>> >    0.0% used
>>> >>> > PS Old Generation
>>> >>> >    capacity = 7158628352 (6827.0MB)
>>> >>> >    used     = 4455093024 (4248.707794189453MB)
>>> >>> >    free     = 2703535328 (2578.292205810547MB)
>>> >>> >    62.2338918146983% used
>>> >>> > PS Perm Generation
>>> >>> >    capacity = 90701824 (86.5MB)
>>> >>> >    used     = 45348832 (43.248016357421875MB)
>>> >>> >    free     = 45352992 (43.251983642578125MB)
>>> >>> >    49.99770677158598% used
>>> >>> >
>>> >>> > 8432 interned Strings occupying 714672 bytes.
>>> >>> >
>>> >>> >
>>> >>> > Executor GC log snippet:
>>> >>> >
>>> >>> > 168.778: [GC [PSYoungGen: 2702831K->578545K(2916864K)]
>>> >>> > 9302453K->7460857K(9907712K), 0.3193550 secs] [Times: user=5.13
>>> >>> > sys=0.39,
>>> >>> > real=0.32 secs]
>>> >>> > 169.097: [Full GC [PSYoungGen: 578545K->0K(2916864K)] [ParOldGen:
>>> >>> > 6882312K->1073297K(6990848K)] 7460857K->1073297K(9907712K)
>>> >>> > [PSPermGen:
>>> >>> > 44248K->44201K(88576K)], 4.5521090 secs] [Times: user=24.22
>>> >>> > sys=0.18,
>>> >>> > real=4.55 secs]
>>> >>> > 174.207: [GC [PSYoungGen: 2338304K->81315K(2544128K)]
>>> >>> > 3411653K->1154665K(9534976K), 0.0966280 secs] [Times: user=1.66
>>> >>> > sys=0.00,
>>> >>> > real=0.09 secs]
>>> >>> >
> >>> >>> > I tried to map partitions to cores on the nodes. Increasing the
> >>> >>> > number of partitions (say to 80 or 100) would result in progress
> >>> >>> > till the 6th iteration or so, but the next stage would stall as
> >>> >>> > before, with no apparent root cause in the logs. With increased
> >>> >>> > partitions, the last stage that completed had the following task
> >>> >>> > times:
>>> >>> >
> >>> >>> > Metric                       Min    25th   Median  75th   Max
> >>> >>> > Result serialization time    11 ms  12 ms  13 ms   15 ms  0.4 s
> >>> >>> > Duration                     0.5 s  0.9 s  1 s     3 s    7 s
> >>> >>> > Time spent fetching
> >>> >>> > task results                 0 ms   0 ms   0 ms    0 ms   0 ms
> >>> >>> > Scheduler delay              5 s    6 s    6 s     7 s    12 s
>>> >>> >
> >>> >>> > My hypothesis is that as the coefficient array becomes less
> >>> >>> > sparse (with successive iterations), the cost of the aggregate
> >>> >>> > goes up to the point that it stalls (though I'm unable to explain
> >>> >>> > why). Reducing the batch fraction to a very low number like 0.01
> >>> >>> > saw the iterations progress further, but the model failed to
> >>> >>> > converge in that case after a small number of iterations.
>>> >>> >
>>> >>> >
> >>> >>> > I also tried reducing the number of records by aggregating on
> >>> >>> > (x,y) as the key (i.e. using aggregations instead of training on
> >>> >>> > every raw record), but encountered the following exception:
>>> >>> >
> >>> >>> > Loss was due to java.lang.NullPointerException
> >>> >>> > java.lang.NullPointerException
> >>> >>> >         at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
> >>> >>> >         at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
> >>> >>> >         at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> >>> >>> >         at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:59)
> >>> >>> >         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
> >>> >>> >         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
> >>> >>> >         at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
> >>> >>> >         at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
> >>> >>> >         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> >>> >>> >         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> >>> >>> >         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> >>> >>> >         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
> >>> >>> >         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> >>> >>> >         at org.apache.spark.scheduler.Task.run(Task.scala:51)
> >>> >>> >         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
> >>> >>> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >>> >>> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >>> >>> >         at java.lang.Thread.run(Thread.java:745)
>>> >>> >
>>> >>> >
> >>> >>> > I'd appreciate any insights/comments about what may be causing
> >>> >>> > the execution to stall.
> >>> >>> >
> >>> >>> > If logs/tables appear poorly indented in the email, here's a
> >>> >>> > gist with relevant details:
>>> >>> > https://gist.github.com/reachbach/a418ab2f01b639b624c1
>>> >>> >
>>> >>> > Thanks,
>>> >>> > Bharath
>>> >
>>> >
>>
>>
>