Posted to issues@spark.apache.org by "Mingda Jia (Jira)" <ji...@apache.org> on 2020/02/21 08:30:00 UTC

[jira] [Created] (SPARK-30916) Dead Lock when Loading BLAS Class

Mingda Jia created SPARK-30916:
----------------------------------

             Summary: Dead Lock when Loading BLAS Class
                 Key: SPARK-30916
                 URL: https://issues.apache.org/jira/browse/SPARK-30916
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 2.4.5, 2.3.0
            Reporter: Mingda Jia


When using operations such as aggregate and treeAggregate, if one of seqOp and combOp calls a level 1 BLAS operation and the other calls a level 2 BLAS operation, the JVM can run into an internal deadlock that is hard to detect.

 

Say the seqOp runs gemv, which is a level 2 BLAS operation, and the combOp runs axpy, which is a level 1 BLAS operation. When a task running seqOp meets another task running combOp, the two task threads get stuck. The call stacks look like this:

!image-2020-02-21-15-52-49-846.png!

!image-2020-02-21-15-53-22-870.png!

Both threads are reported as runnable, but neither is actually making progress.

 

When gemv is called and no BLAS instance exists yet, it calls the getInstance method to obtain one. The first thread to enter runs the static code block of BLAS.scala, which tries to load a subclass of BLAS and instantiate it through reflection.

!image-2020-02-21-16-00-40-552.png!

 

When axpy is called and no BLAS instance exists yet, it creates a new F2jBLAS instance directly, because axpy is a level 1 BLAS operation.

!image-2020-02-21-16-02-25-136.png!

 

The problem is that the classes NativeSystemBLAS, NativeRefBLAS and F2jBLAS, which BLAS tries to load in its static code block, are all either subclasses of F2jBLAS or F2jBLAS itself. The class-loading sequence in the static code block of BLAS is:
 # try to load class BLAS -> lock class BLAS
 # try to load class NativeSystemBLAS in the static code block -> lock class NativeSystemBLAS
 # recursively load F2jBLAS, because it is the parent class of NativeSystemBLAS -> lock class F2jBLAS
 # ......

Simultaneously, the sequence for creating a new F2jBLAS in the axpy operation is:
 # try to load class F2jBLAS -> lock class F2jBLAS
 # recursively load BLAS, because it is the parent class of F2jBLAS -> lock class BLAS
 # ......

Suppose the task thread running gemv has just finished its second step above, and the task thread running axpy has just finished its first step. The gemv thread then needs to load class F2jBLAS, which is locked by the axpy thread, while the axpy thread needs to load class BLAS, which is locked by the gemv thread. Neither thread can proceed, so a deadlock results.
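The interleaving described above can be forced in a small standalone sketch. This is only an illustration under stated assumptions, not Spark or netlib code: the names Base, Child, baseInitStarted, pause and reproduce are hypothetical, and the 500 ms sleep is an assumed delay that gives the second thread time to start initializing the subclass while the first thread is still inside the parent's static block. Daemon threads are used so that the JVM can still exit.

```java
import java.util.concurrent.CountDownLatch;

public class InitDeadlockSketch {
    // Released once the "native" thread is inside Base's static initializer.
    static final CountDownLatch baseInitStarted = new CountDownLatch(1);

    // Stands in for BLAS: its static block instantiates a subclass.
    static abstract class Base {
        static Base instance;
        static {
            baseInitStarted.countDown();  // Base initialization is now in progress
            pause(500);                   // assumed delay so the other thread starts Child first
            instance = new Child();       // needs Child, whose initialization the other thread owns
        }
        static Base getInstance() { return instance; }
    }

    // Stands in for F2jBLAS: initializing it first requires initializing Base.
    static class Child extends Base { }

    static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Returns true if both threads are still stuck after the timeout.
    public static boolean reproduce() {
        Thread t1 = new Thread(() -> Base.getInstance(), "native");
        Thread t2 = new Thread(() -> {
            try {
                baseInitStarted.await();  // wait until t1 is inside Base's static block
            } catch (InterruptedException e) {
                return;
            }
            new Child();                  // starts Child init, then blocks waiting for Base
        }, "f2j");
        t1.setDaemon(true);
        t2.setDaemon(true);
        t1.start();
        t2.start();
        try {
            t1.join(1500);
            t2.join(1500);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        // The reported state may differ between JVM versions, so it is only printed.
        System.out.println(t1.getName() + ": " + t1.getState() + ", alive=" + t1.isAlive());
        System.out.println(t2.getName() + ": " + t2.getState() + ", alive=" + t2.isAlive());
        return t1.isAlive() && t2.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("deadlocked: " + reproduce());
    }
}
```

Each thread owns the in-progress initialization of one class and waits for the class the other thread is initializing, which is the cycle described in the two step lists.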

 

A demo that can reproduce the problem:
{code:java}
class Demo {
    public static void main(String[] args) {
        Thread t1 = new Thread(new Runnable() {
            @Override
            public void run() {
                BLAS blas = BLAS.getInstance();
                blas.print();
            }
        });
        Thread t2 = new Thread(new Runnable() {
            @Override
            public void run() {
                BLAS blas = new F2jBLAS();
                blas.print();
            }
        });
        t1.setName("native");
        t2.setName("f2j");
        t1.start();
        t2.start();
    }
}

abstract class BLAS {
    public static BLAS instance;
    abstract public void print();
    public static BLAS getInstance() {
        return instance;
    }
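    // Load the "native" implementation reflectively from the static block below.
    private static BLAS load() throws Exception {
        Class<?> klass = Class.forName("NativeSystemBlas");
        return (BLAS) klass.getDeclaredConstructor().newInstance();
    }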
    static {
        System.out.println("Entered static code block" );
        try {
            instance = load();
        } catch (Exception e) {
            System.out.println("error");
        }
    }
}

class F2jBLAS extends BLAS{
    @Override
    public void print() {
        System.out.println("print F2j");
    }
}

class NativeSystemBlas extends F2jBLAS {
    @Override
    public void print(){
        System.out.println("print NativeBlas");
    }
}

{code}
If the BLAS operations in Spark MLlib did not use F2jBLAS directly for level 1 operations, and instead obtained their instance the same way as the native BLAS, this problem would not occur.
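Another way to break the cycle, shown here only as a hypothetical sketch and not as the fix adopted by Spark or netlib, is to keep the parent class free of any static initializer that touches its subclasses and to move the reflective loading into a separate holder class. All names below (HolderSketch, Blas, F2jBlas, NativeBlas, Holder, bothThreadsFinish) are made up for illustration. With this layout, initialization only ever proceeds from subclass to parent, and the parent never waits on a subclass.

```java
public class HolderSketch {
    // Parent class: no static block, so initializing it never touches a subclass.
    static abstract class Blas {
        abstract String name();
        static Blas getInstance() { return Holder.INSTANCE; }
    }

    static class F2jBlas extends Blas {
        @Override String name() { return "f2j"; }
    }

    static class NativeBlas extends F2jBlas {
        @Override String name() { return "native"; }
    }

    // The reflective loading lives here instead of in Blas's static initializer.
    static final class Holder {
        static final Blas INSTANCE = load();

        private static Blas load() {
            try {
                Class<?> klass = Class.forName(NativeBlas.class.getName());
                return (Blas) klass.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                return new F2jBlas();  // fall back to the pure-Java implementation
            }
        }
    }

    // Runs both access paths concurrently and reports whether both threads completed.
    public static boolean bothThreadsFinish() {
        Thread t1 = new Thread(() -> Blas.getInstance().name(), "native");
        Thread t2 = new Thread(() -> new F2jBlas().name(), "f2j");
        t1.setDaemon(true);
        t2.setDaemon(true);
        t1.start();
        t2.start();
        try {
            t1.join(2000);
            t2.join(2000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !t1.isAlive() && !t2.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("both threads finished: " + bothThreadsFinish());
    }
}
```

The design choice is that Holder depends on the subclasses, but no subclass depends on Holder, so the two threads can no longer wait on each other's class initialization.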


