Posted to user@spark.apache.org by Borislav Iordanov <bi...@liquidoperations.com> on 2016/01/20 19:31:19 UTC

How to debug join operations on a cluster.

Hi, 

I'm reading data from HBase using the latest (2.0.0-SNAPSHOT) HBase-Spark integration module. HBase is deployed on a cluster of 3 machines, and Spark is deployed as a standalone cluster on the same machines.

I am doing a join between two JavaPairRDDs that are constructed from two separate HBase tables. Each RDD is obtained from an HBase table scan and then transformed into a pair RDD keyed by the table's row key.

When I run my Spark program as a standalone process, either on my development machine or on one of the cluster machines, the join returns a correct, non-empty result. When I submit the exact same program to the Spark cluster, the join comes out empty. In both cases I'm connecting to the Spark master on the cluster.  In summary: 

1) mvn exec:java <my program>  prints out correct non-empty join 
2) spark-submit --deploy-mode client --class same_main_class --master cluster_master_url  prints out empty join 
3) spark-submit --deploy-mode cluster --class same_main_class --master cluster_master_url  also prints out empty join 

The Spark version deployed is 1.5.1, and the same version is declared as a Maven dependency. I've also tried 1.5.2 and 1.6.0, redeploying the cluster etc. I've spent a few days trying to troubleshoot this, to no avail. I print out a count of each RDD that I'm joining, and it always gives me the correct number. Only the join fails, and only when I submit it as a job to the cluster, regardless of where the Spark driver runs.

Can anybody give me some pointers on how to debug this? I assume the RDDs are being partitioned and shuffled behind the scenes, but something in that process is not behaving correctly. There aren't any exceptions, errors or even warnings, and I have no clue why the join would be empty. Again: identical code works when run as a standalone program, but not when submitted to the cluster.
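
One check that fits this symptom (correct counts, empty join, no errors) is whether the join keys compare equal by content once they have been serialized and shuffled. A shuffle-based join matches keys through hashCode() and equals(), and Java arrays implement both by object identity. The sketch below is plain Java with no Spark or HBase in it. It assumes, purely for illustration, that the row key is kept as a raw byte[]; the post does not say which key type is used. The class KeyJoinCheck and its methods are hypothetical names, not part of any library.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for a keyed join; not Spark code.
class KeyJoinCheck {

    // Builds (rowKey, value) pairs. Every key is a freshly allocated byte[],
    // similar to keys that were deserialized separately on each side.
    static List<Map.Entry<byte[], String>> rows(String prefix, String... keys) {
        List<Map.Entry<byte[], String>> out = new ArrayList<>();
        for (String k : keys) {
            out.add(new AbstractMap.SimpleEntry<>(
                    k.getBytes(StandardCharsets.UTF_8), prefix + k));
        }
        return out;
    }

    // Hash join: index the left side by keyFn(key), then probe with the right side.
    // Matching relies on hashCode()/equals() of whatever keyFn returns.
    static <K> List<String> join(List<Map.Entry<byte[], String>> left,
                                 List<Map.Entry<byte[], String>> right,
                                 Function<byte[], K> keyFn) {
        Map<K, String> index = new HashMap<>();
        for (Map.Entry<byte[], String> e : left) {
            index.put(keyFn.apply(e.getKey()), e.getValue());
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<byte[], String> e : right) {
            String match = index.get(keyFn.apply(e.getKey()));
            if (match != null) {
                out.add(match + "|" + e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<byte[], String>> a = rows("a:", "row1", "row2");
        List<Map.Entry<byte[], String>> b = rows("b:", "row2", "row3");

        // Raw byte[] keys: identity-based equality, so nothing matches.
        System.out.println(join(a, b, k -> k));                  // prints []

        // Content-comparable keys: ByteBuffer compares the bytes themselves.
        System.out.println(join(a, b, k -> ByteBuffer.wrap(k))); // prints [a:row2|b:row2]
    }
}
```

If the keys in the real job are arrays or another type without content-based equals() and hashCode(), mapping them to a String or similar value type before the join would be the corresponding experiment. Printing a handful of keys from each RDD, together with their class names and hash codes, on both the driver and the executors is a cheap way to confirm or rule this out.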

I'm mainly looking for troubleshooting tips here! 

Thanks much in advance! 
Boris