Posted to user@spark.apache.org by Alchemist <al...@gmail.com> on 2018/05/19 22:40:07 UTC

Spark UNEVENLY distributing data

I am trying to parallelize a simple Spark program that processes HBase data in parallel.

// Get HBase RDD
    JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = jsc
            .newAPIHadoopRDD(conf, TableInputFormat.class,
                    ImmutableBytesWritable.class, Result.class);
    long count = hBaseRDD.count();

The only two lines I see in the logs are that ZooKeeper starts and ZooKeeper stops.

The problem is that my program is as SLOW as the largest bar. I found that ZooKeeper takes a long time before shutting down:

18/05/19 17:26:55 INFO zookeeper.ClientCnxn: Session establishment complete on server :2181, sessionid = 0x163662b64eb046d, negotiated timeout = 40000
18/05/19 17:38:00 INFO zookeeper.ZooKeeper: Session: 0x163662b64eb046d closed

Re: Spark UNEVENLY distributing data

Posted by Saad Mufti <sa...@gmail.com>.
I think TableInputFormat will try to maintain as much locality as possible,
assigning one Spark partition per region and trying to assign that
partition to a YARN container/executor on the same node (assuming you're
using Spark over YARN). So the reason for the uneven distribution could be
that your HBase is not balanced to begin with and has too many regions on
the same region server corresponding to your largest bar. It all depends on
what HBase balancer you have configured and tuned. Assuming that is
properly configured, try to balance your HBase cluster before running the
Spark job. There are commands in the hbase shell to do it manually if required.
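As a sketch, manually triggering a balancer run from the hbase shell could
look like the following (these are standard hbase shell commands; whether a
run actually moves regions depends on the balancer implementation and
thresholds configured on your cluster):

```shell
# Open the hbase shell and run balancer commands against the live cluster.
hbase shell <<'EOF'
# Check whether the balancer is currently enabled
balancer_enabled
# Enable it if it was switched off
balance_switch true
# Ask the master to run the balancer now; returns true if a run was started
balancer
# Inspect per-regionserver load to confirm regions are spread out
status 'detailed'
EOF
```

After the regions are evenly spread, re-running the Spark job should give
partitions of roughly comparable size, since TableInputFormat creates one
split per region.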

Hope this helps.

----
Saad


On Sat, May 19, 2018 at 6:40 PM, Alchemist <al...@gmail.com>
wrote:

> I am trying to parallelize a simple Spark program processes HBASE data in
> parallel.
>
> // Get Hbase RDD
>     JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = jsc
>             .newAPIHadoopRDD(conf, TableInputFormat.class,
>                     ImmutableBytesWritable.class, Result.class);
>     long count = hBaseRDD.count();
>
> Only two lines I see in the logs.  Zookeeper starts and Zookeeper stops
>
>
> The problem is my program is as SLOW as the largest bar. Found that ZK is taking long time before shutting.
> 18/05/19 17:26:55 INFO zookeeper.ClientCnxn: Session establishment complete on server :2181, sessionid = 0x163662b64eb046d, negotiated timeout = 40000 18/05/19
> 17:38:00 INFO zookeeper.ZooKeeper: Session: 0x163662b64eb046d closed
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
