Posted to user@spark.apache.org by francexo83 <fr...@gmail.com> on 2015/03/14 18:52:17 UTC
Spark and HBase join issue
Hi all,
I have the following cluster configurations:
- 5 nodes in a cloud environment.
- Hadoop 2.5.0.
- HBase 0.98.6.
- Spark 1.2.0.
- 8 cores and 16 GB of RAM on each host.
- 1 NFS disk with 300 IOPS mounted on hosts 1 and 2.
- 1 NFS disk with 300 IOPS mounted on hosts 3, 4, and 5.
I tried to run a Spark job in cluster mode that computes the left outer
join between two HBase tables.
The first table stores about 4.1 GB of data spread across 3 regions with
Snappy compression.
The second one stores about 1.2 GB of data spread across 22 regions with
Snappy compression.
I sometimes lose executors during the shuffle phase of the last
stage (saveAsHadoopDataset).
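For context, a minimal sketch of what such a job might look like in Scala
against the Spark 1.2 / HBase 0.98 APIs (the table names, column family,
and written payload are placeholders, not the actual job):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Put, Result}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("hbase-left-outer-join"))

// Read an HBase table as an RDD keyed by row key.
def readTable(name: String) = {
  val conf = HBaseConfiguration.create()
  conf.set(TableInputFormat.INPUT_TABLE, name)
  sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
    classOf[ImmutableBytesWritable], classOf[Result])
    .map { case (k, v) => (Bytes.toString(k.get()), v) }
}

val left  = readTable("table_a")   // ~4.1 GB, 3 regions
val right = readTable("table_b")   // ~1.2 GB, 22 regions

// Left outer join on row key; set the partition count explicitly so the
// shuffle does not inherit the small input partition counts (3 and 22).
val joined = left.leftOuterJoin(right, 100)

// Write back to HBase through the old MapReduce API, which is the
// saveAsHadoopDataset stage mentioned above.
val jobConf = new JobConf(HBaseConfiguration.create())
jobConf.setOutputFormat(classOf[TableOutputFormat])
jobConf.set(TableOutputFormat.OUTPUT_TABLE, "table_out")

joined.map { case (rowKey, (l, r)) =>
  val put = new Put(Bytes.toBytes(rowKey))
  put.add(Bytes.toBytes("cf"), Bytes.toBytes("matched"),
    Bytes.toBytes(r.isDefined))          // placeholder payload
  (new ImmutableBytesWritable, put)
}.saveAsHadoopDataset(jobConf)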
Below is my Spark conf:
num-cpu-cores = 20
memory-per-node = 10G
spark.scheduler.mode = FAIR
spark.scheduler.pool = production
spark.shuffle.spill=true
spark.rdd.compress = true
spark.core.connection.auth.wait.timeout=2000
spark.sql.shuffle.partitions=100
spark.default.parallelism=50
spark.speculation=false
spark.shuffle.memoryFraction=0.1
spark.cores.max=30
spark.driver.memory=10g
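If it helps, here is roughly how the application-level settings above would
be applied programmatically (equivalent to passing each one with --conf to
spark-submit); this is just a transcription of the list, not a recommendation:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("hbase-left-outer-join")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.pool", "production")
  .set("spark.shuffle.spill", "true")
  .set("spark.rdd.compress", "true")
  .set("spark.core.connection.auth.wait.timeout", "2000")
  .set("spark.sql.shuffle.partitions", "100")
  .set("spark.default.parallelism", "50")
  .set("spark.speculation", "false")
  .set("spark.shuffle.memoryFraction", "0.1")
  .set("spark.cores.max", "30")
  // Driver memory must be set before the driver JVM starts, so in
  // practice this one belongs on spark-submit (--driver-memory 10g).
  .set("spark.driver.memory", "10g")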
Are the resources too low to handle this kind of operation?
If so, could you share the right configuration to perform this
kind of task?
Thank you in advance.
F.
Re: Spark and HBase join issue
Posted by Ted Yu <yu...@gmail.com>.
The 4.1 GB table has 3 regions. This means there would be at least 2
nodes that don't host any of its regions.
Can you split this table into 12 (or more) regions?
BTW what's the value for spark.yarn.executor.memoryOverhead ?
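For reference, a minimal sketch of splitting an existing table through the
HBase 0.98 HBaseAdmin API ('table_a' and the split keys below are
placeholders; real split points should follow the table's key
distribution). The memory overhead can be raised on submission, e.g.
--conf spark.yarn.executor.memoryOverhead=1024.

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.util.Bytes

val admin = new HBaseAdmin(HBaseConfiguration.create())
try {
  // Each call splits the region containing the given key; repeat with
  // enough split points to reach the desired region count. Splits are
  // asynchronous, so check the region count afterwards.
  Seq("2", "4", "6", "8").foreach { key =>
    admin.split(Bytes.toBytes("table_a"), Bytes.toBytes(key))
  }
} finally {
  admin.close()
}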
Cheers
On Sat, Mar 14, 2015 at 10:52 AM, francexo83 <fr...@gmail.com> wrote:
> Hi all,
>
>
> I have the following cluster configurations:
>
>
> - 5 nodes in a cloud environment.
> - Hadoop 2.5.0.
> - HBase 0.98.6.
> - Spark 1.2.0.
> - 8 cores and 16 GB of RAM on each host.
> - 1 NFS disk with 300 IOPS mounted on hosts 1 and 2.
> - 1 NFS disk with 300 IOPS mounted on hosts 3, 4, and 5.
>
> I tried to run a Spark job in cluster mode that computes the left outer
> join between two HBase tables.
> The first table stores about 4.1 GB of data spread across 3 regions
> with Snappy compression.
> The second one stores about 1.2 GB of data spread across 22 regions with
> Snappy compression.
>
> I sometimes lose executors during the shuffle phase of the last
> stage (saveAsHadoopDataset).
>
> Below is my Spark conf:
>
> num-cpu-cores = 20
> memory-per-node = 10G
> spark.scheduler.mode = FAIR
> spark.scheduler.pool = production
> spark.shuffle.spill=true
> spark.rdd.compress = true
> spark.core.connection.auth.wait.timeout=2000
> spark.sql.shuffle.partitions=100
> spark.default.parallelism=50
> spark.speculation=false
> spark.shuffle.memoryFraction=0.1
> spark.cores.max=30
> spark.driver.memory=10g
>
> Are the resources too low to handle this kind of operation?
>
> If so, could you share the right configuration to perform this
> kind of task?
>
> Thank you in advance.
>
> F.