Posted to user@spark.apache.org by Gordon Benjamin <go...@gmail.com> on 2014/11/20 18:17:02 UTC
Incremental loading data slows performance
Hi,
We are seeing bad performance as we incrementally load data. Here is the
configuration:
Spark standalone cluster
spark01 (spark master, shark, hadoop namenode): 15GB RAM, 4 vCPUs
spark02 (spark worker, hadoop datanode): 15GB RAM, 8 vCPUs
spark03 (spark worker): 15GB RAM, 8 vCPUs
spark04 (spark worker): 15GB RAM, 8 vCPUs
spark worker configuration:
spark.local.dir=/path/to/ssd/disk
spark.default.parallelism=64
spark.executor.memory=10g
spark.serializer=org.apache.spark.serializer.KryoSerializer
shark configuration:
spark.kryoserializer.buffer.mb=64
mapred.reduce.tasks=30
spark.scheduler.mode=FAIR
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.default.parallelism=64
Performance decreases as more data is loaded into Spark. A simple query
like this:
select count(*) from customers_cached
took 0.5 seconds on 12th Nov and takes 4.24 seconds now.
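To quantify the regression, it can help to time the same query repeatedly and watch the trend rather than rely on single runs. Below is a hedged sketch; `run_query` is a hypothetical stand-in for however you submit SQL to Shark (e.g. through its Thrift/JDBC interface), not a real client API.

```python
import time

def run_query(sql):
    """Hypothetical stand-in: submit `sql` to Shark and wait for the result.
    Replace the body with your actual client call (e.g. Thrift/JDBC)."""
    time.sleep(0.01)  # placeholder standing in for real query latency

def time_query(sql, runs=5):
    """Run `sql` several times and return per-run wall-clock seconds."""
    timings = []
    for _ in range(runs):
        start = time.time()
        run_query(sql)
        timings.append(time.time() - start)
    return timings

timings = time_query("select count(*) from customers_cached")
print("min/max seconds: %.3f / %.3f" % (min(timings), max(timings)))
```

Logging these numbers once a day would show whether the slowdown is gradual (suggesting accumulating state, e.g. growing file counts) or stepwise (suggesting a one-off change).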
These warnings appear all over the log:
2014-11-20 16:56:42,125 WARN parse.TypeCheckProcFactory
(TypeCheckProcFactory.java:convert(180)) - Invalid type entry TOK_INT=null
2014-11-20 16:56:51,988 WARN parse.TypeCheckProcFactory
(TypeCheckProcFactory.java:convert(180)) - Invalid type entry
TOK_TABLE_OR_COL=null
Does anyone have any ideas to help us resolve this? I can post anything you need.
Re: Incremental loading data slows performance
Posted by Gordon Benjamin <go...@gmail.com>.
To follow up:
I asked the developer how we incrementally load data; the response was: no,
union only for updated records (every night).
The per-minute export runs this algorithm:
1. Upload the file to Hadoop.
2. load data inpath ... overwrite into table ...._incremental;
3. insert into table ..._cached from ..._incremental
Perhaps this helps in understanding our issue.
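One plausible cause of the slowdown, stated as an assumption rather than something confirmed by the logs: each per-minute INSERT INTO appends at least one new small file to the cached table (typical Hive-style append behavior), so the file count, and with it the number of scan tasks, grows steadily, and a full scan like count(*) gets slower even if the data volume barely changes. A minimal sketch of that arithmetic:

```python
# Sketch: how per-minute appends inflate the number of files to scan.
# Assumes each INSERT INTO ..._cached writes at least one new file,
# which is typical for Hive-style append-only inserts.
MINUTES_PER_DAY = 24 * 60

def files_after(days, inserts_per_minute=1):
    """Files accumulated by minute-level inserts over `days` days."""
    return days * MINUTES_PER_DAY * inserts_per_minute

# Nov 12 -> Nov 20 is roughly 8 days of minute-level inserts.
files = files_after(8)
print(files)  # 11520 small files, each needing at least one scan task
```

If this matches your setup, periodically compacting the cached table (for example, rewriting it with INSERT OVERWRITE) or batching the inserts would keep the file and task count bounded.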
On Thursday, November 20, 2014, Gordon Benjamin <go...@gmail.com>
wrote: