Posted to user@spark.apache.org by Gaspar Muñoz <gm...@stratio.com> on 2015/08/08 12:58:52 UTC

Pagination on big table, splitting joins

Hi,

I have two different parts in my system.

1. A batch application that every x minutes runs SQL queries across several
tables containing millions of rows to compose an entity, and sends those
entities to Kafka.
2. A streaming application that processes the data from Kafka.

Now I have the entire system working, but I want to improve the performance
of the batch part: if I have 100 million entities, I currently send them all
to Kafka in a single foreach, which makes no sense for the downstream
streaming application. Instead I want to send the events to Kafka in batches
of, say, 10 million.

Imagine I have a query like:

*select ... from table 1 left outer join table 2 on ... left outer join
table 3 on ... left outer join table 4 on ...*

My goal is to do *pagination* on table 1: take 10 million rows into a
separate RDD, do the joins, and send the result to Kafka; then take the next
10 million and do the same. All the tables are stored in Parquet format on
HDFS.

I am thinking of using the *toLocalIterator* method, something like the code
below, but I have doubts about memory usage and parallelism, and I am sure
there is a better way to do it.

rdd.toLocalIterator.grouped(10000000).foreach { seq =>
  val chunk: RDD[(String, Int)] = sc.parallelize(seq)
  // Do the processing
}

What do you think?

Regards.

-- 

Gaspar Muñoz
@gmunozsoria


<http://www.stratio.com/>
Vía de las dos Castillas, 33, Ática 4, 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 352 59 42 // *@stratiobd <https://twitter.com/StratioBD>*

Re: Pagination on big table, splitting joins

Posted by Michael Armbrust <mi...@databricks.com>.
>
> I am thinking of using the *toLocalIterator* method, something like the
> code below, but I have doubts about memory usage and parallelism, and I am
> sure there is a better way to do it.
>

It will still run all earlier parts of the job in parallel.  Only the
actual retrieval of the final partitions will be serial.  This is how we
do pagination in the Spark SQL JDBC server.
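To illustrate the pattern Michael describes, here is a minimal sketch (assuming
Spark 1.4-era APIs, made-up HDFS paths and join keys, and a hypothetical
sendToKafka helper standing in for whatever producer code the batch job already
uses): the joins are computed once, in parallel, across the cluster; only the
final fetch of partitions to the driver via toLocalIterator is serial, and
grouped(n) batches that stream into fixed-size chunks.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Row, SQLContext}

object PaginatedPublish {
  // Hypothetical helper: however the batch job currently publishes to Kafka.
  def sendToKafka(rows: Seq[Row]): Unit = ???

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("paginated-publish"))
    val sqlContext = new SQLContext(sc)

    // Illustrative paths and join condition; substitute the real ones.
    val t1 = sqlContext.read.parquet("hdfs:///tables/table1")
    val t2 = sqlContext.read.parquet("hdfs:///tables/table2")
    val joined: DataFrame =
      t1.join(t2, t1("id") === t2("t1_id"), "left_outer")

    // Cache so the per-partition fetches below don't recompute the joins.
    joined.cache()

    // toLocalIterator pulls one partition at a time to the driver: the join
    // stages above still run in parallel, only this retrieval is serial.
    // grouped(n) materializes n rows at a time, so the driver heap only
    // needs to hold one chunk (here 10 million rows) at once.
    joined.rdd.toLocalIterator.grouped(10000000).foreach { chunk =>
      sendToKafka(chunk)
    }
  }
}
```

Note that each chunk is fully materialized in driver memory before it is sent,
so the chunk size has to fit in the driver's heap; the executors, meanwhile,
only ever hold cached partitions.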