Posted to user@spark.apache.org by Donni Khan <pr...@googlemail.com> on 2018/04/04 08:56:46 UTC

run huge number of queries in Spark

Hi all,

I want to run a huge number of queries on a DataFrame in Spark. I have a
large collection of text documents; I loaded all of them into a Spark
DataFrame and created a temp table.

dataFrame.registerTempTable("table1");

I have more than 50,000 terms, and I want to get the document frequency
for each of them using "table1".

I use the following:

DataFrame df = sqlContext.sql("select count(ID) from table1 where text
like '%" + term + "%'");

but this scenario takes a long time to finish, because I have to run a
separate query from the Spark driver for each term.
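To make the shape of the problem concrete, here is a simplified sketch in
plain Java collections (sample data and names are illustrative, not from
the actual job): every term triggers its own full scan of the corpus, so
50,000 terms means 50,000 scans, all issued sequentially from the driver.

```java
import java.util.*;

public class PerTermScan {
    // One full pass over all documents *per* term -- the same access
    // pattern as issuing one LIKE query per term from the driver.
    static Map<String, Long> perTermCounts(List<String> docs, List<String> terms) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String term : terms) {
            counts.put(term, docs.stream().filter(d -> d.contains(term)).count());
        }
        return counts;
    }

    public static void main(String[] args) {
        // Illustrative stand-ins for the real corpus and term list.
        List<String> docs = Arrays.asList(
            "spark sql engine", "spark streaming", "flink engine");
        List<String> terms = Arrays.asList("spark", "engine", "flink");
        // With 50,000 terms this loop performs 50,000 corpus scans.
        System.out.println(perTermCounts(docs, terms)); // prints {spark=2, engine=2, flink=1}
    }
}
```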


Does anyone have an idea how I can run all the queries in a distributed way?

Thank you && Best Regards,

Donni

Re: run huge number of queries in Spark

Posted by Georg Heiler <ge...@gmail.com>.
See https://gist.github.com/geoHeil/e07922229860262ceebf830859716bbf in
particular:

You will probably want to use Spark's imperative (non-SQL) API:
.rdd
.reduceByKey { (count1, count2) => count1 + count2 }
.map { case ((word, path), n) => (word, (path, n)) }
.toDF
i.e. it builds an inverted index, which then lets you easily calculate
TF/IDF.
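In plain Java collections, the inverted-index idea looks roughly like
this (sample data is illustrative; on Spark the same shape runs as a
single flatMap/reduceByKey pass over the whole corpus instead of one
query per term):

```java
import java.util.*;
import java.util.stream.*;

public class DocFrequency {
    // Emit each distinct term of each document once, then count how
    // many documents every term appears in: that count is the
    // document frequency, computed for all terms in one pass.
    static Map<String, Long> documentFrequency(Collection<String> docs) {
        return docs.stream()
            .flatMap(text -> Arrays.stream(text.split("\\s+")).distinct())
            .collect(Collectors.groupingBy(t -> t, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "spark makes queries fast",
            "spark runs queries in parallel",
            "drivers issue queries");
        Map<String, Long> df = documentFrequency(docs);
        System.out.println(df.get("queries")); // prints 3
        System.out.println(df.get("spark"));   // prints 2
    }
}
```

Note that this counts whole-token matches, which is the usual meaning of
document frequency, whereas LIKE '%term%' also matches substrings.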
But Spark also comes with
https://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
which might help you achieve the desired result easily.
