Posted to user@spark.apache.org by Ascot Moss <as...@gmail.com> on 2015/10/20 17:39:06 UTC

Spark: How to find similar text title

Hi,

I have an RDD that stores the titles of some articles:
1. "About Spark Streaming"
2. "About Spark MLlib"
3. "About Spark SQL"
4. "About Spark Installation"
5. "Kafka Streaming"
6. "Kafka Setup"
7. ....

I need to build a model that finds titles by similarity.
For example, given "About Spark", I hope to get:

"About Spark Installation", 0.98622 (where 0.98622 is the similarity
score, ranging from 0 to 1)
"About Spark MLlib", 0.95394
"About Spark Streaming", 0.94332
"About Spark SQL", 0.9111

Any ideas or references on how to do this?

Thanks
Ascot






Re: Spark: How to find similar text title

Posted by Sonal Goyal <so...@gmail.com>.
Do you want to compare titles within the RDD, or do you have some external
list or data coming in?

For matching, you could look at string edit distance or cosine similarity
if you are only comparing title strings.
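
As a rough illustration of the two measures mentioned above, here is a
minimal sketch in plain Python. It is not taken from this thread or from
any Spark API: the function names (char_ngrams, cosine_similarity,
edit_similarity, rank_titles) are hypothetical, and the use of character
trigrams as features is an assumption made for the example. The scores it
yields fall between 0 and 1 but will not match the illustrative numbers in
the question.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count lowercase character n-grams of text, padded with one space each side."""
    padded = " " + text.lower() + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_similarity(a, b, n=3):
    """Cosine similarity between the n-gram count vectors of a and b (0 to 1)."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def edit_similarity(a, b):
    """1 minus the Levenshtein distance divided by the longer length (0 to 1)."""
    a, b = a.lower(), b.lower()
    if not a and not b:
        return 1.0
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def rank_titles(query, titles, score=cosine_similarity):
    """Return (title, score) pairs sorted from most to least similar."""
    scored = [(title, score(query, title)) for title in titles]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    titles = [
        "About Spark Streaming",
        "About Spark MLlib",
        "About Spark SQL",
        "About Spark Installation",
        "Kafka Streaming",
        "Kafka Setup",
    ]
    for title, value in rank_titles("About Spark", titles):
        print(f"{title!r}, {value:.5f}")
```

If the titles live in an RDD (called titles_rdd here purely for
illustration), the same scoring function could probably be applied with
something like titles_rdd.map(lambda t: (t, cosine_similarity(query, t)))
and the result sorted by score. Whether that scales depends on how many
titles and queries there are.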