Posted to user@spark.apache.org by Jesse F Chen <jf...@us.ibm.com> on 2015/08/27 23:46:21 UTC

tweet transformation ideas


This is a question about general usage/best practice: what is the best
transformation method for sentiment analysis on tweets?

Input:
	Tweets (e.g., "@xyz, sorry but this movie is poorly scripted
http://t.co/uyser876") - large data set, i.e., 1 billion tweets
	Sentiment dictionary (e.g., "sorry" -> positive score 0, negative
score 0.97) - fixed data set, 200K words

Output:
	tweet	positive total		negative total
	1	0.00			3.4
	2	0.875			0.12
	...

The implementation idea I have (and it works) is:

- turn the sentiment dictionary into a DataFrame (with a schema of "word
PosScore NegScore") and register it as a table
- turn the tweets into a DataFrame with a single "body" column
- use Spark SQL to find matches and aggregate scores, like this:

SELECT t.body, sum(s.PosScore), sum(s.NegScore)
          FROM TweetsDF t, sentiment_dictionaryDF s
          WHERE not (t.body is null)
                and locate(upper(s.SynsetTerms), upper(t.body)) > 0
          GROUP BY t.body LIMIT 100
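
For concreteness, the setup those steps describe might look roughly like
this in Scala (a sketch only; the file paths and the tab-delimited parsing
are assumptions, not part of the actual code):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Dictionary DataFrame: term plus positive/negative scores.
case class Sentiment(SynsetTerms: String, PosScore: Double, NegScore: Double)
val dictDF = sc.textFile("sentiment_dictionary.txt")
  .map(_.split("\t"))
  .map(a => Sentiment(a(0), a(1).toDouble, a(2).toDouble))
  .toDF()
dictDF.registerTempTable("sentiment_dictionaryDF")

// Tweets DataFrame: a single "body" column.
case class Tweet(body: String)
val tweetsDF = sc.textFile("tweets.txt").map(Tweet(_)).toDF()
tweetsDF.registerTempTable("TweetsDF")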

This works, but it is extremely slow.

Though the 200K-word dictionary isn't that big, I suspect the LOCATE
predicate is the real cost: with no equi-join key the query is effectively
a cross join, so every tweet gets compared against all 200K dictionary
entries (on the order of 2 x 10^14 LOCATE calls for a billion tweets)...

Or is SQL completely the wrong tool to use here?

How else should I transform the tweets and/or the sentiment dictionary to
speed up the code?


JESSE CHEN
Big Data Performance | IBM Analytics
Email: jfchen@us.ibm.com


Re: tweet transformation ideas

Posted by Josh Rosen <jo...@databricks.com>.
What about using a subquery / exists query and the new 1.5 broadcast hint
on the dictionary data frame?
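
For what it's worth, a rough sketch of the broadcast-hint idea via the
DataFrame API (functions.broadcast is new in 1.5; tweetsDF/dictDF stand in
for the two DataFrames from the original post, and expr() is used because
the Scala locate() helper only takes a literal substring):

import org.apache.spark.sql.functions.{broadcast, expr, sum}

// Broadcast the small dictionary side so the billion-row tweet
// table is never shuffled for the join.
val scored = tweetsDF
  .filter("body is not null")
  .join(broadcast(dictDF), expr("locate(upper(SynsetTerms), upper(body)) > 0"))
  .groupBy("body")
  .agg(sum("PosScore"), sum("NegScore"))

The non-equi predicate still means every tweet is checked against every
dictionary entry (a broadcast nested-loop join), but without the shuffle.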

On Thu, Aug 27, 2015 at 8:44 PM, Holden Karau <ho...@pigscanfly.ca> wrote:

> It seems like this might be better suited to a broadcasted hash map since
> 200k entries isn't that big. You can then map over the tweets and lookup
> each word in the broadcasted map.

Re: tweet transformation ideas

Posted by Holden Karau <ho...@pigscanfly.ca>.
It seems like this might be better suited to a broadcasted hash map since
200k entries isn't that big. You can then map over the tweets and lookup
each word in the broadcasted map.
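
A minimal sketch of that approach (the whitespace tokenizer, the file path,
and the tweets RDD[String] are assumptions for illustration):

// Build the dictionary on the driver and broadcast it to the executors.
val dict: Map[String, (Double, Double)] =
  sc.textFile("sentiment_dictionary.txt")      // assumed: word \t pos \t neg
    .map(_.split("\t"))
    .map(a => (a(0).toUpperCase, (a(1).toDouble, a(2).toDouble)))
    .collect()
    .toMap
val dictB = sc.broadcast(dict)                 // 200K entries is tiny

// One pass over the tweets: tokenize, look up each word, sum the scores.
val scored = tweets.map { body =>
  val hits = body.toUpperCase.split("\\W+").flatMap(dictB.value.get(_))
  val (pos, neg) = hits.foldLeft((0.0, 0.0)) {
    case ((p, n), (ps, ns)) => (p + ps, n + ns)
  }
  (body, pos, neg)
}

This replaces 200K LOCATE calls per tweet with a few hash lookups per
tweet; multi-word dictionary terms would need extra handling.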


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
Linked In: https://www.linkedin.com/in/holdenkarau