Posted to user@spark.apache.org by Jesse F Chen <jf...@us.ibm.com> on 2015/08/27 23:46:21 UTC
tweet transformation ideas
This is a question on general usage/best practice: what is the best
transformation method to use for sentiment analysis on tweets?
Input:
- Tweets (e.g., "@xyz, sorry but this movie is poorly scripted
http://t.co/uyser876") - large data set, i.e., 1 billion tweets
- Sentiment dictionary (e.g., "sorry" -> positive score 0, negative
score 0.97) - fixed data set, 200K words
Output:
tweet   positive total   negative total
1       0.00             3.4
2       0.875            0.12
...
The implementation idea I have (and it works) is:
- turn the sentiment dictionary into a data frame (with a schema of "word
PosScore NegScore") and register it as a table
- turn the tweets into a data frame (a single "body" column)
- use Spark SQL to find matches and aggregate scores, like this:
SELECT t.body, sum(s.PosScore), sum(s.NegScore)
FROM TweetsDF t, sentiment_dictionaryDF s
WHERE not (t.body is null)
  AND locate(upper(s.SynsetTerms), upper(t.body)) > 0
GROUP BY t.body
LIMIT 100
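
For reference, a minimal sketch of the setup steps above in Spark 1.5
Scala, e.g. in spark-shell where sc and sqlContext are predefined; the
file paths and tab-separated layout are hypothetical stand-ins, and the
column names follow the query:

import sqlContext.implicits._

case class Sentiment(SynsetTerms: String, PosScore: Double, NegScore: Double)
case class Tweet(body: String)

// dictionary file: one "term<TAB>posScore<TAB>negScore" line per entry
// (path and format are hypothetical)
val sentimentDF = sc.textFile("sentiment_dict.tsv")
  .map(_.split("\t"))
  .map(f => Sentiment(f(0), f(1).toDouble, f(2).toDouble))
  .toDF()
sentimentDF.registerTempTable("sentiment_dictionaryDF")

// tweet file: one tweet body per line (path is hypothetical)
val tweetsDF = sc.textFile("tweets.txt").map(Tweet(_)).toDF()
tweetsDF.registerTempTable("TweetsDF")

// the query above then runs via sqlContext.sql("SELECT ...")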
The query works, but is EXTREMELY slow.
Though the 200K-word dictionary isn't that big, I suspect the LOCATE
function is really slow...
Or is using SQL completely the wrong tool here?
How else should I transform the tweets and/or the sentiment dictionary to
speed up the code?
JESSE CHEN
Big Data Performance | IBM Analytics
Email: jfchen@us.ibm.com
Re: tweet transformation ideas
Posted by Josh Rosen <jo...@databricks.com>.
What about using a subquery / EXISTS query and the new Spark 1.5 broadcast
hint on the dictionary data frame?
On Thu, Aug 27, 2015 at 8:44 PM, Holden Karau <ho...@pigscanfly.ca> wrote:
> It seems like this might be better suited to a broadcast hash map, since
> 200K entries isn't that big. You can then map over the tweets and look up
> each word in the broadcast map.
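
A minimal sketch of the broadcast-hint half of this suggestion, assuming
Spark 1.5+ Scala and stand-in DataFrame names (tweetsDF, sentimentDF) with
the column names from the original post. The original query has no
equi-join key, so Spark plans it as a product of 1B tweets × 200K
dictionary rows; broadcast() at least ships the small dictionary to every
executor instead of shuffling the huge tweet table:

import org.apache.spark.sql.functions._

val scored = tweetsDF
  .filter(col("body").isNotNull)
  // broadcast() (new in Spark 1.5) hints the planner to replicate the
  // 200K-row dictionary rather than shuffle the 1B-row tweet table
  .join(broadcast(sentimentDF),
        expr("locate(upper(SynsetTerms), upper(body)) > 0"))
  .groupBy("body")
  .agg(sum("PosScore"), sum("NegScore"))
  .limit(100)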
Re: tweet transformation ideas
Posted by Holden Karau <ho...@pigscanfly.ca>.
It seems like this might be better suited to a broadcast hash map, since
200K entries isn't that big. You can then map over the tweets and look up
each word in the broadcast map.
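
A minimal sketch of this approach, under assumed names: sc is the
SparkContext, tweets is an RDD[String] of tweet bodies, and dict is a
driver-side Map[String, (Double, Double)] of lowercased term ->
(posScore, negScore). The tokenization is a naive illustration, not a real
tweet tokenizer:

val dictBc = sc.broadcast(dict)  // 200K entries, shipped to each executor once

val scored = tweets
  .filter(_ != null)
  .map { body =>
    val hits = body.toLowerCase
      .split("\\W+")                      // naive word split, illustration only
      .flatMap(w => dictBc.value.get(w))  // O(1) local hash lookups, no join
    (body, hits.map(_._1).sum, hits.map(_._2).sum)
  }

Unlike the SQL cross product, this touches each tweet exactly once and
never compares a tweet against dictionary terms it doesn't contain.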
--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
Linked In: https://www.linkedin.com/in/holdenkarau