Posted to user@spark.apache.org by xn...@o2.pl on 2014/09/17 15:57:35 UTC
Sentiment analysis of a list of documents - problem with defining the proper approach in Spark
Hi,
For the last few days I have been working on an exercise where I want to understand the sentiment of a set of articles.
As input I have an XML file with the articles and the AFINN-111.txt file, which defines sentiment values for a few hundred words.
Loading the data and putting it into structures (classes for the articles, (word, sentiment-value) tuples for the sentiments) works without any problem.
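For context, the loading step for the lexicon looks roughly like this -- a sketch, assuming AFINN-111.txt is tab-separated (one "word<TAB>value" pair per line), with sc as the SparkContext and the file path made up:

```scala
// Parse AFINN-111.txt into (word, sentimentValue) tuples.
// Each line is "word\tvalue", e.g. "abandon\t-2".
val sentiments = sc.textFile("AFINN-111.txt").map { line =>
  val Array(word, value) = line.split("\t")
  (word, value.toInt)
}
```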
Then what I think I need to do (from a logical point of view) is:
foreach article
    articleWords = split the body by " "
    join the two lists (articleWords and sentimentWords) together
    calculate the sentiment for the article by summing up the sentiments of all words it includes
    dump the (article id, sentiment) pair into a flat file
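Expressed as one distributed join instead of a per-article loop, I imagine the steps above could look something like this sketch (assuming articles is an RDD of (articleId, body) pairs and sentiments the (word, value) tuples -- both names are just placeholders for my actual structures):

```scala
// One (word, articleId) pair per word occurrence in each article.
val wordsByArticle = articles.flatMap { case (id, body) =>
  body.split(" ").map(word => (word, id))
}

// Join against the lexicon, then sum the sentiment values per article.
val articleSentiment = wordsByArticle
  .join(sentiments)                              // (word, (articleId, value))
  .map { case (_, (id, value)) => (id, value) }  // (articleId, value)
  .reduceByKey(_ + _)                            // total sentiment per article

// Dump "articleId,sentiment" lines to a flat file.
articleSentiment
  .map { case (id, total) => s"$id,$total" }
  .saveAsTextFile("sentiment-output")
```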
And this is where I am stuck :) I tried multiple combinations of map/reduceByKey, which either didn't make much sense (e.g. they produced a single sentiment for all articles combined) or failed with errors saying a function cannot be serialised. Today I even tried to implement this with a brute-force approach:
articles.foreach(calculateSentiment)
where calculateSentiment looks like below:
val words = sc.parallelize(post.body.split(" ")) // split body by " "
val wordPairs = words.map(w => (w, 1)).reduceByKey(_+_, 1) // create tuples of (word, #occurrences in article)
val joinedValues = wordPairs.join(sentiments_) // join
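One alternative I am wondering about is avoiding the creation of an RDD inside the loop altogether (I suspect calling sc.parallelize inside code that runs on the executors is part of the problem, since the SparkContext lives only on the driver): collect the fairly small lexicon to the driver, broadcast it, and score each article with plain Scala collections. A sketch, again assuming articles is an RDD of (id, body) pairs:

```scala
// Broadcast the lexicon so each executor gets one read-only copy.
val lexicon = sc.broadcast(sentiments.collectAsMap())

// Score each article locally; words not in the lexicon count as 0.
val scored = articles.map { case (id, body) =>
  val total = body.split(" ").map(w => lexicon.value.getOrElse(w, 0)).sum
  (id, total)
}
```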
But somehow I had a feeling this was not the best idea, and I think I was right, since the job has been running for about an hour (and I have only a few hundred GB to process).
So the question is: what am I doing wrong? Any hints or suggestions on the direction to take are really appreciated!
Thank you,
Leszek
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org