Posted to user@spark.apache.org by xn...@o2.pl on 2014/09/17 15:57:35 UTC

sentiment analysis of a list of documents - problem with defining the proper approach with Spark

Hi,

For the last few days I have been working on an exercise where I want to understand the sentiment of a set of articles.

As input I have an XML file with the articles and the AFINN-111.txt file, which defines the sentiment of a few hundred words.

What I am able to do without any problem is load the data and put it into structures (classes for the articles, (word, sentiment-value) tuples for the sentiments).

Then what I think I need to do (from a logical point of view) is:

foreach article
   articleWords = split the body by " " 
   join the two lists (articleWords and sentimentWords) together.
   calculate the sentiment for the article by summing up sentiments of all words that it includes
dump the article id, sentiment into a flat file
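
(Just to make that concrete, here is a rough sketch of what I have in mind; the names articles, sentimentWords and the Article fields are only placeholders, and I assume the AFINN pairs are small enough to keep in a plain Map on the driver:)

// Assumption: articles is an RDD[Article] where Article has an id and a body,
// and sentimentWords is a local Map[String, Int] built from AFINN-111.txt.
// The lexicon is small, so it can be shipped to the executors as a broadcast variable.
val sentimentsBc = sc.broadcast(sentimentWords)

val articleSentiments = articles.map { article =>
  val words = article.body.split(" ")
  // sum the sentiment of every word that appears in the lexicon
  val score = words.map(w => sentimentsBc.value.getOrElse(w, 0)).sum
  (article.id, score)
}

articleSentiments.saveAsTextFile("article-sentiments") // one (id, sentiment) pair per line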

And this is where I am stuck :) I tried multiple combinations of map/reduceByKey; they all either didn't make much sense (e.g. computing the sentiment for all articles combined) or failed with errors saying a function could not be serialised. Today I even tried to implement this with a brute-force approach:

articles.foreach(calculateSentiment)

where calculateSentiment looks like this:

val words = sc.parallelize(post.body.split(" "))             // split the body by " "
val wordPairs = words.map(w => (w, 1)).reduceByKey(_ + _, 1) // (word, #occurrences in article) tuples
val joinedValues = wordPairs.join(sentiments_)               // join with the (word, sentiment) pairs

But somehow I had a feeling this is not the best idea, and I think I was right, since the job has now been running for about an hour (and I have only a few hundred GB to process).
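
(If joining against the sentiment list really is the way to go, I guess it could also be expressed once over all articles instead of per article, roughly like the sketch below; here I assume sentiments_ is an RDD[(String, Int)] of (word, sentiment) pairs and articles is an RDD of the article objects. But I am not sure this is the right direction either.)

// one pass over all articles: emit (word, articleId) pairs, join them with the
// (word, sentiment) pairs, then sum the sentiment per article id
val wordArticlePairs = articles.flatMap(a => a.body.split(" ").map(w => (w, a.id)))
val perArticle = wordArticlePairs
  .join(sentiments_)                                                  // (word, (articleId, sentiment))
  .map { case (_, (articleId, sentiment)) => (articleId, sentiment) }
  .reduceByKey(_ + _)                                                 // total sentiment per article
perArticle.saveAsTextFile("article-sentiments-join")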

So the question is: what am I doing wrong? Any hints or suggestions on the right direction are really appreciated!

Thank you,
Leszek



---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org