Posted to user@spark.apache.org by Joe L <se...@yahoo.com> on 2014/05/20 06:07:26 UTC

facebook data mining with Spark

Is there any way to get Facebook data into Spark and filter its content?



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/facebook-data-mining-with-Spark-tp6072.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: facebook data mining with Spark

Posted by Mayur Rustagi <ma...@gmail.com>.
Are you looking to connect to it as a streaming source?
You should be able to integrate it much like the Twitter API.
Regards
Mayur

Re: facebook data mining with Spark

Posted by Michael Cutler <mi...@tumra.com>.
Hello Joe,

The first step is acquiring some data, either through the Facebook API
<https://developers.facebook.com/> or a third-party service like
Datasift <https://datasift.com/> (paid).  Once you have the data somewhere
Spark can access it (such as HDFS), you can load and manipulate it just
like any other data.

Here is a pretty-printed example JSON message I got from a Datasift
<https://datasift.com/> stream this morning.  It shows an anonymised user
with *clearly too much time on their hands* who has reached *level 576* on
Candy Crush Saga.

{
    "demographic": {
        "gender": "mostly_female"
    },
    "facebook": {
        "application": "Candy Crush Saga",
        "author": {
            "type": "user",
            "hash_id": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
        },
        "caption": "I just completed level 576, scored 494020 points and got 3 stars.",
        "created_at": "Tue, 20 May 2014 03:08:09 +0000",
        "description": "Click here to follow my progress!",
        "id": "100000000000000_123456789012345",
        "link": "http://apps.facebook.com/candycrush/?urlMessage=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
        "name": "Yay, I completed level 576 in Candy Crush Saga!",
        "source": "Candy Crush Saga (123456789012345)",
        "type": "link"
    },
    "interaction": {
        "schema": {
            "version": 3
        },
        "type": "facebook",
        "id": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
        "created_at": "Tue, 20 May 2014 03:08:09 +0000",
        "received_at": 1400555303.6832,
        "author": {
            "type": "user",
            "hash_id": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
        },
        "title": "Yay, I completed level 576 in Candy Crush Saga!",
        "link": "http://www.facebook.com/100000000000000_123456789012345",
        "subtype": "link",
        "content": "Click here to follow my progress!",
        "source": "Candy Crush Saga (123456789012345)"
    },
    "language": {
        "tag": "en",
        "tag_extended": "en",
        "confidence": 97
    }
}

Much like Twitter streams, the data arrives as a single JSON object on
each line, so you need to pass the RDD[String] you get from textFile
through a JSON parser.  Spark has the json4s
<https://github.com/json4s/json4s> and Jackson JSON parsers embedded in
the assembly, so you can use them for 'free' without having to bundle
them in your JAR.

Here is an example Spark job which answers the age-old question: "Who is
better at Candy Crush, boys or girls?"

    // Imports assumed by this snippet: json4s with the Jackson backend
    import org.json4s._
    import org.json4s.jackson.JsonMethods._

    // We want to extract the level number from
    // "Yay, I completed level 576 in Candy Crush Saga!"
    // The actual text changes with the user's language, but taking
    // the last number in the title works
    val pattern = """(\d+)""".r

    // Produces an RDD[String]; sc is the SparkContext
    val lines = sc.textFile("facebook-2014-05-19.json")
    lines.map(line => {
      // Parse the JSON
      parse(line)
    }).filter(json => {
      // Keep only 'Candy Crush Saga' activity
      json \ "facebook" \ "application" == JString("Candy Crush Saga")
    }).map(json => {
      // Extract the 'level' (the last number in the title) or default to zero
      val title = compact(json \ "interaction" \ "title")
      val level = pattern.findAllIn(title).toList.lastOption.map(_.toInt).getOrElse(0)
      // Extract the gender (compact keeps the JSON quotes, e.g. "female")
      val gender = compact(json \ "demographic" \ "gender")
      // Emit a pair (gender: String, (level: Int, count: Int))
      ( gender, (level, 1) )
    }).filter(a => {
      // Filter out entries with a level of zero
      a._2._1 > 0
    }).reduceByKey( (a, b) => {
      // Sum the levels and counts so we can average later
      ( a._1 + b._1, a._2 + b._2 )
    }).collect().foreach(entry => {
      // Print the results
      val gender = entry._1
      val values = entry._2
      // Integer division: the average is truncated to a whole level
      val average = values._1 / values._2
      println(gender + ": average=" + average + ", count=" + values._2 )
    })


See more: https://gist.github.com/cotdp/fda64b4248e43a3c8f46
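
The level-extraction step can be tried on its own, without Spark or json4s.
The following is a minimal sketch using only the Scala standard library; the
object and method names (LevelParser, lastNumber) are hypothetical and are
not part of the gist.

```scala
object LevelParser {
  // Matches any run of digits, the same pattern the job uses
  private val Digits = """(\d+)""".r

  // Returns the last number found in the title, or 0 when there is none.
  // lastNumber is a hypothetical helper name, not a Spark or json4s API.
  def lastNumber(title: String): Int =
    Digits.findAllIn(title).toList.lastOption.map(_.toInt).getOrElse(0)
}
```

For example, LevelParser.lastNumber("Yay, I completed level 576 in Candy
Crush Saga!") returns 576, and a title with no digits returns 0.  Note that
toInt throws on a digit run too long to fit in an Int, so real data may need
extra guarding.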

If you run this on a small sample of data you get results like this:


   - "female": average=114, count=15422
   - "male": average=104, count=14727

In other words, in this sample the average level reached by women is
slightly higher than the average reached by men.
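
The averages above are whole numbers because the job divides one Int by
another.  If fractional averages are wanted, the sum-and-count logic can be
sketched on plain Scala collections as below.  This is only an illustration
under assumptions: GenderAverages and averageByGender are hypothetical
names, and the input is taken to be the (gender, level) pairs produced by
the map step.

```scala
object GenderAverages {
  // Input: (gender, level) pairs, as produced by the map step of the job.
  // Output: gender -> (average level as a Double, number of entries).
  def averageByGender(entries: Seq[(String, Int)]): Map[String, (Double, Int)] =
    entries
      .filter { case (_, level) => level > 0 }   // drop entries with no level
      .groupBy { case (gender, _) => gender }    // collect entries per gender
      .map { case (gender, group) =>
        val levels = group.map(_._2)
        (gender, (levels.sum.toDouble / levels.size, levels.size))
      }
}
```

With a made-up input of ("female", 10), ("female", 13), ("male", 7) and
("male", 0), this yields an average of 11.5 over 2 entries for "female" and
7.0 over 1 entry for "male", where integer division would have reported 11
for "female".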

Best of luck fishing through Facebook data!

MC



*Michael Cutler*
Founder, CTO


