Posted to user@spark.apache.org by Soumya Simanta <so...@gmail.com> on 2014/02/04 18:31:49 UTC

Reading Tweets (JSON) in a file into RDD Spark

I'm new to Spark.

I have a bunch of files in HDFS, each containing a bunch of tweets in JSON
format.
I want to read and parse these into an RDD so that I can do some interactive
processing on these tweets.

Has someone done something like this before? An example?

I thought I would ask before implementing one myself from scratch.

Thanks
-Soumya

Re: Reading Tweets (JSON) in a file into RDD Spark

Posted by Akhil Das <ak...@mobipulse.in>.
Yes, Soumya, the file contents are newline-separated (one tweet per line).

You can run that program in four steps (assuming you already have your
spark/hadoop up and running):

1. Copy the code and paste it as SimpleApp.scala
2. Create an sbt build file with all the dependencies, which is pasted below
3. Run *sbt package*
4. Then run *sbt run*

*simple.sbt*

name := "Simple Project"

version := "1.0"

scalaVersion := "2.9.3"

libraryDependencies += "org.apache.spark" %% "spark-streaming" % "0.8.0-incubating"

resolvers ++= Seq("Akka Repository" at "http://repo.akka.io/releases/",
  "Spray Repository" at "http://repo.spray.cc/")
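
The four steps above boil down to roughly this shell session (the project
directory name is an assumption):

```shell
# Assumed layout: an empty project directory containing just the two files.
mkdir simple-project && cd simple-project
# Step 1: save the Scala code from the earlier mail as SimpleApp.scala
# Step 2: save the build definition above as simple.sbt
# Step 3: compile and package into target/scala-2.9.3/simple-project_2.9.3-1.0.jar
sbt package
# Step 4: run the main class locally
sbt run
```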


-
AkhilDas
CodeBreach.in

   - in.linkedin.com/in/akhildas/

Re: Reading Tweets (JSON) in a file into RDD Spark

Posted by Soumya Simanta <so...@gmail.com>.
Thanks Akhil.

In the above example, are you assuming that there is one tweet per line
(i.e., that tweets are newline-separated)?

On an unrelated note, can you send some pointers on how to run this
standalone example? So far I have only played with the interactive
spark-shell and have yet to run a standalone Scala program in cluster mode.

On Tue, Feb 4, 2014 at 12:38 PM, Akhil Das <ak...@mobipulse.in> wrote:

> If those files aren't going to grow, then you can use the simple textFile
> and do all your processing.
> Sample code is below:
>
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkContext._
>
> object SimpleApp {
>
>   def main(args: Array[String]) {
>
>     val sc = new SparkContext("local", "Simple HDFS App",
>       "/home/akhld/mobi/spark-streaming/spark-0.8.0-incubating",
>       List("target/scala-2.9.3/simple-project_2.9.3-1.0.jar"))
>
>     val textFile = sc.textFile("hdfs://127.0.0.1:54310/akhld/tweet1.json")
>     textFile.take(10).foreach(println)
>   }
> }
>
> If they are growing, then I think you might want to use textFileStream or
> fileStream, which will take care of processing the new files.
>
>
> -
> AkhilDas
> CodeBreach.in
>
>    - in.linkedin.com/in/akhildas/
>
>

Re: Reading Tweets (JSON) in a file into RDD Spark

Posted by Akhil Das <ak...@mobipulse.in>.
If those files aren't going to grow, then you can use the simple textFile
and do all your processing.
Sample code is below:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SimpleApp {

  def main(args: Array[String]) {

    val sc = new SparkContext("local", "Simple HDFS App",
      "/home/akhld/mobi/spark-streaming/spark-0.8.0-incubating",
      List("target/scala-2.9.3/simple-project_2.9.3-1.0.jar"))

    val textFile = sc.textFile("hdfs://127.0.0.1:54310/akhld/tweet1.json")
    textFile.take(10).foreach(println)
  }
}
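
The snippet above only prints the raw JSON lines. To actually use the tweets
you would map a parser over each line. The sketch below uses a deliberately
naive, hypothetical string-scanning helper just to show the shape of that
per-line function; a real job should parse with a proper JSON library (e.g.
lift-json) instead:

```scala
// Sketch only: naively extract a top-level "text" field from one
// newline-separated JSON tweet. It assumes the exact form "text":"..."
// with no escaped quotes inside the value. In the Spark job you would
// apply it per line, e.g.:
//   val texts = textFile.flatMap(TweetParseSketch.extractText)
object TweetParseSketch {
  def extractText(line: String): Option[String] = {
    val key = "\"text\":\""
    val start = line.indexOf(key)
    if (start < 0) None
    else {
      val from = start + key.length
      val end = line.indexOf('"', from) // naive: ignores escaped quotes
      if (end < 0) None else Some(line.substring(from, end))
    }
  }

  def main(args: Array[String]) {
    val tweet = """{"id":1,"text":"hello spark","user":{"name":"a"}}"""
    println(extractText(tweet).getOrElse("<no text field>"))
  }
}
```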

If they are growing, then I think you might want to use textFileStream or
fileStream, which will take care of processing the new files.
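
For the growing-files case, a minimal sketch against the 0.8-era streaming
API might look like the following; the HDFS directory, batch interval, and
paths are assumptions, and it needs a running Spark/HDFS setup:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch, assuming Spark Streaming 0.8.0-incubating. textFileStream watches
// an HDFS directory and turns each newly created file into a batch of lines,
// one JSON tweet per line. Only files created after the stream starts are
// picked up.
object StreamingApp {
  def main(args: Array[String]) {
    val ssc = new StreamingContext("local", "Tweet Stream App", Seconds(10),
      "/home/akhld/mobi/spark-streaming/spark-0.8.0-incubating",
      List("target/scala-2.9.3/simple-project_2.9.3-1.0.jar"))

    val lines = ssc.textFileStream("hdfs://127.0.0.1:54310/akhld/tweets/")
    lines.print() // print the first few lines of each batch

    ssc.start()
  }
}
```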


-
AkhilDas
CodeBreach.in

   - in.linkedin.com/in/akhildas/