Posted to user@spark.apache.org by "nicholas.chammas" <ni...@gmail.com> on 2014/02/24 06:10:02 UTC

Having Spark read a JSON file

I'm new to this field, but it seems like most "Big Data" examples --
Spark's included -- begin with reading in flat lines of text from a file.

How would I go about having Spark turn a large JSON file into an RDD?

So the file would just be a text file that looks like this:

[{...}, {...}, ...]


where the individual JSON objects are arbitrarily complex (i.e. not
necessarily flat) and may or may not be on separate lines.

Basically, I'm guessing Spark would need to parse the JSON since it cannot
rely on newlines as a delimiter. That sounds like a costly thing.

Is JSON a "bad" format to have to deal with, or can Spark efficiently
ingest and work with data in this format? If it can, can I get a pointer as
to how I would do that?

Nick




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Having-Spark-read-a-JSON-file-tp1963.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Having Spark read a JSON file

Posted by Debasish Das <de...@gmail.com>.
Nick,

If you don't want to use Avro, Thrift, Protobuf, etc., use a library like
lift-json: write the JSON out as strings, read it back in as a text file, and
deserialize each line with lift-json... you can use standard separators like
comma, tab, etc...
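
For example, a rough sketch of that round trip with lift-json might look
something like this (the Event case class and the output path are made up,
just to show the shape of it):

import net.liftweb.json._
import org.apache.spark.rdd.RDD

// Made-up record type, just for illustration.
case class Event(id: String, count: Int)

object EventJson {
  // lift-json needs an implicit Formats in scope for extract/write.
  implicit val formats: Formats = DefaultFormats

  def toJson(e: Event): String = Serialization.write(e)
  def fromJson(line: String): Event = parse(line).extract[Event]
}

// Write one JSON object per line as plain text...
val events = sc.parallelize(Seq(Event("a", 1), Event("b", 2)))
events.map(EventJson.toJson).saveAsTextFile("events-json")

// ...then read it back and deserialize each line on the workers.
val parsed: RDD[Event] = sc.textFile("events-json").map(EventJson.fromJson)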

I am sure there are better ways to do it, but I am new to Spark as well...

Deb

Re: Having Spark read a JSON file

Posted by Paul Brown <pr...@mult.ifario.us>.
Hi, Nick --

Not that it adds legitimacy, but there is even a MIME type for line-delimited
JSON, application/x-ldjson (not to be confused with application/ld+json...).

What I said about ser/de in inline blocks only applies to the Scala dialect
of Spark when using Jackson; for example:

  // The ObjectMapper is built on the driver and captured by the map closure below.
  val om: ObjectMapper with ScalaObjectMapper =
    new ObjectMapper() with ScalaObjectMapper
  om.setPropertyNamingStrategy(PropertyNamingStrategy.CAMEL_CASE_TO_LOWER_CASE_WITH_UNDERSCORES)
  om.registerModule(DefaultScalaModule)
  om.registerModule(new JodaModule)
  val events: RDD[Event] = sc.textFile("foo.ldj").map(om.readValue[Event](_))

That would attempt to send the ObjectMapper instance over the wire, and as
configured, the instance isn't serializable.  Instead, you can wrap the
functionality in an object that exists on the worker side:

object Foo {
  // One ObjectMapper per JVM; the object is initialized on first use,
  // i.e. on the worker, so nothing is serialized from the driver.
  private val om = new ObjectMapper() with ScalaObjectMapper
  om.registerModule(DefaultScalaModule)

  def mapStuff(line: String): Event = om.readValue[Event](line)
}

and then in the job driver:

  val events: RDD[Event] = sc.textFile("foo.ldj").map(Foo.mapStuff)

There are probably analogs in the Python flavor as well, but IMHO things
like this are a nice object lesson (ho ho ho) about where code and data
live in a Spark system.  (/me waves hands about mobile code.)



—
prb@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/



Re: Having Spark read a JSON file

Posted by Nicholas Chammas <ni...@gmail.com>.
Thanks for the direction, Paul and Deb.

I'm currently reading in the data using sc.textFile() and Python's
json.loads(). It turns out that the big JSON data sources I care most about
happen to be structured so that there is one object per line, even though
the objects are correctly strung together in a JSON list.

Deb,

I get the files as JSON text, but they don't have to stay that way. Would
you recommend I somehow convert the files into another format, say Avro,
before handling them with Spark?

Paul,

When you say not to write your ser/de as inline blocks, could you provide a
simple example (even in pseudocode) to illustrate?

Nick



Re: Having Spark read a JSON file

Posted by Paul Brown <pr...@mult.ifario.us>.
JSON handling works great, although you have to be a little bit careful
with just what is loaded/used where.  One approach that works is:

- Jackson Scala 2.3.1 (or your favorite JSON lib) shipped as a JAR for the
job.
- Read data as RDD[String].
- Implement your per-line JSON binding in a method on an object, e.g.,
apply(...) for a companion object for a case class that models your line
items.  For the Jackson case, this would mean an ObjectMapper as a val in
the companion object (only need one ObjectMapper instance).
- .map(YourObject.apply) to get RDD[YourObject] (see the sketch below)
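
Concretely, a minimal sketch of those steps might look like the following
(the Event fields and the foo.ldj file name are made up for illustration, and
the plain readValue(String, Class) overload is used to keep it simple):

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.spark.rdd.RDD

// Case class modeling one line of the file (fields are made up).
case class Event(id: String, count: Int)

object Event {
  // One ObjectMapper per JVM; it lives in the companion object, so it is
  // instantiated wherever the code runs (i.e., on the workers) rather than
  // being serialized from the driver.
  private val mapper = new ObjectMapper()
  mapper.registerModule(DefaultScalaModule)

  def apply(line: String): Event = mapper.readValue(line, classOf[Event])
}

// One JSON object per line in, RDD[Event] out.
val events: RDD[Event] = sc.textFile("foo.ldj").map(line => Event(line))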

And there you go.  Something similar works for writing out JSON.

Probably obvious if you're a seasoned Spark user, but DO NOT write your
JSON serialization/deserialization as inline blocks, else you'll be
transporting your ObjectMapper instances around the cluster when you don't
need to (and depending on your specific configuration, it may not work).
 That is a facility that should (IMHO) be encapsulated with the pieces of
the system that directly touch the data, i.e., on the worker.


—
prb@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/

