Posted to user@spark.apache.org by Gavin Yue <yu...@gmail.com> on 2015/08/28 03:58:55 UTC

How to increase the Json parsing speed

Hey 

I am using the Json4s-Jackson parser that comes with Spark to parse roughly 80M records with a total size of 900MB.

But the speed is slow. It took my 50 nodes (16 CPU cores, 100GB memory each) roughly 30 minutes to parse the JSON for use with Spark SQL.

Jackson's own benchmarks suggest parsing should take only milliseconds.

Is there any way to increase the speed?

I am using Spark 1.4 on Hadoop 2.7 with Java 8.

Thanks a lot!


RE: How to increase the Json parsing speed

Posted by Ewan Leith <ew...@realitymine.com>.
Can you post roughly what you’re running as your Spark code? One issue I’ve seen before is that passing a directory full of files as a path “/path/to/files/” can be slow, while “/path/to/files/*” runs fast.
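For instance (hypothetical paths, and assuming you're loading through sqlContext):

    // Listing the directory can be slow; the glob often runs much faster.
    val viaDir  = sqlContext.read.json("/path/to/files/")
    val viaGlob = sqlContext.read.json("/path/to/files/*")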

Also, if you’ve not seen it, have a look at the binaryFiles call

http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext

which can be a really handy way of manipulating files without reading them all into memory first – it returns a PortableDataStream which you can then handle in your Java InputStreamReader code.
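As a rough, untested sketch (the path and the one-JSON-record-per-line framing are my assumptions):

    import java.io.{BufferedReader, InputStreamReader}
    import com.fasterxml.jackson.databind.ObjectMapper

    val parsed = sc.binaryFiles("/path/to/files/*").mapPartitions { files =>
      val mapper = new ObjectMapper() // one mapper per partition, not per record
      files.flatMap { case (_, stream) =>
        val br = new BufferedReader(new InputStreamReader(stream.open()))
        Iterator.continually(br.readLine()).takeWhile(_ != null)
          .map(line => mapper.readTree(line))
      }
    }

You'd still want to map the resulting JsonNodes into rows before handing anything to Spark SQL.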

Ewan






Re: How to increase the Json parsing speed

Posted by Gavin Yue <yu...@gmail.com>.
500 executors, each with 8GB memory.

I did the test again on the cluster.

I have 6000 files, which generates 6000 tasks. Each task takes 1.5 min to finish based on the stats.

So theoretically it should take roughly 15 mins; with some additional overhead, it takes 18 mins in total.

Based on the local file parsing test, simply parsing the JSON seems fast: it only takes 7 secs.

So I wonder where the additional 1 minute per task is coming from.

Thanks again.



Re: How to increase the Json parsing speed

Posted by Sabarish Sasidharan <sa...@manthan.com>.
How many executors are you using when running Spark SQL?


Re: How to increase the Json parsing speed

Posted by Sabarish Sasidharan <sa...@manthan.com>.
I see that you are not reusing the same mapper instance in the Scala snippet.
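For example, the same snippet with the mapper hoisted out of the loop, which should bring it close to the Java timing:

    import scala.io.Source
    import com.fasterxml.jackson.databind.ObjectMapper
    import com.fasterxml.jackson.module.scala.DefaultScalaModule

    // Build the mapper and register the Scala module once, then reuse it.
    val mapper = new ObjectMapper()
    mapper.registerModule(DefaultScalaModule)

    val begin = System.currentTimeMillis()
    for (line <- Source.fromFile("testfile").getLines()) {
      val s = mapper.readTree(line)
    }
    println(System.currentTimeMillis() - begin)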

Regards
Sab


Re: How to increase the Json parsing speed

Posted by Gavin Yue <yu...@gmail.com>.
Just did some tests.

I have 6000 files, each with 14K records; the total size is 900MB. In Spark
SQL, one task takes roughly 1 min to parse one file.

On my local machine, using the same Jackson library that ships with Spark, I
just parse a file:

            import java.io.BufferedReader;
            import java.io.FileInputStream;
            import java.io.InputStreamReader;
            import com.fasterxml.jackson.databind.JsonNode;
            import com.fasterxml.jackson.databind.ObjectMapper;

            // One mapper, built once and reused for every line.
            ObjectMapper mapper = new ObjectMapper();
            FileInputStream fstream = new FileInputStream("testfile");
            BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
            String strLine;
            long begin = System.currentTimeMillis();
            while ((strLine = br.readLine()) != null) {
                JsonNode s = mapper.readTree(strLine); // parse one JSON record per line
            }
            System.out.println(System.currentTimeMillis() - begin);
            br.close();

On JDK 8, it took 6270ms.

The same code in Scala takes 7486ms:
    import scala.io.Source
    import com.fasterxml.jackson.databind.ObjectMapper
    import com.fasterxml.jackson.module.scala.DefaultScalaModule

    val begin = System.currentTimeMillis()
    for (line <- Source.fromFile("testfile").getLines()) {
      // note: a new mapper is constructed, and the Scala module
      // registered, for every single line
      val mapper = new ObjectMapper()
      mapper.registerModule(DefaultScalaModule)
      val s = mapper.readTree(line)
    }
    println(System.currentTimeMillis() - begin)


One JSON record contains two fields: an ID and a List[Event].

I am guessing that building up the List of events takes the remaining time.

Any solution to speed this up?

Thanks a lot!



Re: How to increase the Json parsing speed

Posted by Sabarish Sasidharan <sa...@manthan.com>.
For your JSONs, can you tell us your benchmark when running on a single
machine using just plain Java (without Spark and Spark SQL)?

Regards
Sab

Re: How to increase the Json parsing speed

Posted by Ewan Higgs <ew...@ugent.be>.
Hi Gavin,

You can increase the speed by choosing a better encoding. A little bit of ETL goes a long way.

e.g. as you're working with Spark SQL, you probably have a tabular format, so
you could use CSV so you don't need to parse the field names on each entry
(and it will also reduce the file size). You should also check whether you
can put your files into Parquet or Avro.
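For example, a one-off conversion along these lines (untested, with placeholder paths) would let later jobs read a columnar copy instead of re-parsing the JSON every time:

    // Assumes Spark 1.4's sqlContext; paths are placeholders.
    val df = sqlContext.read.json("hdfs:///path/to/json/*")
    df.write.parquet("hdfs:///path/to/parquet")

    // Subsequent queries hit the Parquet copy directly:
    val events = sqlContext.read.parquet("hdfs:///path/to/parquet")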

Yours,
Ewan


