Posted to user@spark.apache.org by Abel Coronado Iruegas <ac...@gmail.com> on 2014/07/04 16:49:08 UTC

SQL Filter of tweets (json) running on Disk

Hi everybody

Can someone tell me whether it is possible to read and filter a 60 GB file
of tweets (JSON docs) in a standalone Spark deployment running on a single
machine with 40 GB RAM and 8 cores?

I mean, is it possible to configure Spark to work with a limited amount of
memory (say 20 GB) and do the rest of the processing on disk, avoiding
OutOfMemory exceptions?

Regards

Abel
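
A minimal sketch of the kind of setup being asked about, assuming the Spark
1.0-era Scala API, a hypothetical standalone master URL, and a hypothetical
input path; MEMORY_AND_DISK persistence keeps what fits in RAM and spills the
rest to local disk instead of throwing OutOfMemory:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical master URL and path; spark.executor.memory caps each
// executor's heap so Spark uses roughly 20 GB of the machine's 40 GB.
val conf = new SparkConf()
  .setAppName("TweetFilter")
  .setMaster("spark://localhost:7077")
  .set("spark.executor.memory", "20g")
val sc = new SparkContext(conf)

// MEMORY_AND_DISK keeps the partitions that fit in RAM and spills the
// rest to local disk, so the 60 GB file never has to fit in memory at once.
val tweets = sc.textFile("/data/tweets.json")
  .persist(StorageLevel.MEMORY_AND_DISK)

println(tweets.filter(_.contains("\"lang\":\"es\"")).count())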

Re: SQL Filter of tweets (json) running on Disk

Posted by Abel Coronado Iruegas <ac...@gmail.com>.
Thank you, Databricks rules!!



On Fri, Jul 4, 2014 at 1:58 PM, Michael Armbrust <mi...@databricks.com>
wrote:

> sqlContext.jsonFile("data.json")  <---- Is this already available in the
>> master branch?
>>
>
> Yes, and it will be available in the soon to come 1.0.1 release.
>
>
>> But the question about using a combination of resources (memory
>> processing & disk processing) still remains.
>>
>
> This code should work just fine off of disk.  I would not recommend trying
> to cache the JSON data in memory, as it is heavily nested and that is a
> case where the columnar storage code does not do well.  Instead, maybe
> try converting it to Parquet and reading that data from disk
> (tweets.saveAsParquetFile(...);
>  sqlContext.parquetFile(...).registerAsTable(...)).  You should see improved
> compression and much better performance for queries that only read some of
> the columns.  You could also just pull out the relevant columns and cache
> only that data in memory:
>
> sqlContext.jsonFile("data.json").registerAsTable("allTweets")
> sql("SELECT text FROM allTweets").registerAsTable("tweetText")
> sqlContext.cacheTable("tweetText")
>

Re: SQL Filter of tweets (json) running on Disk

Posted by Michael Armbrust <mi...@databricks.com>.
>
> sqlContext.jsonFile("data.json")  <---- Is this already available in the
> master branch?
>

Yes, and it will be available in the soon to come 1.0.1 release.


> But the question about using a combination of resources (memory
> processing & disk processing) still remains.
>

This code should work just fine off of disk.  I would not recommend trying
to cache the JSON data in memory, as it is heavily nested and that is a
case where the columnar storage code does not do well.  Instead, maybe
try converting it to Parquet and reading that data from disk
(tweets.saveAsParquetFile(...);
 sqlContext.parquetFile(...).registerAsTable(...)).  You should see improved
compression and much better performance for queries that only read some of
the columns.  You could also just pull out the relevant columns and cache
only that data in memory:

sqlContext.jsonFile("data.json").registerAsTable("allTweets")
sql("SELECT text FROM allTweets").registerAsTable("tweetText")
sqlContext.cacheTable("tweetText")
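
For reference, a fuller sketch of the Parquet route described above, assuming
the Spark 1.0.1 Scala API, an already-created SparkContext sc, and hypothetical
paths and column names (text, lang):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// One pass over the JSON writes a columnar Parquet copy to disk.
val tweets = sqlContext.jsonFile("/data/tweets.json")
tweets.saveAsParquetFile("/data/tweets.parquet")

// Later queries read only the columns they touch from the Parquet files.
sqlContext.parquetFile("/data/tweets.parquet").registerAsTable("tweets")
val spanishTexts =
  sqlContext.sql("SELECT text FROM tweets WHERE lang = 'es'")
spanishTexts.take(10).foreach(println)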

Re: SQL Filter of tweets (json) running on Disk

Posted by Abel Coronado Iruegas <ac...@gmail.com>.
OK, I found these slides by Yin Huai (
http://spark-summit.org/wp-content/uploads/2014/07/Easy-json-Data-Manipulation-Yin-Huai.pdf
).

To read a JSON file, the code seems pretty simple:

sqlContext.jsonFile("data.json")  <---- Is this already available in the
master branch?

But the question about using a combination of resources (memory
processing & disk processing) still remains.

Thanks !!
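
A minimal sketch of the straight-off-disk approach recommended in Michael's
reply above, assuming the Spark 1.0.1 Scala API, an already-created
SparkContext sc, and hypothetical paths and column names; nothing is cached,
so each query streams the JSON from disk rather than holding 60 GB in memory:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Register the raw JSON file as a table; no cacheTable call, so the data
// stays on disk and is scanned per query.
sqlContext.jsonFile("/data/tweets.json").registerAsTable("allTweets")

// Filter with SQL and write the matching tweets back out to disk.
val filtered = sqlContext.sql("SELECT text FROM allTweets WHERE lang = 'es'")
filtered.saveAsTextFile("/data/filtered_tweets")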



On Fri, Jul 4, 2014 at 9:49 AM, Abel Coronado Iruegas <
acoronadoiruegas@gmail.com> wrote:

> Hi everybody
>
> Can someone tell me whether it is possible to read and filter a 60 GB file
> of tweets (JSON docs) in a standalone Spark deployment running on a single
> machine with 40 GB RAM and 8 cores?
>
> I mean, is it possible to configure Spark to work with a limited amount of
> memory (say 20 GB) and do the rest of the processing on disk, avoiding
> OutOfMemory exceptions?
>
> Regards
>
> Abel
>