Posted to user@spark.apache.org by Sa...@wellsfargo.com on 2016/02/22 16:25:01 UTC

Can we load csv partitioned data into one DF?

Hello all, I am facing a silly data question.

If I have 100+ CSV files that belong to the same dataset, but each file covers, for example, one year of a timeframe column (i.e. the data is partitioned by year),
what would you suggest instead of loading all those files and joining them?

The final target is Parquet. Is it possible, for example, to load the files, store them as Parquet, and then read the Parquet back and treat it all as one?

Thanks for any suggestions,
Saif
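
A minimal sketch of the round trip described above, assuming Spark 1.x with the Databricks spark-csv package; the paths and the year column are hypothetical:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Load every yearly CSV under one directory (or glob) into a single DataFrame.
val csv = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/data/csv/*.csv")

// Write once as Parquet, partitioned on a year column (assumed to exist,
// e.g. derived from the timeframe column).
csv.write.partitionBy("year").parquet("/data/parquet/mydata")

// Read the Parquet directory back as one DataFrame; Spark discovers the
// year=... subdirectories and can prune them when filtering on year.
val all = sqlContext.read.parquet("/data/parquet/mydata")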


RE: Can we load csv partitioned data into one DF?

Posted by Mohammed Guller <mo...@glassbeam.com>.
Are all the csv files in the same directory?

Mohammed
Author: Big Data Analytics with Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Saif.A.Ellafi@wellsfargo.com [mailto:Saif.A.Ellafi@wellsfargo.com]
Sent: Monday, February 22, 2016 7:25 AM
To: user@spark.apache.org
Subject: Can we load csv partitioned data into one DF?

Hello all, I am facing a silly data question.

If I have 100+ CSV files that belong to the same dataset, but each file covers, for example, one year of a timeframe column (i.e. the data is partitioned by year),
what would you suggest instead of loading all those files and joining them?

The final target is Parquet. Is it possible, for example, to load the files, store them as Parquet, and then read the Parquet back and treat it all as one?

Thanks for any suggestions,
Saif


Re: Can we load csv partitioned data into one DF?

Posted by Mich Talebzadeh <mi...@cloudtechnologypartners.co.uk>.
 

Indeed this will work. Additionally, the files can be compressed
(gz or bzip2)

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("/data/stg")

On 22/02/2016 15:32, Alex Dzhagriev wrote: 

> Hi Saif, 
> 
> You can put your files into one directory and read it as text. Another option is to read them separately and then union the datasets. 
> 
> Thanks, Alex. 
> 
> On Mon, Feb 22, 2016 at 4:25 PM, <Sa...@wellsfargo.com> wrote:
> 
>> Hello all, I am facing a silly data question. 
>> 
>> If I have 100+ CSV files that belong to the same dataset, but each file covers, for example, one year of a timeframe column (i.e. the data is partitioned by year),
>> what would you suggest instead of loading all those files and joining them?
>>
>> The final target is Parquet. Is it possible, for example, to load the files, store them as Parquet, and then read the Parquet back and treat it all as one?
>> 
>> Thanks for any suggestions, 
>> Saif

-- 

Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


 

Re: Can we load csv partitioned data into one DF?

Posted by Alex Dzhagriev <dz...@gmail.com>.
Hi Saif,

You can put your files into one directory and read it as text. Another
option is to read them separately and then union the datasets.
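
A sketch of the second option (reading the files separately and unioning them), assuming Spark 1.x DataFrames with spark-csv; the per-year file names are hypothetical:

// Read each yearly file, then combine them with unionAll.
// unionAll in Spark 1.x matches columns by position, so all files
// must share the same schema (and inferSchema must agree across them).
val years = 2000 to 2015
val perYear = years.map { y =>
  sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(s"/data/csv/data_$y.csv")
}
val combined = perYear.reduce(_ unionAll _)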

Thanks, Alex.

On Mon, Feb 22, 2016 at 4:25 PM, <Sa...@wellsfargo.com> wrote:

> Hello all, I am facing a silly data question.
>
> If I have 100+ CSV files that belong to the same dataset, but each file
> covers, for example, one year of a timeframe column (i.e. the data is partitioned by year),
> what would you suggest instead of loading all those files and joining them?
>
> The final target is Parquet. Is it possible, for example, to load the files,
> store them as Parquet, and then read the Parquet back and treat it all as
> one?
>
> Thanks for any suggestions,
> Saif
>
>