Posted to user@spark.apache.org by haridass saisriram <ha...@gmail.com> on 2015/10/01 23:11:08 UTC

SparkSQL: Reading data from hdfs and storing into multiple paths

Hi,

  I am trying to find a simple example of reading a data file on HDFS. The
file has the following format:
a , b  , c ,yyyy,mm
a1,b1,c1,2015,09
a2,b2,c2,2014,08


I would like to read this file and store it in HDFS partitioned by year and
month, something like this:
/path/to/hdfs/yyyy/mm

I want to specify only the "/path/to/hdfs/" prefix; the yyyy/mm part should be
populated automatically from those columns. Could someone point me in the
right direction?

Thank you,
Sri Ram

Re: SparkSQL: Reading data from hdfs and storing into multiple paths

Posted by Michael Armbrust <mi...@databricks.com>.

Once you convert your data to a DataFrame (look at spark-csv), try
df.write.partitionBy("yyyy", "mm").save("...").
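
A minimal PySpark sketch of that suggestion, under a few assumptions that are
not in the thread: Spark 2.0 or later, where the CSV reader is built in (on
Spark 1.x the spark-csv package with format "com.databricks.spark.csv" would be
needed instead), Parquet as the output format, and hypothetical helper names
partition_dir and write_partitioned. Note that partitionBy writes Hive-style
directories such as /path/to/hdfs/yyyy=2015/mm=09 rather than a bare
/path/to/hdfs/2015/09.

```python
def partition_dir(base, year, month):
    """Directory that partitionBy("yyyy", "mm") uses for one (year, month) pair.

    Illustration only: Spark builds these Hive-style paths itself.
    """
    return "{}/yyyy={}/mm={}".format(base.rstrip("/"), year, month)


def write_partitioned(spark, src, dest):
    """Read a headered CSV and write it under dest, partitioned by yyyy and mm.

    `spark` is an existing SparkSession; `src` and `dest` are placeholder paths.
    """
    # header=True skips the first line as a header. Schema inference is off by
    # default, so every value stays a string and month "09" keeps its zero.
    df = spark.read.csv(src, header=True)

    # The sample header has stray spaces around the names ("a ", " b  ", ...),
    # so assign clean column names explicitly before partitioning on them.
    df = df.toDF("a", "b", "c", "yyyy", "mm")

    # One subdirectory is created per distinct (yyyy, mm) combination.
    df.write.partitionBy("yyyy", "mm").format("parquet").save(dest)
```

With a SparkSession named spark, this would be called as
write_partitioned(spark, "hdfs:///path/to/input.csv", "hdfs:///path/to/hdfs").
The partition columns are encoded in the directory names and are not repeated
inside the data files.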
