Posted to user@spark.apache.org by Divya Gehlot <di...@gmail.com> on 2016/02/19 11:14:34 UTC

Read files dynamically having different schemas under one parent directory + Scala + Spark 1.5.2

Hi,
I have a use case where I have one parent directory.

The file structure looks like:
hdfs:///TestDirectory/spark1/ part files (created by some spark job)
hdfs:///TestDirectory/spark2/ part files (created by some spark job)

spark1 and spark2 have different schemas.

For example, the spark1 part files schema:
carname model year

and the spark2 part files schema:
carowner city carcost


As the spark1 and spark2 directories get created dynamically,
there can also be a spark3 directory with a different schema.

My requirement is to read the parent directory, list its subdirectories,
and create a dataframe for each subdirectory.

I am not able to figure out how to list the subdirectories under the parent directory and
dynamically create dataframes.

Thanks,
Divya

Re: Read files dynamically having different schemas under one parent directory + Scala + Spark 1.5.2

Posted by Chandeep Singh <cs...@chandeep.com>.
Here is how you can list all HDFS directories for a given path.

// connect to HDFS via the Hadoop FileSystem API
val hadoopConf = new org.apache.hadoop.conf.Configuration()
val hdfsConn = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("hdfs://<Your NN Hostname>:8020"), hadoopConf)
// listStatus returns a FileStatus for every entry directly under the given path
val c = hdfsConn.listStatus(new org.apache.hadoop.fs.Path("/user/csingh/"))
c.foreach(x => println(x.getPath))

Output:
hdfs://<NN hostname>/user/csingh/.Trash
hdfs://<NN hostname>/user/csingh/.sparkStaging
hdfs://<NN hostname>/user/csingh/.staging
hdfs://<NN hostname>/user/csingh/test1
hdfs://<NN hostname>/user/csingh/test2
hdfs://<NN hostname>/user/csingh/tmp
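
To go one step further toward the original question, here is a sketch that keeps only the sub-directories and reads each one with the spark-csv package (assumptions: Spark 1.5.2 with spark-csv on the classpath, a sqlContext available as in spark-shell, and the parent path below is only illustrative):

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

val parent = "hdfs://<Your NN Hostname>:8020/TestDirectory/"
val fs = FileSystem.get(new URI(parent), new org.apache.hadoop.conf.Configuration())

// keep only the entries under the parent path that are directories
val subDirs = fs.listStatus(new Path(parent)).filter(_.isDirectory).map(_.getPath.toString)

// one DataFrame per sub-directory, keyed by its path; a per-directory custom
// schema can be supplied via .schema(...) before .load(...)
val dfs = subDirs.map { dir =>
  dir -> sqlContext.read
    .format("com.databricks.spark.csv")
    .load(dir)
}.toMap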


> On Feb 20, 2016, at 2:37 PM, Divya Gehlot <di...@gmail.com> wrote:
> 
> Hi,
> @Umesh: Your understanding is partially correct as per my requirement.
> The idea which I am trying to implement is as follows.
> Steps which I am trying to follow
> (not sure how feasible this is; I am a newbie to Spark and Scala):
> 1. List all the sub-directories under the parent directory
>    hdfs:///Testdirectory/
> as a list,
> for example: val listsubdirs = (subdir1, subdir2 ... subdir.n)
> 2. Iterate through this list:
> for (subdir <- listsubdirs) {
>   val df = "df" + subdir
>   // df = read it using the spark-csv package with a custom schema
> }
> This will give dataframes equal to the number of subdirs.
>
> Now I am stuck at the first step itself.
> How do I list the directories and put them in a list?
> 
> Hope you understood my issue now.
> Thanks,
> Divya 
> On Feb 19, 2016 6:54 PM, "UMESH CHAUDHARY" <umesh9794@gmail.com <ma...@gmail.com>> wrote:
> If I understood correctly, you can have many sub-dirs under hdfs:///TestDirectory and you need to attach a schema to all part files in a sub-dir.
>
> 1) Assuming that you know the sub-dir names:
>
>     You need to list all sub-dirs inside hdfs:///TestDirectory using Scala, then for each sub-dir in the list
>     read the part files and attach the schema corresponding to that sub-directory.
>
> 2) If you don't know the sub-directory names:
>     You need to store the schema somewhere inside each sub-directory and read it in each iteration.
> 
> On Fri, Feb 19, 2016 at 3:44 PM, Divya Gehlot <divya.htconex@gmail.com <ma...@gmail.com>> wrote:
> Hi,
> I have a use case where I have one parent directory.
>
> The file structure looks like:
> hdfs:///TestDirectory/spark1/ part files (created by some spark job)
> hdfs:///TestDirectory/spark2/ part files (created by some spark job)
>
> spark1 and spark2 have different schemas.
>
> For example, the spark1 part files schema:
> carname model year
>
> and the spark2 part files schema:
> carowner city carcost
>
>
> As the spark1 and spark2 directories get created dynamically,
> there can also be a spark3 directory with a different schema.
>
> My requirement is to read the parent directory, list its subdirectories,
> and create a dataframe for each subdirectory.
>
> I am not able to figure out how to list the subdirectories under the parent directory and dynamically create dataframes.
> 
> Thanks,
> Divya 
> 
> 
> 
> 
> 


Re: Read files dynamically having different schemas under one parent directory + Scala + Spark 1.5.2

Posted by Divya Gehlot <di...@gmail.com>.
Hi,
@Umesh: Your understanding is partially correct as per my requirement.
The idea which I am trying to implement is as follows.
Steps which I am trying to follow
(not sure how feasible this is; I am a newbie to Spark and Scala):
1. List all the sub-directories under the parent directory
   hdfs:///Testdirectory/
as a list,
for example: val listsubdirs = (subdir1, subdir2 ... subdir.n)
2. Iterate through this list:
for (subdir <- listsubdirs) {
  val df = "df" + subdir
  // df = read it using the spark-csv package with a custom schema
}
This will give dataframes equal to the number of subdirs.

Now I am stuck at the first step itself.
How do I list the directories and put them in a list?

Hope you understood my issue now.
Thanks,
Divya
On Feb 19, 2016 6:54 PM, "UMESH CHAUDHARY" <um...@gmail.com> wrote:

> If I understood correctly, you can have many sub-dirs under hdfs:///TestDirectory
> and you need to attach a schema to all part files in a sub-dir.
>
> 1) Assuming that you know the sub-dir names:
>
>     You need to list all sub-dirs inside hdfs:///TestDirectory using Scala,
>     then for each sub-dir in the list
>     read the part files and attach the schema corresponding to that
>     sub-directory.
>
> 2) If you don't know the sub-directory names:
>     You need to store the schema somewhere inside each sub-directory and read
>     it in each iteration.
>
> On Fri, Feb 19, 2016 at 3:44 PM, Divya Gehlot <di...@gmail.com>
> wrote:
>
>> Hi,
>> I have a use case where I have one parent directory.
>>
>> The file structure looks like:
>> hdfs:///TestDirectory/spark1/ part files (created by some spark job)
>> hdfs:///TestDirectory/spark2/ part files (created by some spark job)
>>
>> spark1 and spark2 have different schemas.
>>
>> For example, the spark1 part files schema:
>> carname model year
>>
>> and the spark2 part files schema:
>> carowner city carcost
>>
>>
>> As the spark1 and spark2 directories get created dynamically,
>> there can also be a spark3 directory with a different schema.
>>
>> My requirement is to read the parent directory, list its subdirectories,
>> and create a dataframe for each subdirectory.
>>
>> I am not able to figure out how to list the subdirectories under the parent directory
>> and dynamically create dataframes.
>>
>> Thanks,
>> Divya
>>
>>
>>
>>
>>
>

Re: Read files dynamically having different schemas under one parent directory + Scala + Spark 1.5.2

Posted by UMESH CHAUDHARY <um...@gmail.com>.
If I understood correctly, you can have many sub-dirs under hdfs:///TestDirectory
and you need to attach a schema to all part files in a sub-dir.

1) Assuming that you know the sub-dir names:

    You need to list all sub-dirs inside hdfs:///TestDirectory using Scala,
    then for each sub-dir in the list
    read the part files and attach the schema corresponding to that
    sub-directory (a sketch of this follows below).

2) If you don't know the sub-directory names:
    You need to store the schema somewhere inside each sub-directory and read
    it in each iteration.
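
A sketch of option 1 in Scala (assuming Spark 1.5.2 with the spark-csv package and a sqlContext as in spark-shell; the sub-dir names and column names come from the example in the original mail, while the column types and the exact parent path are assumptions):

import org.apache.spark.sql.types._

// one custom schema per known sub-directory (types are assumed)
val schemas = Map(
  "spark1" -> StructType(Seq(
    StructField("carname", StringType),
    StructField("model", StringType),
    StructField("year", IntegerType))),
  "spark2" -> StructType(Seq(
    StructField("carowner", StringType),
    StructField("city", StringType),
    StructField("carcost", DoubleType))))

// read each sub-directory's part files with its own schema
val dfs = schemas.map { case (subdir, schema) =>
  subdir -> sqlContext.read
    .format("com.databricks.spark.csv")
    .schema(schema)
    .load(s"hdfs:///TestDirectory/$subdir/")
}

For option 2, one possibility would be to keep a small schema file (for example a JSON list of column names) inside each sub-directory and build the StructType from it before loading; the storage format for that file is entirely up to you.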

On Fri, Feb 19, 2016 at 3:44 PM, Divya Gehlot <di...@gmail.com>
wrote:

> Hi,
> I have a use case where I have one parent directory.
>
> The file structure looks like:
> hdfs:///TestDirectory/spark1/ part files (created by some spark job)
> hdfs:///TestDirectory/spark2/ part files (created by some spark job)
>
> spark1 and spark2 have different schemas.
>
> For example, the spark1 part files schema:
> carname model year
>
> and the spark2 part files schema:
> carowner city carcost
>
>
> As the spark1 and spark2 directories get created dynamically,
> there can also be a spark3 directory with a different schema.
>
> My requirement is to read the parent directory, list its subdirectories,
> and create a dataframe for each subdirectory.
>
> I am not able to figure out how to list the subdirectories under the parent directory
> and dynamically create dataframes.
>
> Thanks,
> Divya
>
>
>
>
>