Posted to user@spark.apache.org by Divya Gehlot <di...@gmail.com> on 2016/02/19 11:14:34 UTC
Read files dynamically having different schema under one parent
directory + scala + Spark 1.5.2
Hi,
I have a use case where I have one parent directory.
The file structure looks like:
hdfs:///TestDirectory/spark1/ part files (created by some Spark job)
hdfs:///TestDirectory/spark2/ part files (created by some Spark job)
spark1 and spark2 have different schemas.
For example, the spark1 part files schema is:
carname model year
and the spark2 part files schema is:
carowner city carcost
As these spark1 and spark2 directories get created dynamically,
there can also be a spark3 directory with yet another schema.
My requirement is to read the parent directory, list its subdirectories,
and create a dataframe for each subdirectory.
I cannot work out how to list the subdirectories under the parent
directory and dynamically create the dataframes.
Thanks,
Divya
Re: Read files dynamically having different schema under one parent directory + scala + Spark 1.5.2
Posted by Chandeep Singh <cs...@chandeep.com>.
Here is how you can list all HDFS directories for a given path.
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val hadoopConf = new Configuration()
// Connect via the NameNode; replace the hostname with your own.
val hdfsConn = FileSystem.get(new URI("hdfs://<Your NN Hostname>:8020"), hadoopConf)
hdfsConn.listStatus(new Path("/user/csingh/")).foreach(s => println(s.getPath))
Output:
hdfs://<NN hostname>/user/csingh/.Trash
hdfs://<NN hostname>/user/csingh/.sparkStaging
hdfs://<NN hostname>/user/csingh/.staging
hdfs://<NN hostname>/user/csingh/test1
hdfs://<NN hostname>/user/csingh/test2
hdfs://<NN hostname>/user/csingh/tmp
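Building on the listing above, here is a minimal sketch of the full flow Divya describes. This is untested and assumes a spark-shell session (so `sc` is available), the spark-csv package on the classpath, and schema inference in place of the custom schemas she mentions:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{DataFrame, SQLContext}

val sqlContext = new SQLContext(sc)
val fs = FileSystem.get(sc.hadoopConfiguration)

val parentDir = new Path("hdfs:///TestDirectory/")

// Keep only sub-directories, skipping any plain files at the top level.
val subDirs = fs.listStatus(parentDir).filter(_.isDirectory).map(_.getPath)

// One DataFrame per sub-directory, keyed by the directory name.
// spark-csv infers each schema independently per directory.
val dfs: Map[String, DataFrame] = subDirs.map { dir =>
  dir.getName -> sqlContext.read
    .format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .load(dir.toString)
}.toMap
```

If the schemas are known in advance, `.schema(...)` with a per-directory StructType would replace the inferSchema option.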
Re: Read files dynamically having different schema under one parent
directory + scala + Spark 1.5.2
Posted by Divya Gehlot <di...@gmail.com>.
Hi,
@Umesh: Your understanding is partially correct as per my requirement.
The idea I am trying to implement, step by step
(not sure how feasible it is; I am a newbie to Spark and Scala):
1. List all the files under the parent directory
hdfs:///Testdirectory/
as a list,
for example: val listsubdirs = (subdir1, subdir2 ... subdir.n)
2. Iterate through this list:
for (subdir <- listsubdirs) {
  val df = "df" + subdir
  df = read it using the spark-csv package with a custom schema
}
This will give as many dataframes as there are subdirs.
Now I am stuck at the first step itself:
how do I list the directories and put them in a list?
Hope you understood my issue now.
Thanks,
Divya
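One note on the pseudocode above: Scala cannot create variable names like "df" + subdir at runtime; a Map from sub-directory name to DataFrame achieves the same goal. A minimal sketch, where readSubdir is a hypothetical stand-in for the spark-csv read with the custom schema:

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical reader for one sub-directory; in practice this would be
// sqlContext.read ... .schema(customSchema).load(subdir).
def readSubdir(subdir: String): DataFrame = ???

val listsubdirs = Seq("subdir1", "subdir2")

// Instead of dynamic variable names, key each DataFrame by its sub-dir.
val dfs: Map[String, DataFrame] =
  listsubdirs.map(d => d -> readSubdir(d)).toMap
```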
Re: Read files dynamically having different schema under one parent
directory + scala + Spark 1.5.2
Posted by UMESH CHAUDHARY <um...@gmail.com>.
If I understood correctly, you can have many sub-dirs under
hdfs:///TestDirectory and you need to attach a schema to all part files
in each sub-dir.
1) If you know the sub-dir names:
List all sub-dirs inside hdfs:///TestDirectory using Scala, then iterate
over them:
for each sub-dir in the list,
read the part files, identify the schema for that sub-directory, and
attach it.
2) If you don't know the sub-directory names:
You need to store the schema somewhere inside each sub-directory and read
it during the iteration.
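One possible sketch of option 2: store each schema as its JSON form (StructType#json) in a small file inside the sub-directory and rebuild it while iterating. The file name _schema.json and the use of `sc` in a spark-shell session are assumptions, not an established convention:

```scala
import org.apache.spark.sql.types.{DataType, StructType}

// Assumes each sub-dir holds a _schema.json file written earlier
// via schema.json (the JSON representation of a StructType).
def schemaFor(subdir: String): StructType = {
  val json = sc.textFile(subdir + "/_schema.json").collect().mkString
  DataType.fromJson(json).asInstanceOf[StructType]
}
```

The recovered StructType can then be passed to the spark-csv reader via .schema(...) for that sub-directory.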