Posted to user@spark.apache.org by Masf <ma...@gmail.com> on 2015/03/11 17:15:13 UTC

Read parquet folders recursively

Hi all

Is it possible to read folders recursively in order to read Parquet files?


Thanks.

-- 


Regards,
Miguel Ángel

Re: Read parquet folders recursively

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
With fileStream you are free to plug in any InputFormat; in your case, you
can easily plug in ParquetInputFormat. Here are some parquet-hadoop examples:
<https://github.com/Parquet/parquet-mr/tree/master/parquet-hadoop/src/main/java/parquet/hadoop/example>
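
If you only need a batch read rather than a stream, you could also point
ParquetInputFormat at the base directory through newAPIHadoopFile. A rough
sketch (sc is the SparkContext, the /datadumps path is only an example, and
whether the recursive-listing flag is honoured depends on your Hadoop and
parquet-mr versions):

import org.apache.hadoop.mapreduce.Job
import parquet.hadoop.ParquetInputFormat
import parquet.hadoop.example.GroupReadSupport
import parquet.example.data.Group

val job = Job.getInstance(sc.hadoopConfiguration)
// GroupReadSupport is the simple object model from the parquet-hadoop examples above
ParquetInputFormat.setReadSupportClass(job, classOf[GroupReadSupport])
// Ask FileInputFormat to descend into sub-directories (Hadoop 2.x property)
job.getConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

val records = sc.newAPIHadoopFile(
  "/datadumps",                        // base directory, illustrative only
  classOf[ParquetInputFormat[Group]],
  classOf[Void],
  classOf[Group],
  job.getConfiguration)

records.map(_._2.toString).take(5).foreach(println)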

Thanks
Best Regards

On Thu, Mar 12, 2015 at 5:51 PM, Masf <ma...@gmail.com> wrote:

> Hi.
>
> Thanks for your answers, but isn't it necessary to use the parquetFile
> method in org.apache.spark.sql.SQLContext to read Parquet files?
>
> How can I combine your solution with a call to this method?
>
> Thanks!!
> Regards
>
> On Thu, Mar 12, 2015 at 8:34 AM, Yijie Shen <he...@gmail.com>
> wrote:
>
>> org.apache.spark.deploy.SparkHadoopUtil has a method:
>>
>> /**
>>    * Get [[FileStatus]] objects for all leaf children (files) under the
>> given base path. If the
>>    * given path points to a file, return a single-element collection
>> containing [[FileStatus]] of
>>    * that file.
>>    */
>>   def listLeafStatuses(fs: FileSystem, basePath: Path): Seq[FileStatus] =
>> {
>>     def recurse(path: Path) = {
>>       val (directories, leaves) = fs.listStatus(path).partition(_.isDir)
>>       leaves ++ directories.flatMap(f => listLeafStatuses(fs, f.getPath))
>>     }
>>
>>     val baseStatus = fs.getFileStatus(basePath)
>>     if (baseStatus.isDir) recurse(basePath) else Array(baseStatus)
>>   }
>>
>> —
>> Best Regards!
>> Yijie Shen
>>
>> On March 12, 2015 at 2:35:49 PM, Akhil Das (akhil@sigmoidanalytics.com)
>> wrote:
>>
>>  Hi
>>
>> We have a custom build that reads directories recursively; currently we
>> use it with fileStream like this:
>>
>>  val lines = ssc.fileStream[LongWritable, Text,
>> TextInputFormat]("/datadumps/",
>>       (t: Path) => true, true, true)
>>
>>
>> Making the 4th argument true to read recursively.
>>
>>
>> You could give it a try
>> https://s3.amazonaws.com/sigmoidanalytics-builds/spark-1.2.0-bin-spark-1.2.0-hadoop2.4.0.tgz
>>
>>  Thanks
>> Best Regards
>>
>> On Wed, Mar 11, 2015 at 9:45 PM, Masf <ma...@gmail.com> wrote:
>>
>>> Hi all
>>>
>>> Is it possible to read folders recursively in order to read Parquet files?
>>>
>>>
>>> Thanks.
>>>
>>> --
>>>
>>>
>>> Regards,
>>> Miguel Ángel
>>>
>>
>>
>
>
> --
>
>
> Regards,
> Miguel Ángel
>

Re: Read parquet folders recursively

Posted by Masf <ma...@gmail.com>.
Hi.

Thanks for your answers, but isn't it necessary to use the parquetFile
method in org.apache.spark.sql.SQLContext to read Parquet files?

How can I combine your solution with a call to this method?

Thanks!!
Regards

On Thu, Mar 12, 2015 at 8:34 AM, Yijie Shen <he...@gmail.com>
wrote:

> org.apache.spark.deploy.SparkHadoopUtil has a method:
>
> /**
>    * Get [[FileStatus]] objects for all leaf children (files) under the
> given base path. If the
>    * given path points to a file, return a single-element collection
> containing [[FileStatus]] of
>    * that file.
>    */
>   def listLeafStatuses(fs: FileSystem, basePath: Path): Seq[FileStatus] = {
>     def recurse(path: Path) = {
>       val (directories, leaves) = fs.listStatus(path).partition(_.isDir)
>       leaves ++ directories.flatMap(f => listLeafStatuses(fs, f.getPath))
>     }
>
>     val baseStatus = fs.getFileStatus(basePath)
>     if (baseStatus.isDir) recurse(basePath) else Array(baseStatus)
>   }
>
> —
> Best Regards!
> Yijie Shen
>
> On March 12, 2015 at 2:35:49 PM, Akhil Das (akhil@sigmoidanalytics.com)
> wrote:
>
>  Hi
>
> We have a custom build that reads directories recursively; currently we
> use it with fileStream like this:
>
>  val lines = ssc.fileStream[LongWritable, Text,
> TextInputFormat]("/datadumps/",
>       (t: Path) => true, true, true)
>
>
> Making the 4th argument true to read recursively.
>
>
> You could give it a try
> https://s3.amazonaws.com/sigmoidanalytics-builds/spark-1.2.0-bin-spark-1.2.0-hadoop2.4.0.tgz
>
>  Thanks
> Best Regards
>
> On Wed, Mar 11, 2015 at 9:45 PM, Masf <ma...@gmail.com> wrote:
>
>> Hi all
>>
>> Is it possible to read folders recursively in order to read Parquet files?
>>
>>
>> Thanks.
>>
>> --
>>
>>
>> Regards,
>> Miguel Ángel
>>
>
>


-- 


Regards,
Miguel Ángel

Re: Read parquet folders recursively

Posted by Yijie Shen <he...@gmail.com>.
org.apache.spark.deploy.SparkHadoopUtil has a method:

/**
   * Get [[FileStatus]] objects for all leaf children (files) under the given base path. If the
   * given path points to a file, return a single-element collection containing [[FileStatus]] of
   * that file.
   */
  def listLeafStatuses(fs: FileSystem, basePath: Path): Seq[FileStatus] = {
    def recurse(path: Path) = {
      val (directories, leaves) = fs.listStatus(path).partition(_.isDir)
      leaves ++ directories.flatMap(f => listLeafStatuses(fs, f.getPath))
    }

    val baseStatus = fs.getFileStatus(basePath)
    if (baseStatus.isDir) recurse(basePath) else Array(baseStatus)
  }
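
To feed this into Spark SQL you could collect the leaf file paths and hand
them to parquetFile. A rough sketch (sc is the SparkContext, /data/base is
illustrative; on Spark 1.3+ parquetFile accepts several paths, on 1.2 you
would call it per path and union the results):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.deploy.SparkHadoopUtil
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val fs = FileSystem.get(sc.hadoopConfiguration)

// All leaf files under the base directory; adjust the filter to your file naming
val parquetPaths = SparkHadoopUtil.get
  .listLeafStatuses(fs, new Path("/data/base"))
  .map(_.getPath.toString)
  .filter(_.endsWith(".parquet"))

val data = sqlContext.parquetFile(parquetPaths: _*)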

— 
Best Regards!
Yijie Shen

On March 12, 2015 at 2:35:49 PM, Akhil Das (akhil@sigmoidanalytics.com) wrote:

Hi

We have a custom build that reads directories recursively; currently we use it with fileStream like this:

val lines = ssc.fileStream[LongWritable, Text, TextInputFormat]("/datadumps/",
     (t: Path) => true, true, true)

Making the 4th argument true to read recursively.


You could give it a try https://s3.amazonaws.com/sigmoidanalytics-builds/spark-1.2.0-bin-spark-1.2.0-hadoop2.4.0.tgz

Thanks
Best Regards

On Wed, Mar 11, 2015 at 9:45 PM, Masf <ma...@gmail.com> wrote:
Hi all

Is it possible to read folders recursively in order to read Parquet files?


Thanks.

--


Regards,
Miguel Ángel


Re: Read parquet folders recursively

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Hi

We have a custom build that reads directories recursively; currently we use
it with fileStream like this:

val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
  "/datadumps/",
  (t: Path) => true,  // path filter: accept every file
  true,               // newFilesOnly
  true)               // recursive: descend into sub-directories (this custom build)


Making the 4th argument true to read recursively.
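
The resulting DStream then behaves like any other fileStream; a minimal
sketch of consuming it (assuming the snippet above):

lines.map(_._2.toString).print()
ssc.start()
ssc.awaitTermination()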


You could give it a try:
https://s3.amazonaws.com/sigmoidanalytics-builds/spark-1.2.0-bin-spark-1.2.0-hadoop2.4.0.tgz

Thanks
Best Regards

On Wed, Mar 11, 2015 at 9:45 PM, Masf <ma...@gmail.com> wrote:

> Hi all
>
> Is it possible to read folders recursively in order to read Parquet files?
>
>
> Thanks.
>
> --
>
>
> Regards,
> Miguel Ángel
>