Posted to user@spark.apache.org by Masf <ma...@gmail.com> on 2015/03/11 17:15:13 UTC
Read parquet folders recursively
Hi all
Is it possible to read folders recursively in order to read Parquet files?
Thanks.
--
Regards.
Miguel Ángel
Re: Read parquet folders recursively
Posted by Akhil Das <ak...@sigmoidanalytics.com>.
With fileStream you are free to plug in any InputFormat; in your case, you
can easily plug in ParquetInputFormat. Here are some parquet-hadoop examples:
<https://github.com/Parquet/parquet-mr/tree/master/parquet-hadoop/src/main/java/parquet/hadoop/example>
.
Thanks
Best Regards
Re: Read parquet folders recursively
Posted by Masf <ma...@gmail.com>.
Hi.
Thanks for your answers, but to read Parquet files it is necessary to use the
parquetFile method in org.apache.spark.sql.SQLContext, is that right?
How can I combine your solution with a call to this method?
Thanks!!
Regards
--
Regards.
Miguel Ángel
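One way the two suggestions could be combined: collect the leaf statuses, keep only the Parquet files, and hand the resulting paths to parquetFile. The helper below is a minimal sketch for illustration; its name and the comma-separated-path convention are assumptions, not something confirmed in this thread, and how parquetFile accepts multiple paths varies by Spark version.

```scala
// Illustrative helper (hypothetical name): from leaf file paths, keep only
// Parquet files and build a comma-separated path list. The comma-joining
// convention is an assumption; check how your Spark version's parquetFile
// accepts multiple paths before relying on it.
def parquetPathArg(leafPaths: Seq[String]): String =
  leafPaths.filter(_.endsWith(".parquet")).mkString(",")

// Hypothetical usage, assuming `leaves` came from listLeafStatuses and a
// SQLContext named `sqlContext` is in scope:
// sqlContext.parquetFile(parquetPathArg(leaves.map(_.getPath.toString)))
```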
Re: Read parquet folders recursively
Posted by Yijie Shen <he...@gmail.com>.
org.apache.spark.deploy.SparkHadoopUtil has a method:

/**
 * Get [[FileStatus]] objects for all leaf children (files) under the given base path. If the
 * given path points to a file, return a single-element collection containing [[FileStatus]] of
 * that file.
 */
def listLeafStatuses(fs: FileSystem, basePath: Path): Seq[FileStatus] = {
  def recurse(path: Path) = {
    val (directories, leaves) = fs.listStatus(path).partition(_.isDir)
    leaves ++ directories.flatMap(f => listLeafStatuses(fs, f.getPath))
  }
  val baseStatus = fs.getFileStatus(basePath)
  if (baseStatus.isDir) recurse(basePath) else Array(baseStatus)
}
—
Best Regards!
Yijie Shen
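The same recursion can be exercised on a local filesystem with plain java.io.File. This is an illustrative analogue of listLeafStatuses, not Spark's or Hadoop's API; the function name here is made up for the sketch.

```scala
import java.io.File

// Local-filesystem analogue of listLeafStatuses (illustrative only):
// partition a directory's entries into subdirectories and files, keep the
// files, and recurse into the subdirectories. A plain file returns itself.
def listLeafFiles(base: File): Seq[File] =
  if (base.isDirectory) {
    val (dirs, files) = base.listFiles().toSeq.partition(_.isDirectory)
    files ++ dirs.flatMap(listLeafFiles)
  } else Seq(base)
```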
Re: Read parquet folders recursively
Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Hi
We have a custom build that reads directories recursively. Currently we use it
with fileStream like this:

val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
  "/datadumps/", (t: Path) => true, true, true)

Making the 4th argument true enables reading recursively.
You could give it a try
https://s3.amazonaws.com/sigmoidanalytics-builds/spark-1.2.0-bin-spark-1.2.0-hadoop2.4.0.tgz
Thanks
Best Regards