Posted to user@spark.apache.org by Eric Friedman <er...@gmail.com> on 2014/09/15 04:37:53 UTC

PathFilter for newAPIHadoopFile?

Hi,

I have a directory structure with parquet+avro data in it. There are a
couple of administrative files (.foo and/or _foo) that I need to ignore
when processing this data; otherwise Spark tries to read them as containing
parquet content, which they do not.

How can I set a PathFilter on the FileInputFormat used to construct an RDD?
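
For concreteness, one hedged way to do this from PySpark is to name a filter
class in the Hadoop configuration when building the RDD. Everything below is
an assumption rather than something confirmed in this thread: the config key
is the Hadoop 2 mapreduce one, the filter class is a hypothetical JVM-side
org.apache.hadoop.fs.PathFilter implementation that must already be on the
classpath, and the input format and value class are placeholders for a
parquet+avro setup.

conf = {"mapreduce.input.pathFilter.class": "com.example.SkipAdminFilesFilter"}
rdd = sc.newAPIHadoopFile(
    "/path/to/data",                             # example directory
    "parquet.hadoop.ParquetInputFormat",         # placeholder input format
    "java.lang.Void",
    "org.apache.avro.generic.GenericRecord",     # placeholder value class
    conf=conf,
)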

Re: PathFilter for newAPIHadoopFile?

Posted by Davies Liu <da...@databricks.com>.
Or maybe you could give this one a try:
https://labs.spotify.com/2013/05/07/snakebite/
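
A rough sketch of what that would look like, assuming snakebite's pure-Python
HDFS client and a NameNode at namenode.example.com:8020 (the host, port and
directory are examples, not values from this thread):

from snakebite.client import Client

client = Client("namenode.example.com", 8020)
# ls() yields one dict per entry; keep paths whose final component does not
# start with "." or "_" (the administrative files Eric wants to skip).
valid_paths = [
    entry["path"]
    for entry in client.ls(["/path/to/data"])
    if not entry["path"].rsplit("/", 1)[-1].startswith((".", "_"))
]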

On Mon, Sep 15, 2014 at 2:51 PM, Davies Liu <da...@databricks.com> wrote:
> There is one way to do it in bash: hadoop fs -ls xxxx; maybe you could
> end up with a bash script to do it.
>

Re: PathFilter for newAPIHadoopFile?

Posted by Davies Liu <da...@databricks.com>.
There is one way to do it in bash: hadoop fs -ls xxxx; maybe you could
end up with a bash script to do it.
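
A minimal sketch of that idea, driven from Python so the result can feed
straight into PySpark. It assumes the hadoop CLI is on PATH, and the parsing
of the -ls output is approximate:

import subprocess

def list_data_files(directory):
    # Plain files show up as lines starting with "-"; the path is the last column.
    listing = subprocess.check_output(["hadoop", "fs", "-ls", directory]).decode()
    files = [line.split()[-1] for line in listing.splitlines() if line.startswith("-")]
    # Drop administrative files such as ".foo" or "_foo".
    return [p for p in files if not p.rsplit("/", 1)[-1].startswith((".", "_"))]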

On Mon, Sep 15, 2014 at 1:01 PM, Eric Friedman
<er...@gmail.com> wrote:
> That's a good idea and one I had considered too.  Unfortunately I'm not
> aware of an API in PySpark for enumerating paths on HDFS.  Have I overlooked
> one?
>

Re: PathFilter for newAPIHadoopFile?

Posted by Eric Friedman <er...@gmail.com>.
That's a good idea and one I had considered too.  Unfortunately I'm not
aware of an API in PySpark for enumerating paths on HDFS.  Have I
overlooked one?
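
One possible workaround, sketched here with the caveat that sc._jvm and
sc._jsc are internal Py4J handles rather than a public PySpark API, and the
directory is only an example:

hadoop = sc._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
# List the directory and keep everything that is not a .foo / _foo file.
statuses = fs.listStatus(hadoop.fs.Path("/path/to/data"))
valid_paths = [
    status.getPath().toString()
    for status in statuses
    if not status.getPath().getName().startswith((".", "_"))
]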

On Mon, Sep 15, 2014 at 10:01 AM, Davies Liu <da...@databricks.com> wrote:

> In PySpark, I think you could enumerate all the valid files, and create
> RDD by
> newAPIHadoopFile(), then union them together.

Re: PathFilter for newAPIHadoopFile?

Posted by Davies Liu <da...@databricks.com>.
In PySpark, I think you could enumerate all the valid files, create an RDD
for each with newAPIHadoopFile(), and then union them together.
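
A hedged sketch of that approach; the input format and key/value classes are
placeholders for a parquet+avro setup (moving Avro records into Python may
also need key/value converters), and valid_paths is assumed to come from one
of the directory listings discussed earlier in the thread:

rdds = [
    sc.newAPIHadoopFile(
        path,
        "parquet.hadoop.ParquetInputFormat",         # placeholder input format
        "java.lang.Void",
        "org.apache.avro.generic.GenericRecord",     # placeholder value class
    )
    for path in valid_paths
]
combined = sc.union(rdds)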

On Mon, Sep 15, 2014 at 5:49 AM, Eric Friedman
<er...@gmail.com> wrote:
> I neglected to specify that I'm using pyspark. Doesn't look like these APIs have been bridged.

Re: PathFilter for newAPIHadoopFile?

Posted by Eric Friedman <er...@gmail.com>.
I neglected to specify that I'm using pyspark. Doesn't look like these APIs have been bridged. 

----
Eric Friedman


Re: PathFilter for newAPIHadoopFile?

Posted by Nat Padmanabhan <re...@gmail.com>.
Hi Eric,

Something along the lines of the following should work

val fs = getFileSystem(...) // standard hadoop API call
// pathFilter is an instance of org.apache.hadoop.fs.PathFilter
val filteredConcatenatedPaths = fs.listStatus(topLevelDirPath, pathFilter)
  .map(_.getPath.toString).mkString(",")
// ParquetInputFormat is a new-API (mapreduce) input format, so it goes through
// newAPIHadoopFile; the key is Void and the value is your Avro record type.
val parquetRdd = sc.newAPIHadoopFile(filteredConcatenatedPaths,
  classOf[ParquetInputFormat[SomeAvroType]], classOf[Void],
  classOf[SomeAvroType], getConfiguration(...))

You have to do some initialization on ParquetInputFormat, such as setting up
AvroReadSupport/AvroWriteSupport etc., but I am guessing you are doing that
already.
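
For PySpark readers, a rough sketch of that initialization done by passing
Hadoop configuration entries instead of calling the ParquetInputFormat
setters; the config key and class names are assumptions based on the
parquet-mr/parquet-avro of that era, so verify them against the versions on
your classpath:

conf = {"parquet.read.support.class": "parquet.avro.AvroReadSupport"}
rdd = sc.newAPIHadoopFile(
    filtered_concatenated_paths,                 # comma-separated paths, as in the Scala snippet
    "parquet.hadoop.ParquetInputFormat",
    "java.lang.Void",
    "org.apache.avro.generic.GenericRecord",     # placeholder value class
    conf=conf,
)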

Cheers,
Nat


On Sun, Sep 14, 2014 at 7:37 PM, Eric Friedman
<er...@gmail.com> wrote:
> Hi,
>
> I have a directory structure with parquet+avro data in it. There are a
> couple of administrative files (.foo and/or _foo) that I need to ignore when
> processing this data; otherwise Spark tries to read them as containing parquet
> content, which they do not.
>
> How can I set a PathFilter on the FileInputFormat used to construct an RDD?

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org