Posted to user@spark.apache.org by JF Chen <da...@gmail.com> on 2018/05/21 03:30:33 UTC

How to skip nonexistent file when read files with spark?

Hi Everyone
I have run into a tricky problem recently. I am trying to read some file
paths generated by another method. The file paths are represented by
wildcards in a list, like [ '/data/*/12', '/data/*/13'].
But in practice, if a wildcard does not match any existing path, Spark
throws an exception: "pyspark.sql.utils.AnalysisException: 'Path does not
exist: ...'", and the program stops after that.
What I actually want is for Spark to just ignore and skip these
nonexistent file paths and continue running. I have tried the Python
HDFSCli API to check the existence of each path, but HDFSCli does not
support wildcards.
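
For reference, the read is done roughly like this (a simplified sketch; the
json format and the variable names are just placeholders for my actual code):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

paths = ['/data/*/12', '/data/*/13']
# If any pattern matches no existing path, this raises
# pyspark.sql.utils.AnalysisException: 'Path does not exist: ...'
df = spark.read.json(paths)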

Any good idea to solve my problem? Thanks~

Regard,
Junfeng Chen

Re: How to skip nonexistent file when read files with spark?

Posted by JF Chen <da...@gmail.com>.
Thanks Thakrar~


Regard,
Junfeng Chen

On Tue, May 22, 2018 at 11:22 AM, Thakrar, Jayesh <
jthakrar@conversantmedia.com> wrote:

> Junfeng,
>
>
>
> I would suggest preprocessing/validating the paths in plain Python (and
> not Spark) before you try to fetch data.
>
> I am not familiar with Python Hadoop libraries, but see if this helps -
> http://crs4.github.io/pydoop/tutorial/hdfs_api.html
>
>
>
> Best,
>
> Jayesh
>
>
>
> *From: *JF Chen <da...@gmail.com>
> *Date: *Monday, May 21, 2018 at 10:20 PM
> *To: *ayan guha <gu...@gmail.com>
> *Cc: *"Thakrar, Jayesh" <jt...@conversantmedia.com>, user <
> user@spark.apache.org>
> *Subject: *Re: How to skip nonexistent file when read files with spark?
>
>
>
> Thanks ayan,
>
>
>
> I have also tried this method. The trickiest part is that the dataframe
> union method requires the same schema structure, while the schema of my
> files is variable.
>
>
>
>
> Regard,
> Junfeng Chen
>
>
>
> On Tue, May 22, 2018 at 10:33 AM, ayan guha <gu...@gmail.com> wrote:
>
> A relatively naive solution would be:
>
>
>
> 0. Create a dummy blank dataframe.
>
> 1. Loop through the list of paths.
>
> 2. Try to create a dataframe from each path. If it succeeds, union it
> cumulatively.
>
> 3. If there is an error, just ignore it or handle it as you wish.
>
>
>
> At the end of the loop, just use the unioned df. This should not add any
> performance overhead, as declaring dataframes and unioning them is not
> expensive unless you call an action within the loop.
>
>
>
> Best
>
> Ayan
>
>
>
> On Tue, 22 May 2018 at 11:27 am, JF Chen <da...@gmail.com> wrote:
>
> Thanks, Thakrar,
>
>
>
> I have tried to check the existence of each path before reading it, but the
> HDFSCli Python package does not seem to support wildcards.
> "FileSystem.globStatus" is a Java API, while I am using Python via Livy....
> Do you know of any Python API implementing the same function?
>
>
>
>
> Regard,
> Junfeng Chen
>
>
>
> On Mon, May 21, 2018 at 9:01 PM, Thakrar, Jayesh <
> jthakrar@conversantmedia.com> wrote:
>
> You can probably do some preprocessing/checking of the paths before you
> attempt to read them via Spark.
>
> Whether it is a local or HDFS filesystem, you can check for their existence
> and other details by using the "FileSystem.globStatus" method from the
> Hadoop API.
>
>
>
> *From: *JF Chen <da...@gmail.com>
> *Date: *Sunday, May 20, 2018 at 10:30 PM
> *To: *user <us...@spark.apache.org>
> *Subject: *How to skip nonexistent file when read files with spark?
>
>
>
> Hi Everyone
>
> I have run into a tricky problem recently. I am trying to read some file
> paths generated by another method. The file paths are represented by
> wildcards in a list, like [ '/data/*/12', '/data/*/13'].
>
> But in practice, if a wildcard does not match any existing path, Spark
> throws an exception: "pyspark.sql.utils.AnalysisException: 'Path does not
> exist: ...'", and the program stops after that.
>
> What I actually want is for Spark to just ignore and skip these
> nonexistent file paths and continue running. I have tried the Python
> HDFSCli API to check the existence of each path, but HDFSCli does not
> support wildcards.
>
>
>
> Any good idea to solve my problem? Thanks~
>
>
>
> Regard,
> Junfeng Chen
>
>
>
> --
>
> Best Regards,
> Ayan Guha
>
>
>

Re: How to skip nonexistent file when read files with spark?

Posted by "Thakrar, Jayesh" <jt...@conversantmedia.com>.
Junfeng,

I would suggest preprocessing/validating the paths in plain Python (and not Spark) before you try to fetch data.
I am not familiar with Python Hadoop libraries, but see if this helps - http://crs4.github.io/pydoop/tutorial/hdfs_api.html
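
Something along these lines might work with pydoop (untested; the
single-level wildcard handling, the hdfs.ls/hdfs.path.exists calls, and the
json read at the end are assumptions on my part):

import fnmatch
import pydoop.hdfs as hdfs

def expand_pattern(pattern):
    # Expand a pattern with a single wildcard segment, e.g. '/data/*/12',
    # by listing the parent of the '*' and matching each child against it.
    parts = pattern.split("/")
    star = next(i for i, p in enumerate(parts) if "*" in p)
    base = "/".join(parts[:star]) or "/"
    matched = []
    for entry in hdfs.ls(base):
        child = entry.rstrip("/").rsplit("/", 1)[-1]
        if fnmatch.fnmatch(child, parts[star]):
            candidate = "/".join([base.rstrip("/"), child] + parts[star + 1:])
            if hdfs.path.exists(candidate):
                matched.append(candidate)
    return matched

patterns = ['/data/*/12', '/data/*/13']
valid_paths = [p for pat in patterns for p in expand_pattern(pat)]
if valid_paths:
    df = spark.read.json(valid_paths)  # read only the paths that exist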

Best,
Jayesh

From: JF Chen <da...@gmail.com>
Date: Monday, May 21, 2018 at 10:20 PM
To: ayan guha <gu...@gmail.com>
Cc: "Thakrar, Jayesh" <jt...@conversantmedia.com>, user <us...@spark.apache.org>
Subject: Re: How to skip nonexistent file when read files with spark?

Thanks ayan,

I have also tried this method. The trickiest part is that the dataframe union method requires the same schema structure, while the schema of my files is variable.


Regard,
Junfeng Chen

On Tue, May 22, 2018 at 10:33 AM, ayan guha <gu...@gmail.com> wrote:
A relatively naive solution would be:

0. Create a dummy blank dataframe.
1. Loop through the list of paths.
2. Try to create a dataframe from each path. If it succeeds, union it cumulatively.
3. If there is an error, just ignore it or handle it as you wish.

At the end of the loop, just use the unioned df. This should not add any performance overhead, as declaring dataframes and unioning them is not expensive unless you call an action within the loop.

Best
Ayan

On Tue, 22 May 2018 at 11:27 am, JF Chen <da...@gmail.com> wrote:
Thanks, Thakrar,

I have tried to check the existence of each path before reading it, but the HDFSCli Python package does not seem to support wildcards.  "FileSystem.globStatus" is a Java API, while I am using Python via Livy.... Do you know of any Python API implementing the same function?


Regard,
Junfeng Chen

On Mon, May 21, 2018 at 9:01 PM, Thakrar, Jayesh <jt...@conversantmedia.com> wrote:
You can probably do some preprocessing/checking of the paths before you attempt to read them via Spark.
Whether it is a local or HDFS filesystem, you can check for their existence and other details by using the "FileSystem.globStatus" method from the Hadoop API.

From: JF Chen <da...@gmail.com>
Date: Sunday, May 20, 2018 at 10:30 PM
To: user <us...@spark.apache.org>
Subject: How to skip nonexistent file when read files with spark?

Hi Everyone
I have run into a tricky problem recently. I am trying to read some file paths generated by another method. The file paths are represented by wildcards in a list, like [ '/data/*/12', '/data/*/13'].
But in practice, if a wildcard does not match any existing path, Spark throws an exception: "pyspark.sql.utils.AnalysisException: 'Path does not exist: ...'", and the program stops after that.
What I actually want is for Spark to just ignore and skip these nonexistent file paths and continue running. I have tried the Python HDFSCli API to check the existence of each path, but HDFSCli does not support wildcards.

Any good idea to solve my problem? Thanks~

Regard,
Junfeng Chen

--
Best Regards,
Ayan Guha


Re: How to skip nonexistent file when read files with spark?

Posted by JF Chen <da...@gmail.com>.
Thanks ayan,

I have also tried this method. The trickiest part is that the dataframe
union method requires the same schema structure, while the schema of my
files is variable.


Regard,
Junfeng Chen

On Tue, May 22, 2018 at 10:33 AM, ayan guha <gu...@gmail.com> wrote:

> A relatively naive solution would be:
>
> 0. Create a dummy blank dataframe.
> 1. Loop through the list of paths.
> 2. Try to create a dataframe from each path. If it succeeds, union it
> cumulatively.
> 3. If there is an error, just ignore it or handle it as you wish.
>
> At the end of the loop, just use the unioned df. This should not add any
> performance overhead, as declaring dataframes and unioning them is not
> expensive unless you call an action within the loop.
>
> Best
> Ayan
>
> On Tue, 22 May 2018 at 11:27 am, JF Chen <da...@gmail.com> wrote:
>
>> Thanks, Thakrar,
>>
>> I have tried to check the existence of each path before reading it, but the
>> HDFSCli Python package does not seem to support wildcards.
>> "FileSystem.globStatus" is a Java API, while I am using Python via Livy....
>> Do you know of any Python API implementing the same function?
>>
>>
>> Regard,
>> Junfeng Chen
>>
>> On Mon, May 21, 2018 at 9:01 PM, Thakrar, Jayesh <
>> jthakrar@conversantmedia.com> wrote:
>>
>>> You can probably do some preprocessing/checking of the paths before you
>>> attempt to read them via Spark.
>>>
>>> Whether it is a local or HDFS filesystem, you can check for their
>>> existence and other details by using the "FileSystem.globStatus" method
>>> from the Hadoop API.
>>>
>>>
>>>
>>> *From: *JF Chen <da...@gmail.com>
>>> *Date: *Sunday, May 20, 2018 at 10:30 PM
>>> *To: *user <us...@spark.apache.org>
>>> *Subject: *How to skip nonexistent file when read files with spark?
>>>
>>>
>>>
>>> Hi Everyone
>>>
>>> I have run into a tricky problem recently. I am trying to read some file
>>> paths generated by another method. The file paths are represented by
>>> wildcards in a list, like [ '/data/*/12', '/data/*/13'].
>>>
>>> But in practice, if a wildcard does not match any existing path, Spark
>>> throws an exception: "pyspark.sql.utils.AnalysisException: 'Path does not
>>> exist: ...'", and the program stops after that.
>>>
>>> What I actually want is for Spark to just ignore and skip these
>>> nonexistent file paths and continue running. I have tried the Python
>>> HDFSCli API to check the existence of each path, but HDFSCli does not
>>> support wildcards.
>>>
>>>
>>>
>>> Any good idea to solve my problem? Thanks~
>>>
>>>
>>>
>>> Regard,
>>> Junfeng Chen
>>>
>>
>> --
> Best Regards,
> Ayan Guha
>

Re: How to skip nonexistent file when read files with spark?

Posted by ayan guha <gu...@gmail.com>.
A relatively naive solution would be:

0. Create a dummy blank dataframe.
1. Loop through the list of paths.
2. Try to create a dataframe from each path. If it succeeds, union it
cumulatively.
3. If there is an error, just ignore it or handle it as you wish.

At the end of the loop, just use the unioned df. This should not add any
performance overhead, as declaring dataframes and unioning them is not
expensive unless you call an action within the loop.
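
In PySpark that loop could look roughly like this (a sketch, not tested; the
json reader, catching AnalysisException, and collecting the frames in a list
instead of starting from a dummy blank dataframe are assumptions):

from functools import reduce
from pyspark.sql.utils import AnalysisException

paths = ['/data/*/12', '/data/*/13']
frames = []
for p in paths:
    try:
        # Schema inference may scan the files here; pass an explicit schema
        # to keep the loop cheap.
        frames.append(spark.read.json(p))
    except AnalysisException:
        # The pattern matched no existing path: skip it and move on.
        pass

if frames:
    # union/unionByName still requires compatible schemas across the frames.
    df = reduce(lambda a, b: a.unionByName(b), frames)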

Best
Ayan

On Tue, 22 May 2018 at 11:27 am, JF Chen <da...@gmail.com> wrote:

> Thanks, Thakrar,
>
> I have tried to check the existence of each path before reading it, but the
> HDFSCli Python package does not seem to support wildcards.
> "FileSystem.globStatus" is a Java API, while I am using Python via Livy....
> Do you know of any Python API implementing the same function?
>
>
> Regard,
> Junfeng Chen
>
> On Mon, May 21, 2018 at 9:01 PM, Thakrar, Jayesh <
> jthakrar@conversantmedia.com> wrote:
>
>> You can probably do some preprocessing/checking of the paths before you
>> attempt to read them via Spark.
>>
>> Whether it is a local or HDFS filesystem, you can check for their
>> existence and other details by using the "FileSystem.globStatus" method
>> from the Hadoop API.
>>
>>
>>
>> *From: *JF Chen <da...@gmail.com>
>> *Date: *Sunday, May 20, 2018 at 10:30 PM
>> *To: *user <us...@spark.apache.org>
>> *Subject: *How to skip nonexistent file when read files with spark?
>>
>>
>>
>> Hi Everyone
>>
>> I have run into a tricky problem recently. I am trying to read some file
>> paths generated by another method. The file paths are represented by
>> wildcards in a list, like [ '/data/*/12', '/data/*/13'].
>>
>> But in practice, if a wildcard does not match any existing path, Spark
>> throws an exception: "pyspark.sql.utils.AnalysisException: 'Path does not
>> exist: ...'", and the program stops after that.
>>
>> What I actually want is for Spark to just ignore and skip these
>> nonexistent file paths and continue running. I have tried the Python
>> HDFSCli API to check the existence of each path, but HDFSCli does not
>> support wildcards.
>>
>>
>>
>> Any good idea to solve my problem? Thanks~
>>
>>
>>
>> Regard,
>> Junfeng Chen
>>
>
> --
Best Regards,
Ayan Guha

Re: How to skip nonexistent file when read files with spark?

Posted by JF Chen <da...@gmail.com>.
Thanks, Thakrar,

I have tried to check the existence of each path before reading it, but the
HDFSCli Python package does not seem to support wildcards.
"FileSystem.globStatus" is a Java API, while I am using Python via Livy....
Do you know of any Python API implementing the same function?


Regard,
Junfeng Chen

On Mon, May 21, 2018 at 9:01 PM, Thakrar, Jayesh <
jthakrar@conversantmedia.com> wrote:

> You can probably do some preprocessing/checking of the paths before you
> attempt to read them via Spark.
>
> Whether it is a local or HDFS filesystem, you can check for their existence
> and other details by using the "FileSystem.globStatus" method from the
> Hadoop API.
>
>
>
> *From: *JF Chen <da...@gmail.com>
> *Date: *Sunday, May 20, 2018 at 10:30 PM
> *To: *user <us...@spark.apache.org>
> *Subject: *How to skip nonexistent file when read files with spark?
>
>
>
> Hi Everyone
>
> I have run into a tricky problem recently. I am trying to read some file
> paths generated by another method. The file paths are represented by
> wildcards in a list, like [ '/data/*/12', '/data/*/13'].
>
> But in practice, if a wildcard does not match any existing path, Spark
> throws an exception: "pyspark.sql.utils.AnalysisException: 'Path does not
> exist: ...'", and the program stops after that.
>
> What I actually want is for Spark to just ignore and skip these
> nonexistent file paths and continue running. I have tried the Python
> HDFSCli API to check the existence of each path, but HDFSCli does not
> support wildcards.
>
>
>
> Any good idea to solve my problem? Thanks~
>
>
>
> Regard,
> Junfeng Chen
>

Re: How to skip nonexistent file when read files with spark?

Posted by "Thakrar, Jayesh" <jt...@conversantmedia.com>.
You can probably do some preprocessing/checking of the paths before you attempt to read them via Spark.
Whether it is a local or HDFS filesystem, you can check for their existence and other details by using the "FileSystem.globStatus" method from the Hadoop API.
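
From PySpark, that method can be reached through the py4j gateway, roughly
like this (an untested sketch; sc._jvm and sc._jsc are internal handles
rather than public API):

sc = spark.sparkContext
hadoop_fs = sc._jvm.org.apache.hadoop.fs
fs = hadoop_fs.FileSystem.get(sc._jsc.hadoopConfiguration())

def glob_paths(pattern):
    # globStatus returns the matching FileStatus entries; with no match it
    # comes back empty (or None for a non-glob path that does not exist).
    statuses = fs.globStatus(hadoop_fs.Path(pattern))
    return [str(s.getPath()) for s in statuses] if statuses else []

patterns = ['/data/*/12', '/data/*/13']
existing = [p for pat in patterns for p in glob_paths(pat)]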

From: JF Chen <da...@gmail.com>
Date: Sunday, May 20, 2018 at 10:30 PM
To: user <us...@spark.apache.org>
Subject: How to skip nonexistent file when read files with spark?

Hi Everyone
I have run into a tricky problem recently. I am trying to read some file paths generated by another method. The file paths are represented by wildcards in a list, like [ '/data/*/12', '/data/*/13'].
But in practice, if a wildcard does not match any existing path, Spark throws an exception: "pyspark.sql.utils.AnalysisException: 'Path does not exist: ...'", and the program stops after that.
What I actually want is for Spark to just ignore and skip these nonexistent file paths and continue running. I have tried the Python HDFSCli API to check the existence of each path, but HDFSCli does not support wildcards.

Any good idea to solve my problem? Thanks~

Regard,
Junfeng Chen