Posted to dev@spark.apache.org by Dhrubajyoti Hati <dh...@gmail.com> on 2020/04/22 19:15:22 UTC

Re: Error while reading hive tables with tmp/hidden files inside partitions

Just wondering if anyone could help me out on this.

Thank you!




*Regards, Dhrubajyoti Hati.*


On Wed, Apr 22, 2020 at 7:15 PM Dhrubajyoti Hati <dh...@gmail.com>
wrote:

> Hi,
>
> Is there a way to discard files starting with a dot (.) or ending with .tmp
> in the Hive partitions while reading a Hive table with the spark.read.table
> method?
>
> I tried using a PathFilter, but it didn't work. I am using spark-submit
> and passing my Python (PySpark) file containing the source code.
>
> spark.sparkContext._jsc.hadoopConfiguration().set("mapreduce.input.pathFilter.class",
> "com.abc.hadoop.utility.TmpFileFilter")
>
> import org.apache.hadoop.fs.{Path, PathFilter}
>
> class TmpFileFilter extends PathFilter {
>   // Accept every file except those ending in ".tmp"
>   override def accept(path: Path): Boolean = !path.getName.endsWith(".tmp")
> }
>
> Still, in the debug logs I can see that .tmp files are being considered:
> 20/04/22 12:58:44 DEBUG MapRFileSystem: getMapRFileStatus
> maprfs:///a/hour=05/host=abc/FlumeData.1587559137715
> 20/04/22 12:58:44 DEBUG MapRFileSystem: getMapRFileStatus
> maprfs:///a/hour=05/host=abc/FlumeData.1587556815621
> 20/04/22 12:58:44 DEBUG MapRFileSystem: getMapRFileStatus
> maprfs:///a/hour=05/host=abc/.FlumeData.1587560277337.tmp
>
>
> Is there any way to discard the .tmp files or hidden files (filenames
> starting with a dot or an underscore) in Hive partitions when reading from
> Spark?
>
>
>
>
> *Regards, Dhrubajyoti Hati.*
>
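
For reference, one manual workaround for the problem described above is to list the
partition files through the Hadoop FileSystem API and hand only the clean paths to the
reader. The sketch below is illustrative, not something the thread settled on: the
partition location is taken from the debug log above, and the file format ("text") is an
assumption, since this approach bypasses the Hive metastore and needs to know the
on-disk format.

# Minimal PySpark sketch, assuming an existing Hive-enabled SparkSession named `spark`.
jvm = spark.sparkContext._jvm
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()

# Partition directory taken from the debug log above.
partition_dir = jvm.org.apache.hadoop.fs.Path("maprfs:///a/hour=05/host=abc")
fs = partition_dir.getFileSystem(hadoop_conf)

# Keep only files that are neither hidden (leading "." or "_") nor temporary (".tmp").
good_paths = [
    status.getPath().toString()
    for status in fs.listStatus(partition_dir)
    if not status.getPath().getName().startswith((".", "_"))
    and not status.getPath().getName().endswith(".tmp")
]

# The format is an assumption; the thread never says what the FlumeData files contain.
df = spark.read.format("text").load(good_paths)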

Re: Error while reading hive tables with tmp/hidden files inside partitions

Posted by Wenchen Fan <cl...@gmail.com>.
Yea, please report the bug on a supported Spark version like 2.4.

On Thu, Apr 23, 2020 at 3:40 PM Dhrubajyoti Hati <dh...@gmail.com>
wrote:

> FYI we are using Spark 2.2.0. Should the change be present in this spark
> version? Wanted to check before opening a JIRA ticket?
>
>
>
>
> *Regards,Dhrubajyoti Hati.*
>
>
> On Thu, Apr 23, 2020 at 10:12 AM Wenchen Fan <cl...@gmail.com> wrote:
>
>> This looks like a bug that path filter doesn't work for hive table
>> reading. Can you open a JIRA ticket?
>>
>> On Thu, Apr 23, 2020 at 3:15 AM Dhrubajyoti Hati <dh...@gmail.com>
>> wrote:
>>
>>> Just wondering if any one could help me out on this.
>>>
>>> Thank you!
>>>
>>>
>>>
>>>
>>> *Regards,Dhrubajyoti Hati.*
>>>
>>>
>>> On Wed, Apr 22, 2020 at 7:15 PM Dhrubajyoti Hati <dh...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Is there any way to discard files starting with dot(.) or ending with
>>>> .tmp in the hive partition while reading from Hive table using
>>>> spark.read.table method.
>>>>
>>>> I tried using PathFilters but they didn't work. I am using spark-submit
>>>> and passing my python file(pyspark) containing the source code.
>>>>
>>>> spark.sparkContext._jsc.hadoopConfiguration().set("mapreduce.input.pathFilter.class",
>>>> "com.abc.hadoop.utility.TmpFileFilter")
>>>>
>>>> class TmpFileFilter extends PathFilter {
>>>>   override def accept(path : Path): Boolean = !path.getName.endsWith(".tmp")
>>>> }
>>>>
>>>> Still in the detailed logs I can see .tmp files are getting considered
>>>> in the detailed logs:
>>>> 20/04/22 12:58:44 DEBUG MapRFileSystem: getMapRFileStatus
>>>> maprfs:///a/hour=05/host=abc/FlumeData.1587559137715
>>>> 20/04/22 12:58:44 DEBUG MapRFileSystem: getMapRFileStatus
>>>> maprfs:///a/hour=05/host=abc/FlumeData.1587556815621
>>>> 20/04/22 12:58:44 DEBUG MapRFileSystem: getMapRFileStatus
>>>> maprfs:///a/hour=05/host=abc/.FlumeData.1587560277337.tmp
>>>>
>>>>
>>>> Is there any way to discard the tmp(.tmp) or the hidden files(filename
>>>> starting with dot or underscore) in hive partitions while reading from
>>>> spark?
>>>>
>>>>
>>>>
>>>>
>>>> *Regards,Dhrubajyoti Hati.*
>>>>
>>>

Re: Error while reading hive tables with tmp/hidden files inside partitions

Posted by Dhrubajyoti Hati <dh...@gmail.com>.
FYI, we are using Spark 2.2.0. Should this already work in this Spark
version? I wanted to check before opening a JIRA ticket.




*Regards, Dhrubajyoti Hati.*


On Thu, Apr 23, 2020 at 10:12 AM Wenchen Fan <cl...@gmail.com> wrote:

> This looks like a bug that path filter doesn't work for hive table
> reading. Can you open a JIRA ticket?
>
> On Thu, Apr 23, 2020 at 3:15 AM Dhrubajyoti Hati <dh...@gmail.com>
> wrote:
>
>> Just wondering if any one could help me out on this.
>>
>> Thank you!
>>
>>
>>
>>
>> *Regards,Dhrubajyoti Hati.*
>>
>>
>> On Wed, Apr 22, 2020 at 7:15 PM Dhrubajyoti Hati <dh...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Is there any way to discard files starting with dot(.) or ending with
>>> .tmp in the hive partition while reading from Hive table using
>>> spark.read.table method.
>>>
>>> I tried using PathFilters but they didn't work. I am using spark-submit
>>> and passing my python file(pyspark) containing the source code.
>>>
>>> spark.sparkContext._jsc.hadoopConfiguration().set("mapreduce.input.pathFilter.class",
>>> "com.abc.hadoop.utility.TmpFileFilter")
>>>
>>> class TmpFileFilter extends PathFilter {
>>>   override def accept(path : Path): Boolean = !path.getName.endsWith(".tmp")
>>> }
>>>
>>> Still in the detailed logs I can see .tmp files are getting considered
>>> in the detailed logs:
>>> 20/04/22 12:58:44 DEBUG MapRFileSystem: getMapRFileStatus
>>> maprfs:///a/hour=05/host=abc/FlumeData.1587559137715
>>> 20/04/22 12:58:44 DEBUG MapRFileSystem: getMapRFileStatus
>>> maprfs:///a/hour=05/host=abc/FlumeData.1587556815621
>>> 20/04/22 12:58:44 DEBUG MapRFileSystem: getMapRFileStatus
>>> maprfs:///a/hour=05/host=abc/.FlumeData.1587560277337.tmp
>>>
>>>
>>> Is there any way to discard the tmp(.tmp) or the hidden files(filename
>>> starting with dot or underscore) in hive partitions while reading from
>>> spark?
>>>
>>>
>>>
>>>
>>> *Regards,Dhrubajyoti Hati.*
>>>
>>
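
Before the JIRA goes in, a quick way to capture the details it will need (the running
Spark version and the effective path-filter setting) from the same PySpark session is
sketched below; this is only a diagnostic suggestion, not something from the thread.

# Hypothetical sanity check, assuming the existing SparkSession `spark` from the job above.
print(spark.version)  # e.g. "2.2.0"
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
# Should print "com.abc.hadoop.utility.TmpFileFilter" if the setting took effect.
print(hadoop_conf.get("mapreduce.input.pathFilter.class"))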

Re: Error while reading hive tables with tmp/hidden files inside partitions

Posted by Wenchen Fan <cl...@gmail.com>.
This looks like a bug: the path filter is not applied when reading Hive tables.
Can you open a JIRA ticket?

On Thu, Apr 23, 2020 at 3:15 AM Dhrubajyoti Hati <dh...@gmail.com>
wrote:

> Just wondering if any one could help me out on this.
>
> Thank you!
>
>
>
>
> *Regards,Dhrubajyoti Hati.*
>
>
> On Wed, Apr 22, 2020 at 7:15 PM Dhrubajyoti Hati <dh...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Is there any way to discard files starting with dot(.) or ending with
>> .tmp in the hive partition while reading from Hive table using
>> spark.read.table method.
>>
>> I tried using PathFilters but they didn't work. I am using spark-submit
>> and passing my python file(pyspark) containing the source code.
>>
>> spark.sparkContext._jsc.hadoopConfiguration().set("mapreduce.input.pathFilter.class",
>> "com.abc.hadoop.utility.TmpFileFilter")
>>
>> class TmpFileFilter extends PathFilter {
>>   override def accept(path : Path): Boolean = !path.getName.endsWith(".tmp")
>> }
>>
>> Still in the detailed logs I can see .tmp files are getting considered in
>> the detailed logs:
>> 20/04/22 12:58:44 DEBUG MapRFileSystem: getMapRFileStatus
>> maprfs:///a/hour=05/host=abc/FlumeData.1587559137715
>> 20/04/22 12:58:44 DEBUG MapRFileSystem: getMapRFileStatus
>> maprfs:///a/hour=05/host=abc/FlumeData.1587556815621
>> 20/04/22 12:58:44 DEBUG MapRFileSystem: getMapRFileStatus
>> maprfs:///a/hour=05/host=abc/.FlumeData.1587560277337.tmp
>>
>>
>> Is there any way to discard the tmp(.tmp) or the hidden files(filename
>> starting with dot or underscore) in hive partitions while reading from
>> spark?
>>
>>
>>
>>
>> *Regards,Dhrubajyoti Hati.*
>>
>
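
For completeness, the usual way to register such a filter class from a PySpark job is
via a spark.hadoop.* key at session build time, as sketched below. Whether the Hive
table read path actually honors the filter is exactly the suspected bug in this thread,
so this is a sketch of the attempted configuration rather than a confirmed fix; the jar
path and table name are hypothetical.

from pyspark.sql import SparkSession

# Sketch only: assumes the compiled com.abc.hadoop.utility.TmpFileFilter class is packaged
# in a jar that is shipped to the driver and executors. Any "spark.hadoop.*" key is copied
# into the Hadoop Configuration that Spark builds for the job.
spark = (
    SparkSession.builder
    .appName("hive-read-with-path-filter")
    .enableHiveSupport()
    .config("spark.jars", "/path/to/tmp-file-filter.jar")  # hypothetical jar location
    .config("spark.hadoop.mapreduce.input.pathFilter.class",
            "com.abc.hadoop.utility.TmpFileFilter")
    .getOrCreate()
)

df = spark.read.table("my_db.my_table")  # hypothetical table name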
