Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2018/11/21 07:26:00 UTC

[jira] [Commented] (SPARK-26128) filter breaks input_file_name

    [ https://issues.apache.org/jira/browse/SPARK-26128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16694330#comment-16694330 ] 

Hyukjin Kwon commented on SPARK-26128:
--------------------------------------

I can't reproduce this:

```
scala> spark.range(10).write.parquet("/tmp/newparquet")
18/11/21 15:23:16 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
Scaling row group sizes to 96.54% for 7 writers
18/11/21 15:23:16 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
Scaling row group sizes to 84.47% for 8 writers
18/11/21 15:23:16 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
Scaling row group sizes to 96.54% for 7 writers

scala> spark.read.parquet("/tmp/newparquet").where("id > 5").select(input_file_name()).show(5,false)
+------------------------------------------------------------------------------------------+
|input_file_name()                                                                         |
+------------------------------------------------------------------------------------------+
|file:///tmp/newparquet/part-00007-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
|file:///tmp/newparquet/part-00007-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
|file:///tmp/newparquet/part-00006-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
|file:///tmp/newparquet/part-00005-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
+------------------------------------------------------------------------------------------+


scala> spark.read.parquet("/tmp/newparquet").select(input_file_name()).show(5,false)
+------------------------------------------------------------------------------------------+
|input_file_name()                                                                         |
+------------------------------------------------------------------------------------------+
|file:///tmp/newparquet/part-00007-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
|file:///tmp/newparquet/part-00007-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
|file:///tmp/newparquet/part-00003-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
|file:///tmp/newparquet/part-00003-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
|file:///tmp/newparquet/part-00000-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
+------------------------------------------------------------------------------------------+
only showing top 5 rows

```

Mind showing how {{/tmp/newparquet}} was created?
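
Judging from the paths in the report, the directory looks like a Hive-partitioned layout ({{tenant=...}}/{{year=...}}/...) with a nested {{key}} struct that has a {{station}} field. A minimal sketch of how such a dataset could be written for a reproduction attempt; the extra {{sensor}} and {{value}} fields and the literal values are made up for the example, only {{key.station}}, {{tenant}} and {{year}} come from the report:

```
// spark-shell already imports spark.implicits._ and
// org.apache.spark.sql.functions._; shown here for completeness.
import spark.implicits._
import org.apache.spark.sql.functions._

// Hypothetical reproduction: a nested "key" struct plus Hive-style partition
// columns, loosely mirroring the directory layout quoted below.
case class Key(station: String, sensor: String)
case class Record(key: Key, value: Double)

val input = Seq(
  Record(Key("XYZ", "temp"), 1.0),
  Record(Key("ABC", "temp"), 2.0)
).toDF()
  .withColumn("tenant", lit("NA"))
  .withColumn("year", lit(2017))

input.write
  .partitionBy("tenant", "year")
  .mode("overwrite")
  .parquet("/tmp/newparquet")

// Re-run the failing shape of the query against this layout:
spark.read.parquet("/tmp/newparquet")
  .where("key.station = 'XYZ'")
  .select(input_file_name())
  .show(5, false)
```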

> filter breaks input_file_name
> -----------------------------
>
>                 Key: SPARK-26128
>                 URL: https://issues.apache.org/jira/browse/SPARK-26128
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Shell
>    Affects Versions: 2.3.2
>            Reporter: Paul Praet
>            Priority: Minor
>
> This works:
> {code:java}
> scala> spark.read.parquet("/tmp/newparquet").select(input_file_name).show(5,false)
> +-----------------------------------------------------------------------------------------------------------------------------------------------------+
> |input_file_name()                                                                                                                                    |
> +-----------------------------------------------------------------------------------------------------------------------------------------------------+
> |file:///tmp/newparquet/parquet-5-PT6H/junit/data/tenant=NA/year=2017/month=201704/day=20170406/hour=2017040618/data.eu-west-1b.290.PT6H.FINAL.parquet|
> |file:///tmp/newparquet/parquet-5-PT6H/junit/data/tenant=NA/year=2017/month=201704/day=20170406/hour=2017040618/data.eu-west-1b.290.PT6H.FINAL.parquet|
> |file:///tmp/newparquet/parquet-5-PT6H/junit/data/tenant=NA/year=2017/month=201704/day=20170406/hour=2017040618/data.eu-west-1b.290.PT6H.FINAL.parquet|
> |file:///tmp/newparquet/parquet-5-PT6H/junit/data/tenant=NA/year=2017/month=201704/day=20170406/hour=2017040618/data.eu-west-1b.290.PT6H.FINAL.parquet|
> |file:///tmp/newparquet/parquet-5-PT6H/junit/data/tenant=NA/year=2017/month=201704/day=20170406/hour=2017040618/data.eu-west-1b.290.PT6H.FINAL.parquet|
> +-----------------------------------------------------------------------------------------------------------------------------------------------------+
> {code}
> When adding a filter:
> {code:java}
> scala> spark.read.parquet("/tmp/newparquet").where("key.station='XYZ'").select(input_file_name()).show(5,false)
> +-----------------+
> |input_file_name()|
> +-----------------+
> | |
> | |
> | |
> | |
> | |
> +-----------------+
> {code}
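
A diagnostic step that may help narrow this down (a sketch only; it inspects the plans, it does not fix anything) is to compare the physical plans of the filtered and unfiltered reads:

```
// Diagnostic sketch: compare the physical plans with and without the filter.
// If the filtered query ends up with a different scan or reader path, that
// is the place to dig into why input_file_name() comes back empty.
val base = spark.read.parquet("/tmp/newparquet")

base.select(input_file_name()).explain(true)

base.where("key.station = 'XYZ'")
  .select(input_file_name())
  .explain(true)
```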



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org