Posted to issues@spark.apache.org by "colin fang (Jira)" <ji...@apache.org> on 2020/10/19 18:08:00 UTC
[jira] [Updated] (SPARK-33184) Spark doesn't read a data source
column if it is needed as an index into an array in a nested struct
[ https://issues.apache.org/jira/browse/SPARK-33184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
colin fang updated SPARK-33184:
-------------------------------
Description:
{code:python}
df = spark.createDataFrame([[1, [[1, 2]]]], schema='x:int,y:struct<a:array<int>>')
df.write.mode('overwrite').parquet('test')
{code}
{code:python}
from pyspark.sql import functions as F

# Fails at runtime with "Caused by: java.lang.RuntimeException: Couldn't find x#720 in [y#721]"
spark.read.parquet('test').select(F.expr('y.a[x]')).show()

# explain() succeeds; note that x is absent from ReadSchema, so the scan never reads it
spark.read.parquet('test').select(F.expr('y.a[x]')).explain()
== Physical Plan ==
*(1) !Project [y#713.a[x#712] AS y.a AS `a`[x]#717]
+- FileScan parquet [y#713] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<y:struct<a:array<int>>>
{code}
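The leading `!` on `!Project` marks an invalid plan node: the projection references `x#712`, but the child `FileScan` only outputs `[y#713]`. A minimal plain-Python sketch of that consistency check (the function name is illustrative, not an actual Spark API):

{code:python}
def missing_attributes(referenced, child_output):
    """Return the attributes a node needs but its child never supplies."""
    return set(referenced) - set(child_output)

# The Project references both x and y, but the FileScan only outputs y,
# so x is missing -- hence "Couldn't find x#720 in [y#721]" at runtime.
print(missing_attributes({"x", "y"}, {"y"}))  # {'x'}
{code}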
The code works if I either:
- manually select the missing column: `spark.read.parquet('test').select(F.expr('y.a[x]'), F.col('x')).show()`
- or use the `F.element_at` function, which is 1-based (hence the `+ 1`): `spark.read.parquet('test').select(F.element_at('y.a', F.col('x') + 1)).show()`
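The `+ 1` in the second workaround is needed because `element_at` uses 1-based indexing, while `y.a[x]` indexes from 0. A plain-Python sketch of the equivalence (illustrative helpers, not Spark APIs; out-of-range handling simplified to `None` in place of SQL null):

{code:python}
def bracket_get(arr, i):
    # Mimics Spark's arr[i]: 0-based indexing
    return arr[i] if 0 <= i < len(arr) else None

def element_at_get(arr, i):
    # Mimics element_at(arr, i): 1-based indexing
    return arr[i - 1] if 1 <= i <= len(arr) else None

# y.a[x] with a = [1, 2], x = 1 picks the second element...
assert bracket_get([1, 2], 1) == 2
# ...which element_at reaches only with index x + 1
assert element_at_get([1, 2], 1 + 1) == 2
{code}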
> Spark doesn't read a data source column if it is needed as an index into an array in a nested struct
> ------------------------------------------------------------------------------------------------
>
> Key: SPARK-33184
> URL: https://issues.apache.org/jira/browse/SPARK-33184
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: colin fang
> Priority: Minor
>
> {code:python}
> df = spark.createDataFrame([[1, [[1, 2]]]], schema='x:int,y:struct<a:array<int>>')
> df.write.mode('overwrite').parquet('test')
> {code}
> {code:python}
> from pyspark.sql import functions as F
>
> # Fails at runtime with "Caused by: java.lang.RuntimeException: Couldn't find x#720 in [y#721]"
> spark.read.parquet('test').select(F.expr('y.a[x]')).show()
>
> # explain() succeeds; note that x is absent from ReadSchema, so the scan never reads it
> spark.read.parquet('test').select(F.expr('y.a[x]')).explain()
> == Physical Plan ==
> *(1) !Project [y#713.a[x#712] AS y.a AS `a`[x]#717]
> +- FileScan parquet [y#713] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<y:struct<a:array<int>>>
> {code}
> The code works if I either:
> - manually select the missing column: `spark.read.parquet('test').select(F.expr('y.a[x]'), F.col('x')).show()`
> - or use the `F.element_at` function, which is 1-based (hence the `+ 1`): `spark.read.parquet('test').select(F.element_at('y.a', F.col('x') + 1)).show()`
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org