Posted to issues@spark.apache.org by "Jurriaan Pruis (JIRA)" <ji...@apache.org> on 2016/04/04 08:59:25 UTC
[jira] [Updated] (SPARK-14343) Dataframe operations on a
partitioned dataset (using partition discovery) return invalid results
[ https://issues.apache.org/jira/browse/SPARK-14343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jurriaan Pruis updated SPARK-14343:
-----------------------------------
Description:
When reading a partitioned dataset using {{sqlContext.read.text()}}, queries on the partition column return invalid results.
h2. How to reproduce:
h3. Generate datasets
{code:title=repro.sh}
#!/bin/sh
mkdir -p dataset/year=2014
mkdir -p dataset/year=2015
echo "data from 2014" > dataset/year=2014/part01.txt
echo "data from 2015" > dataset/year=2015/part01.txt
{code}
{code:title=repro2.sh}
#!/bin/sh
mkdir -p dataset2/month=june
mkdir -p dataset2/month=july
echo "data from june" > dataset2/month=june/part01.txt
echo "data from july" > dataset2/month=july/part01.txt
{code}
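For anyone without a POSIX shell, the same directory layout can be produced from Python. This is only a sketch of the two scripts above; the helper name {{make_dataset}} is made up for illustration, and it writes into a temporary directory rather than the current one.

```python
import os
import tempfile


def make_dataset(root, name, column, values):
    """Create root/name/column=value/part01.txt for each value."""
    paths = []
    for value in values:
        part_dir = os.path.join(root, name, "{}={}".format(column, value))
        os.makedirs(part_dir, exist_ok=True)
        path = os.path.join(part_dir, "part01.txt")
        with open(path, "w") as handle:
            # Same single line of content that the echo commands write.
            handle.write("data from {}\n".format(value))
        paths.append(path)
    return paths


# Written to a temporary directory here; pass "." to mirror the shell scripts.
root = tempfile.mkdtemp()
created = make_dataset(root, "dataset", "year", ["2014", "2015"])
created += make_dataset(root, "dataset2", "month", ["june", "july"])
for path in created:
    print(os.path.relpath(path, root))
```

Pointing {{sqlContext.read.text()}} at the resulting {{dataset}} or {{dataset2}} directory should then behave the same as with the shell-generated files.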
h3. Using the first dataset
{code:none}
>>> df = sqlContext.read.text('dataset')
...
>>> df
DataFrame[value: string, year: int]
>>> df.show()
+--------------+----+
|         value|year|
+--------------+----+
|data from 2014|2014|
|data from 2015|2015|
+--------------+----+
>>> df.select('year').show()
+----+
|year|
+----+
|  14|
|  14|
+----+
{code}
This is clearly wrong. It seems to return the length of the {{value}} column instead: both strings are 14 characters long.
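The length guess can be checked in plain Python without Spark. This only shows that the numbers are consistent with the guess; it does not prove that this is what Spark is actually doing.

```python
# The two lines written by repro.sh, as they appear in the value column.
values = ["data from 2014", "data from 2015"]

# Character count of each value, to compare with the bogus year column.
lengths = [len(v) for v in values]
print(lengths)
```

Both lengths match the number shown in the {{year}} column of the faulty output.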
h3. Using the second dataset
With another dataset it looks like this:
{code:none}
>>> df = sqlContext.read.text('dataset2')
>>> df
DataFrame[value: string, month: string]
>>> df.show()
+--------------+-----+
|         value|month|
+--------------+-----+
|data from june| june|
|data from july| july|
+--------------+-----+
>>> df.select('month').show()
+--------------+
|         month|
+--------------+
|data from june|
|data from july|
+--------------+
{code}
Here it returns the contents of the {{value}} column instead of the {{month}} partition value.
h3. Workaround
When I convert the DataFrame to an RDD and back to a DataFrame, I get the following result (which is the expected behaviour):
{code:none}
>>> df.rdd.toDF().select('month').show()
+-----+
|month|
+-----+
| june|
| july|
+-----+
{code}
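To make the expected behaviour explicit without depending on Spark at all, the partition values can be derived directly from the {{key=value}} directory names. The sketch below is not Spark's partition discovery code; {{infer_type}}, {{expected_rows}} and {{demo}} are hypothetical helpers, and the type inference is reduced to "int if it parses, otherwise string".

```python
import os
import tempfile


def infer_type(raw):
    """Return raw as an int when it parses as one, otherwise keep the string."""
    try:
        return int(raw)
    except ValueError:
        return raw


def expected_rows(root):
    """Yield (value, partition_value) for every line under root/key=value/."""
    for entry in sorted(os.listdir(root)):
        part_dir = os.path.join(root, entry)
        if not os.path.isdir(part_dir) or "=" not in entry:
            continue
        _key, raw = entry.split("=", 1)
        for name in sorted(os.listdir(part_dir)):
            with open(os.path.join(part_dir, name)) as handle:
                for line in handle:
                    yield line.rstrip("\n"), infer_type(raw)


def demo():
    """Rebuild dataset2 in a temporary directory and list its expected rows."""
    with tempfile.TemporaryDirectory() as tmp:
        for month in ("june", "july"):
            part_dir = os.path.join(tmp, "dataset2", "month=" + month)
            os.makedirs(part_dir)
            with open(os.path.join(part_dir, "part01.txt"), "w") as handle:
                handle.write("data from " + month + "\n")
        return list(expected_rows(os.path.join(tmp, "dataset2")))


rows = demo()
# Only the partition column, i.e. what select('month') is expected to show.
print([partition for _value, partition in rows])
```

The partition column here contains only {{june}} and {{july}} (in directory-name order), which matches the output of the RDD round trip and not the output of {{df.select('month')}}.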
was:
When reading a dataset using {{sqlContext.read.text()}} queries on the partitioned column return invalid results.
h2. How to reproduce:
h3. Generate datasets
{code:title=repro.sh}
#!/bin/sh
mkdir -p dataset/year=2014
mkdir -p dataset/year=2015
echo "data from 2014" > dataset/year=2014/part01.txt
echo "data from 2015" > dataset/year=2015/part01.txt
{code}
{code:title=repro2.sh}
#!/bin/sh
mkdir -p dataset2/month=june
mkdir -p dataset2/month=july
echo "data from june" > dataset2/month=june/part01.txt
echo "data from july" > dataset2/month=july/part01.txt
{code}
h3. using first dataset
{code:none}
>>> df = sqlContext.read.text('dataset')
...
>>> df
DataFrame[value: string, year: int]
>>> df.show()
+--------------+----+
|         value|year|
+--------------+----+
|data from 2014|2014|
|data from 2015|2015|
+--------------+----+
>>> df.select('year').show()
+----+
|year|
+----+
|  14|
|  14|
+----+
{code}
This is clearly wrong. Seems like it returns the length of the value column?
h3. using second dataset
With another dataset it looks like this:
{code:none}
>>> df = sqlContext.read.text('dataset2')
>>> df
DataFrame[value: string, month: string]
>>> df.show()
+--------------+-----+
|         value|month|
+--------------+-----+
|data from june| june|
|data from july| july|
+--------------+-----+
>>> df.select('month').show()
+--------------+
|         month|
+--------------+
|data from june|
|data from july|
+--------------+
{code}
Here it returns the value of the value column instead of the month partition.
h3. Workaround
If I convert the dataframe to an RDD and back to a DataFrame I get the following result (which is the expected behaviour):
{code:none}
>>> df.rdd.toDF().select('month').show()
+-----+
|month|
+-----+
| june|
| july|
+-----+
{code}
> Dataframe operations on a partitioned dataset (using partition discovery) return invalid results
> ------------------------------------------------------------------------------------------------
>
> Key: SPARK-14343
> URL: https://issues.apache.org/jira/browse/SPARK-14343
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.4
> Reporter: Jurriaan Pruis