Posted to issues@spark.apache.org by "Jurriaan Pruis (JIRA)" <ji...@apache.org> on 2016/04/04 08:59:25 UTC
[jira] [Updated] (SPARK-14343) Dataframe operations on a partitioned dataset (using partition discovery) return invalid results

     [ https://issues.apache.org/jira/browse/SPARK-14343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jurriaan Pruis updated SPARK-14343:
-----------------------------------
    Description: 
When reading a partitioned dataset using {{sqlContext.read.text()}}, queries on the partition column return invalid results.

h2. How to reproduce:

h3. Generate datasets
{code:title=repro.sh}
#!/bin/sh

mkdir -p dataset/year=2014
mkdir -p dataset/year=2015

echo "data from 2014" > dataset/year=2014/part01.txt
echo "data from 2015" > dataset/year=2015/part01.txt
{code}

{code:title=repro2.sh}
#!/bin/sh

mkdir -p dataset2/month=june
mkdir -p dataset2/month=july

echo "data from june" > dataset2/month=june/part01.txt
echo "data from july" > dataset2/month=july/part01.txt
{code}
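
The same directory layout can also be produced from Python, which may be handier when reproducing this from a PySpark session. This is only a sketch, assuming Python 3; the helper name {{make_partitioned_dataset}} is hypothetical and is not part of Spark.

```python
import os


def make_partitioned_dataset(root, column, values):
    """Create root/<column>=<value>/part01.txt for each value.

    Each file holds a single line, "data from <value>", matching the
    output of the echo commands in the shell scripts above.
    """
    paths = []
    for value in values:
        part_dir = os.path.join(root, "{}={}".format(column, value))
        os.makedirs(part_dir, exist_ok=True)
        path = os.path.join(part_dir, "part01.txt")
        with open(path, "w") as handle:
            handle.write("data from {}\n".format(value))
        paths.append(path)
    return paths
```

Calling {{make_partitioned_dataset('dataset', 'year', [2014, 2015])}} and {{make_partitioned_dataset('dataset2', 'month', ['june', 'july'])}} should give the same two trees as repro.sh and repro2.sh.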

h3. Using the first dataset
{code:none}
>>> df = sqlContext.read.text('dataset')
...
>>> df
DataFrame[value: string, year: int]
>>> df.show()
+--------------+----+
|         value|year|
+--------------+----+
|data from 2014|2014|
|data from 2015|2015|
+--------------+----+
>>> df.select('year').show()
+----+
|year|
+----+
|  14|
|  14|
+----+
{code}

This is clearly wrong: both rows show 14 instead of 2014 and 2015. Since "data from 2014" is 14 characters long, it seems like it returns the length of the {{value}} column?

h3. Using the second dataset

With another dataset it looks like this:
{code:none}
>>> df = sqlContext.read.text('dataset2')
>>> df
DataFrame[value: string, month: string]
>>> df.show()
+--------------+-----+
|         value|month|
+--------------+-----+
|data from june| june|
|data from july| july|
+--------------+-----+
>>> df.select('month').show()
+--------------+
|         month|
+--------------+
|data from june|
|data from july|
+--------------+
{code}

Here it returns the contents of the {{value}} column instead of the {{month}} partition value.
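
For reference, the values I would expect in the partition column are the ones encoded in the directory names. The sketch below derives them from a file path alone. It is a simplified illustration of the idea behind partition discovery and not Spark's actual implementation; the function name {{parse_partition_values}} is hypothetical, and the type inference is reduced to "int if it parses as an int, otherwise string".

```python
def parse_partition_values(path):
    """Return {column: value} for every 'column=value' directory in path."""
    values = {}
    # Drop the last segment, which is the file name rather than a directory.
    for segment in path.split("/")[:-1]:
        if "=" not in segment:
            continue
        column, raw = segment.split("=", 1)
        # Simplified inference: integers become int, anything else stays a string.
        try:
            values[column] = int(raw)
        except ValueError:
            values[column] = raw
    return values
```

For {{dataset/year=2014/part01.txt}} this gives {{{'year': 2014}}}, and for {{dataset2/month=june/part01.txt}} it gives {{{'month': 'june'}}}, which is what {{df.select('year')}} and {{df.select('month')}} should contain.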

h3. Workaround

When I convert the DataFrame to an RDD and back to a DataFrame, I get the following result (which is the expected behaviour):
{code:none}
>>> df.rdd.toDF().select('month').show()
+-----+
|month|
+-----+
| june|
| july|
+-----+
{code}

  was:
When reading a dataset using {{sqlContext.read.text()}} queries on the partitioned column return invalid results.

h2. How to reproduce:

h3. Generate datasets
{code:title=repro.sh}
#!/bin/sh

mkdir -p dataset/year=2014
mkdir -p dataset/year=2015

echo "data from 2014" > dataset/year=2014/part01.txt
echo "data from 2015" > dataset/year=2015/part01.txt
{code}

{code:title=repro2.sh}
#!/bin/sh

mkdir -p dataset2/month=june
mkdir -p dataset2/month=july

echo "data from june" > dataset2/month=june/part01.txt
echo "data from july" > dataset2/month=july/part01.txt
{code}

h3. using first dataset
{code:none}
>>> df = sqlContext.read.text('dataset')
...
>>> df
DataFrame[value: string, year: int]
>>> df.show()
+--------------+----+
|         value|year|
+--------------+----+
|data from 2014|2014|
|data from 2015|2015|
+--------------+----+
>>> df.select('year').show()
+----+
|year|
+----+
|  14|
|  14|
+----+
{code}

This is clearly wrong. Seems like it returns the length of the value column?

h3. using second dataset

With another dataset it looks like this:
{code:none}
>>> df = sqlContext.read.text('dataset2')
>>> df
DataFrame[value: string, month: string]
>>> df.show()
+--------------+-----+
|         value|month|
+--------------+-----+
|data from june| june|
|data from july| july|
+--------------+-----+
>>> df.select('month').show()
+--------------+
|         month|
+--------------+
|data from june|
|data from july|
+--------------+
{code}

Here it returns the value of the value column instead of the month partition.

h3. Workaround

If I convert the dataframe to an RDD and back to a DataFrame I get the following result (which is the expected behaviour):
{code:none}
>>> df.rdd.toDF().select('month').show()
+-----+
|month|
+-----+
| june|
| july|
+-----+
{code}


> Dataframe operations on a partitioned dataset (using partition discovery) return invalid results
> ------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-14343
>                 URL: https://issues.apache.org/jira/browse/SPARK-14343
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1
>         Environment: Mac OS X 10.11.4
>            Reporter: Jurriaan Pruis
>


