Posted to dev@spark.apache.org by guoxu1231 <gu...@gmail.com> on 2014/12/30 05:04:09 UTC

Help, pyspark.sql.List flatMap results become tuple

Hi pyspark guys, 

I have a JSON file, and its structure is like below:

{"NAME":"George", "AGE":35, "ADD_ID":1212, "POSTAL_AREA":1,
"TIME_ZONE_ID":1, "INTEREST":[{"INTEREST_NO":1, "INFO":"x"},
{"INTEREST_NO":2, "INFO":"y"}]}
{"NAME":"John", "AGE":45, "ADD_ID":1213, "POSTAL_AREA":1, "TIME_ZONE_ID":1,
"INTEREST":[{"INTEREST_NO":2, "INFO":"x"}, {"INTEREST_NO":3, "INFO":"y"}]}

I'm using the Spark SQL API to manipulate the JSON data in the pyspark shell,

sqlContext = SQLContext(sc)
A400 = sqlContext.jsonFile('json_file_path')
Row(ADD_ID=1212, AGE=35, INTEREST=[Row(INFO=u'x', INTEREST_NO=1),
Row(INFO=u'y', INTEREST_NO=2)], NAME=u'George', POSTAL_AREA=1,
TIME_ZONE_ID=1)
Row(ADD_ID=1213, AGE=45, INTEREST=[Row(INFO=u'x', INTEREST_NO=2),
Row(INFO=u'y', INTEREST_NO=3)], NAME=u'John', POSTAL_AREA=1,
TIME_ZONE_ID=1)
X = A400.flatMap(lambda i: i.INTEREST)
The flatMap results are like below: each element of the JSON array was
flattened to a plain tuple, not the pyspark.sql.Row I expected. I can only
access the flattened results by index, but they are supposed to be flattened
to Row (a namedtuple) and support access by name.
(u'x', 1)
(u'y', 2)
(u'x', 2)
(u'y', 3)

My Spark version is 1.1.
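
A possible workaround on 1.1 (just a sketch; it assumes the nested field order
INFO, INTEREST_NO shown in the output above) would be to re-attach the field
names after the flatMap:

from pyspark.sql import Row

# Rebuild named Rows from the plain tuples; index access works for both
# Row and tuple, and the field order is assumed from the output above.
X = A400.flatMap(lambda i: i.INTEREST) \
        .map(lambda t: Row(INFO=t[0], INTEREST_NO=t[1]))
X.first()
# Row(INFO=u'x', INTEREST_NO=1)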

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Help-pyspark-sql-List-flatMap-results-become-tuple-tp9961.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: Help, pyspark.sql.List flatMap results become tuple

Posted by guoxu1231 <gu...@gmail.com>.
The named tuple degenerates to a plain tuple:
A400.map(lambda i: map(None, i.INTEREST))
===============================
[(u'x', 1), (u'y', 2)]
[(u'x', 2), (u'y', 3)]
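
(For reference: with a single sequence, Python 2's map(None, seq) is just an
identity map, so the tuples above show the actual element type. The line is
roughly equivalent to this sketch:)

A400.map(lambda i: list(i.INTEREST)).collect()
# on 1.1, per the output above: [[(u'x', 1), (u'y', 2)], [(u'x', 2), (u'y', 3)]]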





Re: Help, pyspark.sql.List flatMap results become tuple

Posted by guoxu1231 <gu...@gmail.com>.
Thanks Davies, it works in 1.2. 





Re: Help, pyspark.sql.List flatMap results become tuple

Posted by Davies Liu <da...@databricks.com>.
This should be fixed in 1.2, could you try it?
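
Something like this should confirm it once you are on 1.2 (a sketch, using the
field names from your mail; the nested elements should come back as Rows and
be addressable by name):

A400.flatMap(lambda i: i.INTEREST) \
    .map(lambda r: (r.INTEREST_NO, r.INFO)) \
    .take(2)
# expected: [(1, u'x'), (2, u'y')]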
