Posted to user@spark.apache.org by parameshr <pa...@gmail.com> on 2016/07/30 21:54:42 UTC

Dataframe and corresponding RDD return different rows (PySpark)

Hi,

I am seeing strange behavior: a dataframe and the list and map generated
downstream from its RDD equivalent seem to return different rows.
What could be going wrong? Any help is appreciated.

Below is a snippet of the code along with its output.
NOTES:

[1] samples is a dataframe with 10 rows and three columns. In the first
statement, I concatenate the first two columns into user_id.
[2] The output of the highlighted statements is shown below, and the results
differ. I would understand a different order (calling .collect() on an RDD
does not guarantee row order), but some of the rows returned are completely
different.

CODE:

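from pyspark.sql import functions as func  # assumed: this is what the func alias refers to

tmp = samples.select(
    func.concat(func.col("post_visid_low"), func.lit("-"),
                func.col("post_visid_high")).alias("user_id"),
    "post_page_url")
print("tmp show:")
tmp.show(10, False)  # highlighted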

# term freq computation
vocab = (tmp.select("post_page_url")
            .groupBy("post_page_url")
            .count()
            .rdd.collectAsMap())
for k, v in vocab.items():  # highlighted
    print(k, v)

# group by user_ids
user_id_urls = tmp.rdd.reduceByKey(lambda x, y: x + "," + y)
num_users = user_id_urls.count()
print("user_id_urls:")
user_id_urls.collect()  # highlighted
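
For comparison, here is a plain-Python sketch (no Spark involved) of what the two computations above would return if both were evaluated on exactly the 10 rows shown in the tmp output below. The helper reduce_by_key and the names rows, expected_vocab and expected_user_id_urls are hypothetical, made up for this illustration. The sketch only mimics the semantics of groupBy().count() and reduceByKey() on a small in-memory list, and it combines values in list order, which Spark does not guarantee.

```python
from collections import Counter

# The ten (user_id, post_page_url) rows, copied from the tmp.show() output.
rows = [
    ("6917530152391623611-2707424459370863148", "http://www.backcountry.com/Store/catalog/shopAllBrands.jsp"),
    ("6917530609264617841-2788188800375174579", "http://www.backcountry.com/Store/catalog/shopAllBrands.jsp"),
    ("6917530818644021208-2821777435347267515", "http://www.backcountry.com"),
    ("6917530818644021208-2821777435347267515", "http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets"),
    ("6917530818644021208-2821777435347267515", "http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets"),
    ("6917530818644021208-2821777435347267515", "http://www.backcountry.com/dakine-washburn-jacket-mens"),
    ("1657310128-1262694438", "http://www.backcountry.com/santa-cruz-bicycles-5010-2.0-carbon-r-complete-mountain-bike-2016"),
    ("4611687717086954899-2907911088913069555", "http://www.backcountry.com/ugg-bixbee-bootie-toddler-infant-boys"),
    ("2023386797-562458996", "http://www.backcountry.com"),
    ("6917530783747871522-2923626095076314968", "http://www.backcountry.com/pikolinos-verona-boot-womens"),
]


def reduce_by_key(pairs, func):
    """Hypothetical in-memory stand-in for RDD.reduceByKey (list order only)."""
    grouped = {}
    for key, value in pairs:
        grouped[key] = func(grouped[key], value) if key in grouped else value
    return grouped


# Stand-in for groupBy("post_page_url").count().rdd.collectAsMap()
expected_vocab = dict(Counter(url for _, url in rows))

# Stand-in for tmp.rdd.reduceByKey(lambda x, y: x + "," + y)
expected_user_id_urls = reduce_by_key(rows, lambda x, y: x + "," + y)

for url, count in sorted(expected_vocab.items()):
    print(url, count)
print(len(expected_user_id_urls), "distinct user_ids")
```

On these rows the sketch yields 7 distinct URLs and 7 distinct user_ids. That agrees with the user_id_urls output below, while none of the URLs in the vocab map output appear among these rows.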

OUTPUT:


tmp dataframe show():
+---------------------------------------+--------------------------------------------------------------------------------------------+
|user_id                                |post_page_url                                                                               |
+---------------------------------------+--------------------------------------------------------------------------------------------+
|6917530152391623611-2707424459370863148|http://www.backcountry.com/Store/catalog/shopAllBrands.jsp                                  |
|6917530609264617841-2788188800375174579|http://www.backcountry.com/Store/catalog/shopAllBrands.jsp                                  |
|6917530818644021208-2821777435347267515|http://www.backcountry.com                                                                  |
|6917530818644021208-2821777435347267515|http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets                              |
|6917530818644021208-2821777435347267515|http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets                              |
|6917530818644021208-2821777435347267515|http://www.backcountry.com/dakine-washburn-jacket-mens                                      |
|1657310128-1262694438                  |http://www.backcountry.com/santa-cruz-bicycles-5010-2.0-carbon-r-complete-mountain-bike-2016|
|4611687717086954899-2907911088913069555|http://www.backcountry.com/ugg-bixbee-bootie-toddler-infant-boys                            |
|2023386797-562458996                   |http://www.backcountry.com                                                                  |
|6917530783747871522-2923626095076314968|http://www.backcountry.com/pikolinos-verona-boot-womens                                     |
+---------------------------------------+--------------------------------------------------------------------------------------------+

vocab map:
http://www.backcountry.com/boys-jackets 2
http://www.backcountry.com/dakine-titan-mittens 1
https://www.backcountry.com/Store/account/account.jsp 1
http://www.backcountry.com/ski-clothing 1
http://www.backcountry.com/the-north-face-runners-1-etip-glove 1
http://www.backcountry.com/patagonia 1
http://www.backcountry.com/burton-boys-clothing 1
http://www.backcountry.com/mens-shorts 1
https://www.backcountry.com/Store/account/login.jsp 1

user_id_urls rdd:
[(u'4611687717086954899-2907911088913069555',
  u'http://www.backcountry.com/ugg-bixbee-bootie-toddler-infant-boys'),
 (u'2023386797-562458996', u'http://www.backcountry.com'),
 (u'6917530783747871522-2923626095076314968',
  u'http://www.backcountry.com/pikolinos-verona-boot-womens'),
 (u'6917530818644021208-2821777435347267515',
  u'http://www.backcountry.com,http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets,http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets,http://www.backcountry.com/dakine-washburn-jacket-mens'),
 (u'6917530152391623611-2707424459370863148',
  u'http://www.backcountry.com/Store/catalog/shopAllBrands.jsp'),
 (u'6917530609264617841-2788188800375174579',
  u'http://www.backcountry.com/Store/catalog/shopAllBrands.jsp'),
 (u'1657310128-1262694438',
  u'http://www.backcountry.com/santa-cruz-bicycles-5010-2.0-carbon-r-complete-mountain-bike-2016')]


Thanks,
Params


