Posted to issues@spark.apache.org by "Michael Armbrust (JIRA)" <ji...@apache.org> on 2014/11/14 20:59:33 UTC
[jira] [Updated] (SPARK-4395) Running a Spark SQL SELECT command from PySpark causes a hang for ~ 1 hour
[ https://issues.apache.org/jira/browse/SPARK-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Armbrust updated SPARK-4395:
------------------------------------
Description:
When I run the following command, it hangs for anywhere from one to several hours and then finally returns successful results:
>>> sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect()
Note, the lab environment below is still active, so let me know if you'd like to just access it directly.
+++ My Environment +++
- 1-node cluster in Amazon
- RedHat 6.5 64-bit
- java version "1.7.0_67"
- SBT version: sbt-0.13.5
- Scala version: scala-2.11.2
Ran:
sudo yum -y update
git clone https://github.com/apache/spark
sudo sbt assembly
+++ Data file used +++
http://blueplastic.com/databricks/movielens/ratings.dat
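For reference, each line of ratings.dat uses the MovieLens UserID::MovieID::Rating::Timestamp layout, so the regex used below can be sanity-checked outside of Spark. This standalone snippet is an editor's sketch, not part of the original report; the sample line is reconstructed from the first Row printed in the session below.
{code}
import re

# MovieLens 1M ratings.dat layout: UserID::MovieID::Rating::Timestamp
RATINGS_PATTERN = r'^(\d+)::(\d+)::(\d+)::(\d+)'

# Sample line reconstructed from the first Row shown in the session below.
sample = "1::1193::5::978300760"

match = re.search(RATINGS_PATTERN, sample)
print(match.groups())  # ('1', '1193', '5', '978300760')
{code}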
{code}
>>> import re
>>> import string
>>> from pyspark.sql import SQLContext, Row
>>> sqlContext = SQLContext(sc)
>>> RATINGS_PATTERN = r'^(\d+)::(\d+)::(\d+)::(\d+)'
>>>
>>> def parse_ratings_line(line):
...     match = re.search(RATINGS_PATTERN, line)
...     if match is None:
...         # Optionally, change this to skip the line if individual records are not critical.
...         raise ValueError("Invalid ratings line: %s" % line)
...     return Row(
...         UserID = int(match.group(1)),
...         MovieID = int(match.group(2)),
...         Rating = int(match.group(3)),
...         Timestamp = int(match.group(4)))
...
>>> ratings_base_RDD = (sc.textFile("file:///home/ec2-user/movielens/ratings.dat")
...                     # Call the parse_ratings_line function on each line.
...                     .map(parse_ratings_line)
...                     # Cache the parsed rows in memory since they will be queried multiple times.
...                     .cache())
>>> ratings_base_RDD.count()
1000209
>>> ratings_base_RDD.first()
Row(MovieID=1193, Rating=5, Timestamp=978300760, UserID=1)
>>> schemaRatings = sqlContext.inferSchema(ratings_base_RDD)
>>> schemaRatings.registerTempTable("RatingsTable")
>>> sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect()
{code}
(Now the Python shell hangs...)
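As a possible way to narrow this down (editor's note, not from the original report), the SQL path can be bypassed by pulling a few parsed rows straight from the cached RDD. If the snippet below returns promptly, the delay is likely specific to the sqlContext.sql(...).collect() path rather than to reading or parsing the file.
{code}
# Hypothetical diagnostic, assuming the same PySpark session as above:
# fetch five parsed rows directly from the cached RDD, without Spark SQL.
rows = ratings_base_RDD.take(5)
for r in rows:
    print(r)
{code}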
> Running a Spark SQL SELECT command from PySpark causes a hang for ~ 1 hour
> --------------------------------------------------------------------------
>
> Key: SPARK-4395
> URL: https://issues.apache.org/jira/browse/SPARK-4395
> Project: Spark
> Issue Type: Bug
> Affects Versions: 1.2.0
> Environment: version 1.2.0-SNAPSHOT
> Reporter: Sameer Farooqui
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org