You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by freedafeng <fr...@yahoo.com> on 2014/09/10 02:27:32 UTC
how to run python examples in spark 1.1?
I'm mostly interested in the hbase examples in the repo. I saw two examples
hbase_inputformat.py and hbase_outputformat.py in the 1.1 branch. Can you
show me how to run them?
Compile step is done. I tried to run the examples, but failed.
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-run-python-examples-in-spark-1-1-tp13841.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org
Re: how to run python examples in spark 1.1?
Posted by freedafeng <fr...@yahoo.com>.
Just want to provide more information on how I ran the examples.
Environment: Cloudera quick start Vm 5.1.0 (HBase 0.98.1 installed). I
created a table called 'data1', and 'put' two records in it. I can see the
table and data are fine in hbase shell.
I cloned spark repo and checked out to 1.1 branch, built it by running
sbt/sbt assembly/assembly
sbt/sbt examples/assembly.
The script is basically,
if __name__ == "__main__":
conf = SparkConf().setAppName('testspark_similar_users_v2')
sc = SparkContext(conf=conf, batchSize=512)
conf2 = {"hbase.zookeeper.quorum": "localhost",
"hbase.mapreduce.inputtable": 'data1'}
hbase_rdd = sc.newAPIHadoopRDD(
"org.apache.hadoop.hbase.mapreduce.TableInputFormat",
"org.apache.hadoop.hbase.io.ImmutableBytesWritable",
"org.apache.hadoop.hbase.client.Result",
keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
valueConverter="org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter",
conf=conf2)
output = hbase_rdd.collect()
for (k, v) in output:
print (k, v)
sc.stop()
The error message is,
14/09/10 09:41:52 INFO ZooKeeper: Initiating client connection,
connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection
14/09/10 09:41:52 INFO RecoverableZooKeeper: The identifier of this process
is 25963@quickstart.cloudera
14/09/10 09:41:52 INFO ClientCnxn: Opening socket connection to server
quickstart.cloudera/127.0.0.1:2181. Will not attempt to authenticate using
SASL (unknown error)
14/09/10 09:41:52 INFO ClientCnxn: Socket connection established to
quickstart.cloudera/127.0.0.1:2181, initiating session
14/09/10 09:41:52 INFO ClientCnxn: Session establishment complete on server
quickstart.cloudera/127.0.0.1:2181, sessionid = 0x1485b365c450016,
negotiated timeout = 40000
14/09/10 09:52:32 ERROR TableInputFormat:
org.apache.hadoop.hbase.client.NoServerForRegionException: Unable to find
region for data1,,99999999999999 after 10 tries.
at
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:980)
at
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:885)
at
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:987)
at
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:889)
at
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:846)
at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:234)
at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:174)
at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:133)
at
org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:96)
at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:90)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.rdd.RDD.take(RDD.scala:1060)
at org.apache.spark.rdd.RDD.first(RDD.scala:1091)
at
org.apache.spark.api.python.SerDeUtil$.pairRDDToPython(SerDeUtil.scala:70)
at
org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDD(PythonRDD.scala:441)
at org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Traceback (most recent call last):
File "/home/cloudera/workspace/recom_env_testing/bin/sparkhbasecheck.py",
line 71, in <module>
conf=conf2)
File "/home/cloudera/workspace/spark/python/pyspark/context.py", line 471,
in newAPIHadoopRDD
jconf, batchSize)
File
"/usr/lib/python2.6/site-packages/py4j-0.8.2.1-py2.6.egg/py4j/java_gateway.py",
line 538, in __call__
self.target_id, self.name)
File
"/usr/lib/python2.6/site-packages/py4j-0.8.2.1-py2.6.egg/py4j/protocol.py",
line 300, in get_return_value
format(target_id, '.', name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.io.IOException: No table was provided.
at
org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:152)
at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:94)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-run-python-examples-in-spark-1-1-tp13841p13905.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org