Posted to user@spark.apache.org by freedafeng <fr...@yahoo.com> on 2014/09/10 02:27:32 UTC

how to run python examples in spark 1.1?

I'm mostly interested in the hbase examples in the repo. I saw two examples,
hbase_inputformat.py and hbase_outputformat.py, in the 1.1 branch. Can you
show me how to run them?

The compile step is done. I tried to run the examples, but they failed.
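
For reference, the invocation I tried looks roughly like this (the assembly jar
path and the host/table arguments are placeholders from my local setup, so
treat it as a sketch rather than the definitive way to launch the example):

./bin/spark-submit --driver-class-path /path/to/spark-examples-assembly.jar \
    examples/src/main/python/hbase_inputformat.py <host> <table>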




Re: how to run python examples in spark 1.1?

Posted by freedafeng <fr...@yahoo.com>.
Just to provide more information on how I ran the examples.

Environment: Cloudera QuickStart VM 5.1.0 (HBase 0.98.1 installed). I created
a table called 'data1' and 'put' two records into it. The table and data look
fine in the hbase shell.
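
The table was created with hbase shell commands roughly of this shape (the
column family and values below are placeholders, not my exact data):

create 'data1', 'f1'
put 'data1', 'row1', 'f1:a', 'value1'
put 'data1', 'row2', 'f1:a', 'value2'
scan 'data1'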

I cloned the Spark repo, checked out the 1.1 branch, and built it by running
sbt/sbt assembly/assembly
sbt/sbt examples/assembly

The script is basically:

from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    conf = SparkConf().setAppName('testspark_similar_users_v2')
    sc = SparkContext(conf=conf, batchSize=512)

    # HBase connection settings: ZooKeeper quorum and the table to read.
    conf2 = {"hbase.zookeeper.quorum": "localhost",
             "hbase.mapreduce.inputtable": 'data1'}

    # Read the table through the HBase TableInputFormat, using the converters
    # shipped with the Spark examples to turn keys and values into strings.
    hbase_rdd = sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
        valueConverter="org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter",
        conf=conf2)

    output = hbase_rdd.collect()
    for (k, v) in output:
        print (k, v)

    sc.stop()
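
(I launch the script with spark-submit, roughly as below; the examples
assembly jar path is a placeholder for whatever my build produced, since the
python converters used above live in that jar:)

./bin/spark-submit --driver-class-path /path/to/spark-examples-assembly.jar \
    /home/cloudera/workspace/recom_env_testing/bin/sparkhbasecheck.py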

The error message is:

14/09/10 09:41:52 INFO ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection
14/09/10 09:41:52 INFO RecoverableZooKeeper: The identifier of this process is 25963@quickstart.cloudera
14/09/10 09:41:52 INFO ClientCnxn: Opening socket connection to server quickstart.cloudera/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
14/09/10 09:41:52 INFO ClientCnxn: Socket connection established to quickstart.cloudera/127.0.0.1:2181, initiating session
14/09/10 09:41:52 INFO ClientCnxn: Session establishment complete on server quickstart.cloudera/127.0.0.1:2181, sessionid = 0x1485b365c450016, negotiated timeout = 40000
14/09/10 09:52:32 ERROR TableInputFormat: org.apache.hadoop.hbase.client.NoServerForRegionException: Unable to find region for data1,,99999999999999 after 10 tries.
	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:980)
	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:885)
	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:987)
	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:889)
	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:846)
	at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:234)
	at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:174)
	at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:133)
	at org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:96)
	at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:90)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
	at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
	at org.apache.spark.rdd.RDD.take(RDD.scala:1060)
	at org.apache.spark.rdd.RDD.first(RDD.scala:1091)
	at org.apache.spark.api.python.SerDeUtil$.pairRDDToPython(SerDeUtil.scala:70)
	at org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDD(PythonRDD.scala:441)
	at org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
	at py4j.Gateway.invoke(Gateway.java:259)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:207)
	at java.lang.Thread.run(Thread.java:745)

Traceback (most recent call last):
  File "/home/cloudera/workspace/recom_env_testing/bin/sparkhbasecheck.py", line 71, in <module>
    conf=conf2)
  File "/home/cloudera/workspace/spark/python/pyspark/context.py", line 471, in newAPIHadoopRDD
    jconf, batchSize)
  File "/usr/lib/python2.6/site-packages/py4j-0.8.2.1-py2.6.egg/py4j/java_gateway.py", line 538, in __call__
    self.target_id, self.name)
  File "/usr/lib/python2.6/site-packages/py4j-0.8.2.1-py2.6.egg/py4j/protocol.py", line 300, in get_return_value
    format(target_id, '.', name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.io.IOException: No table was provided.
	at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:152)
	at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:94)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)





---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org