Posted to user@spark.apache.org by Joe L <se...@yahoo.com> on 2014/04/18 13:58:22 UTC
how to split one big RDD (about 100G) into several small ones?
I want to split a single big RDD into several smaller RDDs without reading too much
from disk (HDFS). What is the best way to do that?
This is my current code:
# extract (subject, object) pairs for each schema-level predicate
subclass_pairs = schema_triples.filter(lambda (s, p, o): p == PROPERTIES['subClassOf']) \
                               .map(lambda (s, p, o): (s, o))
subproperty_pairs = schema_triples.filter(lambda (s, p, o): p == PROPERTIES['subPropertyOf']) \
                                  .map(lambda (s, p, o): (s, o)).cache()
domain_pairs = schema_triples.filter(lambda (s, p, o): p == PROPERTIES['domain']) \
                             .map(lambda (s, p, o): (s, o))
range_pairs = schema_triples.filter(lambda (s, p, o): p == PROPERTIES['range']) \
                            .map(lambda (s, p, o): (s, o))

# combine instance and schema triples, then split out the rdf:type pairs
# and the user-defined triples (the identity map after the second filter
# was a no-op and has been dropped)
total_triples = instance_triples.union(schema_triples)
type_pairs = total_triples.filter(lambda (s, p, o): p == PROPERTIES['type']) \
                          .map(lambda (s, p, o): (s, o)).distinct().cache()
triples = total_triples.filter(lambda (s, p, o): isUserDefined(p)).distinct().cache()
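As written, each filter above is an independent lineage back to the source, so
every action on a derived RDD can re-scan the HDFS files. A minimal sketch of one
way to avoid that, assuming schema_triples and instance_triples are read directly
from HDFS: cache the parent RDDs once, so the filters reuse the in-memory copy.
The pairs_for helper is a hypothetical name introduced here just to cut repetition.

# sketch (assumption: schema_triples / instance_triples come straight from HDFS);
# caching the parents means each derived filter scans memory, not the files
schema_triples = schema_triples.cache()
total_triples = instance_triples.union(schema_triples).cache()

def pairs_for(prop):
    # hypothetical helper: (subject, object) pairs for one schema predicate
    return schema_triples.filter(lambda (s, p, o): p == prop) \
                         .map(lambda (s, p, o): (s, o))

subclass_pairs = pairs_for(PROPERTIES['subClassOf'])
subproperty_pairs = pairs_for(PROPERTIES['subPropertyOf']).cache()

If the cached data does not fit in memory, persist(StorageLevel.MEMORY_AND_DISK)
spills partitions to local disk instead of recomputing them, which is still
cheaper than re-reading 100G from HDFS for every derived RDD.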
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-split-one-big-RDD-about-100G-into-several-small-ones-tp4450.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.