Posted to user@spark.apache.org by Joe L <se...@yahoo.com> on 2014/04/18 13:58:22 UTC

how to split one big RDD (about 100G) into several small ones?

I want to split a single big RDD into several small RDDs without reading too much
from disk (HDFS). What is the best way to do that?

This is my current code:

    # Each small RDD below is carved out of the same parent RDD by filtering on the predicate.
    subclass_pairs    = schema_triples.filter(lambda (s, p, o): p == PROPERTIES['subClassOf']).map(lambda (s, p, o): (s, o))
    subproperty_pairs = schema_triples.filter(lambda (s, p, o): p == PROPERTIES['subPropertyOf']).map(lambda (s, p, o): (s, o)).cache()
    domain_pairs      = schema_triples.filter(lambda (s, p, o): p == PROPERTIES['domain']).map(lambda (s, p, o): (s, o))
    range_pairs       = schema_triples.filter(lambda (s, p, o): p == PROPERTIES['range']).map(lambda (s, p, o): (s, o))
    total_triples     = instance_triples.union(schema_triples)
    type_pairs        = total_triples.filter(lambda (s, p, o): p == PROPERTIES['type']).map(lambda (s, p, o): (s, o)).distinct().cache()
    triples           = total_triples.filter(lambda (s, p, o): isUserDefined(p)).map(lambda (s, p, o): (s, p, o)).distinct().cache()
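
Would something like the following make sense? The idea is to persist the parent RDDs once, so that every filter() below scans cached partitions instead of re-reading the 100G input from HDFS for each derived RDD. This is only a rough sketch, not my real job: the HDFS paths, the parse_triple helper and the PROPERTIES values are placeholders for whatever the actual pipeline uses.

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="split-big-rdd")

    # Placeholder predicate URIs; in the real job PROPERTIES comes from elsewhere.
    PROPERTIES = {
        'subClassOf':    'http://www.w3.org/2000/01/rdf-schema#subClassOf',
        'subPropertyOf': 'http://www.w3.org/2000/01/rdf-schema#subPropertyOf',
        'type':          'http://www.w3.org/1999/02/22-rdf-syntax-ns#type',
    }

    # Placeholder parser: split a tab-separated line into an (s, p, o) tuple.
    def parse_triple(line):
        s, p, o = line.split('\t')
        return (s, p, o)

    # Placeholder HDFS paths.
    schema_triples   = sc.textFile('hdfs:///data/schema_triples').map(parse_triple)
    instance_triples = sc.textFile('hdfs:///data/instance_triples').map(parse_triple)

    # Persist the parents once; every filter() below then reads the cached
    # partitions instead of scanning the input files again.
    schema_triples.persist(StorageLevel.MEMORY_AND_DISK)
    total_triples = instance_triples.union(schema_triples).persist(StorageLevel.MEMORY_AND_DISK)

    def pairs_for(rdd, prop):
        # Keep triples whose predicate matches, then drop the predicate.
        return rdd.filter(lambda t: t[1] == prop).map(lambda t: (t[0], t[2]))

    subclass_pairs    = pairs_for(schema_triples, PROPERTIES['subClassOf'])
    subproperty_pairs = pairs_for(schema_triples, PROPERTIES['subPropertyOf'])
    type_pairs        = pairs_for(total_triples, PROPERTIES['type']).distinct()

MEMORY_AND_DISK spills partitions that do not fit in memory to local disk, so even if the whole 100G cannot be cached, the derived RDDs would still avoid going back to HDFS.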



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-split-one-big-RDD-about-100G-into-several-small-ones-tp4450.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.