Posted to user@spark.apache.org by Mohanraj Ragupathiraj <mo...@gmail.com> on 2016/05/18 04:04:48 UTC

Load Table as DataFrame

I have created a DataFrame from an HBase table (via Phoenix) that has 500
million rows. From the DataFrame I created a pair RDD of JavaBeans and use it
for joining with data from a file.

// Phoenix connector options: target table and ZooKeeper quorum URL
Map<String, String> phoenixInfoMap = new HashMap<String, String>();
phoenixInfoMap.put("table", tableName);
phoenixInfoMap.put("zkUrl", zkURL);

// Load the Phoenix table as a DataFrame
DataFrame df = sqlContext.read()
        .format("org.apache.phoenix.spark")
        .options(phoenixInfoMap)
        .load();

// Convert to a pair RDD keyed by ID
JavaRDD<Row> tableRows = df.toJavaRDD();
JavaPairRDD<String, String> dbData = tableRows.mapToPair(
        new PairFunction<Row, String, String>()
        {
            @Override
            public Tuple2<String, String> call(Row row) throws Exception
            {
                return new Tuple2<String, String>(row.getAs("ID"), row.getAs("NAME"));
            }
        });
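
The join step itself is roughly along these lines (only a sketch; "fileData"
stands for the file contents already loaded as a pair RDD keyed by the same
ID, which is an assumption on my side):

// fileData: keys and values read from the file, keyed by the same ID
JavaPairRDD<String, String> fileData = ...;
// Inner join on the ID key; only keys present on both sides survive
JavaPairRDD<String, Tuple2<String, String>> joined = dbData.join(fileData);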

Now my question: let's say the file has 2 million unique entries matching
the table. Will the entire table be loaded into memory as an RDD, or will
only the 2 million matching records from the table be loaded into memory as
an RDD?
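
For what it is worth, one way I thought of probing this (only a sketch; the
literal key value here is made up) is to filter the DataFrame on a single key
and print the physical plan, to see whether the predicate is pushed down to
Phoenix rather than triggering a full table scan:

// Filter on one key and inspect the plan; a filter pushed into the
// Phoenix relation would suggest only matching rows are scanned.
DataFrame oneKey = df.filter(df.col("ID").equalTo("ACC-000001"));
oneKey.explain(true);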


http://stackoverflow.com/questions/37289849/phoenix-spark-load-table-as-dataframe

-- 
Thanks and Regards
Mohan