Posted to users@zeppelin.apache.org by Oren Shani <os...@iucc.ac.il> on 2016/04/11 15:11:00 UTC

pyspark and elasticsearch - accessing ES fields

Hi All,

I connected pyspark under Zeppelin to my Elasticsearch DB and I am able to do this:

%pyspark
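# read the index as a Hadoop RDD: one element per Elasticsearch hit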
es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf={"es.resource": "logstash-uni-*"})

es_rdd.toDF().registerTempTable("elk")

and then

%sql select * from elk

And then what I get is a table with just two columns. One is some object ID, I guess, and the other is a single string with a mapping of all the fields in the ES record to their values ( " Map(@timestamp -> 2016-03-16T14:31:12.861Z, host -> ..." ).
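If I understand correctly, that is just what toDF() does with an RDD of (key, value) pairs: it builds a generic two-column table (_1 and _2), so the whole ES document lands in the second column. Inspecting a single pair shows the same shape:

%pyspark
# each element is a (doc id, field map) pair; toDF() turns that
# into the generic _1/_2 columns seen in the %sql output
print(es_rdd.first())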

My question is: how do I create a Spark table, or even just a Python object (probably a dict), that will let me access each field separately?
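I am wondering whether something like the following is the right direction (an untested sketch based on the es-hadoop docs: the es.output.json setting should make the connector return each hit as a raw JSON string, which Spark can then infer a schema from):

%pyspark
import json

es_json_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.apache.hadoop.io.Text",
    conf={"es.resource": "logstash-uni-*",
          "es.output.json": "true"})  # each document comes back as a JSON string

# drop the doc ids, parse the JSON, and let Spark infer one column per field
docs_df = sqlContext.read.json(es_json_rdd.map(lambda kv: kv[1]))
docs_df.registerTempTable("elk")

# or, for a plain Python dict for a single document:
first_doc = json.loads(es_json_rdd.first()[1])

Or is the native org.elasticsearch.spark.sql data source (sqlContext.read.format("org.elasticsearch.spark.sql").load("logstash-uni-*")) the better way to get one column per field?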

Thanks,

Oren