Posted to user@spark.apache.org by unk1102 <um...@gmail.com> on 2015/08/10 20:22:24 UTC

How to use custom Hadoop InputFormat in DataFrame?

Hi, I have my own custom Hadoop InputFormat which I want to use with a DataFrame.
How do we do that? I know I can use sc.hadoopFile(..), but then how do I
convert the result into a DataFrame?

JavaPairRDD<Void, MyRecordWritable> myFormatAsPairRdd =
    jsc.hadoopFile("hdfs://tmp/data/myformat.xyz", MyInputFormat.class, Void.class, MyRecordWritable.class);
JavaRDD<MyRecordWritable> myformatRdd = myFormatAsPairRdd.values();
DataFrame myFormatAsDataframe = sqlContext.createDataFrame(myformatRdd, ??);

In the above code, what should I put in place of ?? I tried
MyRecordWritable.class, but it does not work because it is a record Writable,
not a schema. Please guide.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-custom-Hadoop-InputFormat-in-DataFrame-tp24198.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: How to use custom Hadoop InputFormat in DataFrame?

Posted by Umesh Kacha <um...@gmail.com>.
Hi Michael, thanks for the reply. I know that I can create a DataFrame using
a JavaBean or a StructType. I want to know how I can create a DataFrame from
the above code, which reads a custom Hadoop format.

On Tue, Aug 11, 2015 at 12:04 AM, Michael Armbrust <mi...@databricks.com>
wrote:

> You can't create a DataFrame from an arbitrary object since we don't know
> how to figure out the schema.  You can either create a JavaBean
> <https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection>
> or manually create a row + specify the schema
> <https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema>.

Re: How to use custom Hadoop InputFormat in DataFrame?

Posted by Michael Armbrust <mi...@databricks.com>.
You can't create a DataFrame from an arbitrary object since we don't know
how to figure out the schema.  You can either create a JavaBean
<https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection>
or manually create a row + specify the schema
<https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema>.
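
A minimal sketch of the second option (row + schema), assuming the Spark 1.x
Java API in use at the time of this thread (DataFrame, SQLContext). The fields
of MyRecordWritable are not given in the thread, so the getId() and getName()
accessors, the column names, and the stand-in class itself are hypothetical;
replace them with whatever the real record exposes.

```java
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class MyFormatToDataFrame {

    // Hypothetical stand-in for the poster's MyRecordWritable. The real class
    // would implement org.apache.hadoop.io.Writable and have its own fields.
    public static class MyRecordWritable {
        private final long id;
        private final String name;

        public MyRecordWritable(long id, String name) {
            this.id = id;
            this.name = name;
        }

        public long getId() { return id; }
        public String getName() { return name; }
    }

    // Schema for the columns copied out of each record (assumed fields).
    public static final StructType SCHEMA = DataTypes.createStructType(Arrays.asList(
        DataTypes.createStructField("id", DataTypes.LongType, false),
        DataTypes.createStructField("name", DataTypes.StringType, true)));

    // Copies the fields of one record into a generic Row, in SCHEMA order.
    public static class ToRow implements Function<MyRecordWritable, Row> {
        @Override
        public Row call(MyRecordWritable record) {
            return RowFactory.create(record.getId(), record.getName());
        }
    }

    // Takes the RDD of values returned by hadoopFile(...).values() and
    // pairs the mapped rows with the explicit schema.
    public static DataFrame toDataFrame(SQLContext sqlContext,
                                        JavaRDD<MyRecordWritable> records) {
        JavaRDD<Row> rows = records.map(new ToRow());
        return sqlContext.createDataFrame(rows, SCHEMA);
    }
}
```

With this in place, the ?? in the original snippet goes away: the call becomes
MyFormatToDataFrame.toDataFrame(sqlContext, myformatRdd). Mapping to Row right
after reading also copies the values out of the Writable, which matters because
Hadoop record readers may reuse the same Writable instance for every record.
The JavaBean route is the alternative: map each record to a plain bean class
and pass that bean's class as the second argument to createDataFrame.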


