Posted to user@spark.apache.org by unk1102 <um...@gmail.com> on 2015/08/08 22:42:15 UTC

How to create DataFrame from a binary file?

Hi, how do we create a DataFrame from a binary file stored in HDFS? I was
thinking of using:

JavaPairRDD<String,PortableDataStream> pairRdd =
javaSparkContext.binaryFiles("/hdfs/path/to/binfile");
JavaRDD<PortableDataStream> javardd = pairRdd.values();

I can see that PortableDataStream has a method called toArray which can
convert it into a byte array. If I have a JavaRDD<byte[]>, can I call the
following and get a DataFrame?

DataFrame binDataFrame = sqlContext.createDataFrame(javaBinRdd,Byte.class);

Please guide me; I am new to Spark. I have my own custom binary format, and I
was thinking that if I can convert it into a DataFrame using binary
operations, then I don't need to write my own custom Hadoop input format. Am I
on the right track? Will reading binary data into a DataFrame scale?
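
In other words, something like the following minimal sketch is what I have in
mind (assuming the Spark 1.x Java API and a single BinaryType column; the
column name "content" is only illustrative):

import java.util.Collections;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.input.PortableDataStream;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Each record of binaryFiles() is (path, stream); keep only the streams.
JavaRDD<PortableDataStream> streams =
    javaSparkContext.binaryFiles("/hdfs/path/to/binfile").values();

// Materialize each stream as byte[] and wrap it in a Row.
JavaRDD<Row> rows =
    streams.map(pds -> RowFactory.create((Object) pds.toArray()));

// One BinaryType column; the name "content" is illustrative.
StructType schema = DataTypes.createStructType(Collections.singletonList(
    DataTypes.createStructField("content", DataTypes.BinaryType, false)));

DataFrame binDataFrame = sqlContext.createDataFrame(rows, schema);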



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-create-DataFrame-from-a-binary-file-tp24179.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: How to create DataFrame from a binary file?

Posted by Ted Yu <yu...@gmail.com>.
Umesh:
Please take a look at the classes under:
sql/core/src/main/scala/org/apache/spark/sql/parquet

FYI
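
Those classes implement Parquet, itself a binary format, as a Spark SQL data
source. Once such a source exists, loading the data into a DataFrame is a
one-liner. A rough sketch (assuming Spark 1.4+, where sqlContext.read() is
available; the path is illustrative):

// Parquet files are binary, but the built-in source hides that detail.
DataFrame parquetDf = sqlContext.read().parquet("/hdfs/path/to/parquet");
parquetDf.printSchema();
parquetDf.show();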

On Mon, Aug 10, 2015 at 10:35 AM, Umesh Kacha <um...@gmail.com> wrote:

> Hi Bo thanks much let me explain please see the following code
>
> JavaPairRDD<String,PortableDataStream> pairRdd =
> javaSparkContext.binaryFiles("/hdfs/path/to/binfile");
> JavaRDD<PortableDataStream> javardd = pairRdd.values();
>
> DataFrame binDataFrame = sqlContext.createDataFrame(javaBinRdd,
>  PortableDataStream.class);
> binDataFrame.show(); //shows just one row with above file path
> /hdfs/path/to/binfile
>
> I want binary data in DataFrame from above file so that I can directly do
> analytics on it. My data is binary so I cant use StructType
> with primitive data types rigth since everything is binary/byte. My custom
> data format in binary is same as Parquet I did not find any good example
> where/how parquet is read into DataFrame. Please guide.
>
>
>
>
>
> On Sun, Aug 9, 2015 at 11:52 PM, bo yang <bo...@gmail.com> wrote:
>
>> Well, my post uses raw text json file to show how to create data frame
>> with a custom data schema. The key idea is to show the flexibility to deal
>> with any format of data by using your own schema. Sorry if I did not make
>> you fully understand.
>>
>> Anyway, let us know once you figure out your problem.
>>
>>
>>
>>
>> On Sun, Aug 9, 2015 at 11:10 AM, Umesh Kacha <um...@gmail.com>
>> wrote:
>>
>>> Hi Bo I know how to create a DataFrame my question is how to create a
>>> DataFrame for binary files and in your blog it is raw text json files
>>> please read my question properly thanks.
>>>
>>> On Sun, Aug 9, 2015 at 11:21 PM, bo yang <bo...@gmail.com> wrote:
>>>
>>>> You can create your own data schema (StructType in spark), and use
>>>> following method to create data frame with your own data schema:
>>>>
>>>> sqlContext.createDataFrame(yourRDD, structType);
>>>>
>>>> I wrote a post on how to do it. You can also get the sample code there:
>>>>
>>>> Light-Weight Self-Service Data Query through Spark SQL:
>>>>
>>>> https://www.linkedin.com/pulse/light-weight-self-service-data-query-through-spark-sql-bo-yang
>>>>
>>>> Take a look and feel free to  let me know for any question.
>>>>
>>>> Best,
>>>> Bo
>>>>
>>>>
>>>>
>>>> On Sat, Aug 8, 2015 at 1:42 PM, unk1102 <um...@gmail.com> wrote:
>>>>
>>>>> Hi how do we create DataFrame from a binary file stored in HDFS? I was
>>>>> thinking to use
>>>>>
>>>>> JavaPairRDD<String,PortableDataStream> pairRdd =
>>>>> javaSparkContext.binaryFiles("/hdfs/path/to/binfile");
>>>>> JavaRDD<PortableDataStream> javardd = pairRdd.values();
>>>>>
>>>>> I can see that PortableDataStream has method called toArray which can
>>>>> convert into byte array I was thinking if I have JavaRDD<byte[]> can I
>>>>> call
>>>>> the following and get DataFrame
>>>>>
>>>>> DataFrame binDataFrame =
>>>>> sqlContext.createDataFrame(javaBinRdd,Byte.class);
>>>>>
>>>>> Please guide I am new to Spark. I have my own custom format which is
>>>>> binary
>>>>> format and I was thinking if I can convert my custom format into
>>>>> DataFrame
>>>>> using binary operations then I dont need to create my own custom Hadoop
>>>>> format am I on right track? Will reading binary data into DataFrame
>>>>> scale?
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-create-DataFrame-from-a-binary-file-tp24179.html
>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>> Nabble.com.
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>> For additional commands, e-mail: user-help@spark.apache.org
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: How to create DataFrame from a binary file?

Posted by Umesh Kacha <um...@gmail.com>.
Hi Bo, thanks very much. Let me explain; please see the following code:

JavaPairRDD<String,PortableDataStream> pairRdd =
javaSparkContext.binaryFiles("/hdfs/path/to/binfile");
JavaRDD<PortableDataStream> javardd = pairRdd.values();

DataFrame binDataFrame = sqlContext.createDataFrame(javardd,
    PortableDataStream.class);
binDataFrame.show(); // shows just one row, containing only the file path /hdfs/path/to/binfile

I want the binary data from the above file in a DataFrame so that I can do
analytics on it directly. My data is binary, so I can't use a StructType with
primitive data types, right, since everything is binary/bytes? My custom
binary data format is similar to Parquet, but I did not find any good example
of where/how Parquet is read into a DataFrame. Please guide.
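
Or would something like the following sketch work, with a BinaryType column
holding the raw bytes next to the path? (I have not verified this; the column
names are only illustrative.)

import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Keep the file path and the raw bytes together, one Row per file.
JavaRDD<Row> rows = pairRdd.map(
    tuple -> RowFactory.create(tuple._1(), (Object) tuple._2().toArray()));

// StructType is not limited to primitive types: BinaryType holds byte[].
StructType schema = DataTypes.createStructType(Arrays.asList(
    DataTypes.createStructField("path", DataTypes.StringType, false),
    DataTypes.createStructField("content", DataTypes.BinaryType, false)));

DataFrame binDataFrame = sqlContext.createDataFrame(rows, schema);
binDataFrame.printSchema(); // path: string, content: binary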





On Sun, Aug 9, 2015 at 11:52 PM, bo yang <bo...@gmail.com> wrote:

> Well, my post uses raw text json file to show how to create data frame
> with a custom data schema. The key idea is to show the flexibility to deal
> with any format of data by using your own schema. Sorry if I did not make
> you fully understand.
>
> Anyway, let us know once you figure out your problem.
>
>
>
>
> On Sun, Aug 9, 2015 at 11:10 AM, Umesh Kacha <um...@gmail.com>
> wrote:
>
>> Hi Bo I know how to create a DataFrame my question is how to create a
>> DataFrame for binary files and in your blog it is raw text json files
>> please read my question properly thanks.
>>
>> On Sun, Aug 9, 2015 at 11:21 PM, bo yang <bo...@gmail.com> wrote:
>>
>>> You can create your own data schema (StructType in spark), and use
>>> following method to create data frame with your own data schema:
>>>
>>> sqlContext.createDataFrame(yourRDD, structType);
>>>
>>> I wrote a post on how to do it. You can also get the sample code there:
>>>
>>> Light-Weight Self-Service Data Query through Spark SQL:
>>>
>>> https://www.linkedin.com/pulse/light-weight-self-service-data-query-through-spark-sql-bo-yang
>>>
>>> Take a look and feel free to  let me know for any question.
>>>
>>> Best,
>>> Bo
>>>
>>>
>>>
>>> On Sat, Aug 8, 2015 at 1:42 PM, unk1102 <um...@gmail.com> wrote:
>>>
>>>> Hi how do we create DataFrame from a binary file stored in HDFS? I was
>>>> thinking to use
>>>>
>>>> JavaPairRDD<String,PortableDataStream> pairRdd =
>>>> javaSparkContext.binaryFiles("/hdfs/path/to/binfile");
>>>> JavaRDD<PortableDataStream> javardd = pairRdd.values();
>>>>
>>>> I can see that PortableDataStream has method called toArray which can
>>>> convert into byte array I was thinking if I have JavaRDD<byte[]> can I
>>>> call
>>>> the following and get DataFrame
>>>>
>>>> DataFrame binDataFrame =
>>>> sqlContext.createDataFrame(javaBinRdd,Byte.class);
>>>>
>>>> Please guide I am new to Spark. I have my own custom format which is
>>>> binary
>>>> format and I was thinking if I can convert my custom format into
>>>> DataFrame
>>>> using binary operations then I dont need to create my own custom Hadoop
>>>> format am I on right track? Will reading binary data into DataFrame
>>>> scale?
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-create-DataFrame-from-a-binary-file-tp24179.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>> For additional commands, e-mail: user-help@spark.apache.org
>>>>
>>>>
>>>
>>
>

Re: How to create DataFrame from a binary file?

Posted by bo yang <bo...@gmail.com>.
Well, my post uses a raw text JSON file to show how to create a data frame
with a custom data schema. The key idea is the flexibility to handle any data
format by supplying your own schema. Sorry if I did not make that fully clear.

Anyway, let us know once you figure out your problem.




On Sun, Aug 9, 2015 at 11:10 AM, Umesh Kacha <um...@gmail.com> wrote:

> Hi Bo I know how to create a DataFrame my question is how to create a
> DataFrame for binary files and in your blog it is raw text json files
> please read my question properly thanks.
>
> On Sun, Aug 9, 2015 at 11:21 PM, bo yang <bo...@gmail.com> wrote:
>
>> You can create your own data schema (StructType in spark), and use
>> following method to create data frame with your own data schema:
>>
>> sqlContext.createDataFrame(yourRDD, structType);
>>
>> I wrote a post on how to do it. You can also get the sample code there:
>>
>> Light-Weight Self-Service Data Query through Spark SQL:
>>
>> https://www.linkedin.com/pulse/light-weight-self-service-data-query-through-spark-sql-bo-yang
>>
>> Take a look and feel free to  let me know for any question.
>>
>> Best,
>> Bo
>>
>>
>>
>> On Sat, Aug 8, 2015 at 1:42 PM, unk1102 <um...@gmail.com> wrote:
>>
>>> Hi how do we create DataFrame from a binary file stored in HDFS? I was
>>> thinking to use
>>>
>>> JavaPairRDD<String,PortableDataStream> pairRdd =
>>> javaSparkContext.binaryFiles("/hdfs/path/to/binfile");
>>> JavaRDD<PortableDataStream> javardd = pairRdd.values();
>>>
>>> I can see that PortableDataStream has method called toArray which can
>>> convert into byte array I was thinking if I have JavaRDD<byte[]> can I
>>> call
>>> the following and get DataFrame
>>>
>>> DataFrame binDataFrame =
>>> sqlContext.createDataFrame(javaBinRdd,Byte.class);
>>>
>>> Please guide I am new to Spark. I have my own custom format which is
>>> binary
>>> format and I was thinking if I can convert my custom format into
>>> DataFrame
>>> using binary operations then I dont need to create my own custom Hadoop
>>> format am I on right track? Will reading binary data into DataFrame
>>> scale?
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-create-DataFrame-from-a-binary-file-tp24179.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>> For additional commands, e-mail: user-help@spark.apache.org
>>>
>>>
>>
>

Re: How to create DataFrame from a binary file?

Posted by Umesh Kacha <um...@gmail.com>.
Hi Bo, I know how to create a DataFrame; my question is how to create a
DataFrame from binary files, whereas your blog uses raw text JSON files.
Please read my question carefully. Thanks.

On Sun, Aug 9, 2015 at 11:21 PM, bo yang <bo...@gmail.com> wrote:

> You can create your own data schema (StructType in spark), and use
> following method to create data frame with your own data schema:
>
> sqlContext.createDataFrame(yourRDD, structType);
>
> I wrote a post on how to do it. You can also get the sample code there:
>
> Light-Weight Self-Service Data Query through Spark SQL:
>
> https://www.linkedin.com/pulse/light-weight-self-service-data-query-through-spark-sql-bo-yang
>
> Take a look and feel free to  let me know for any question.
>
> Best,
> Bo
>
>
>
> On Sat, Aug 8, 2015 at 1:42 PM, unk1102 <um...@gmail.com> wrote:
>
>> Hi how do we create DataFrame from a binary file stored in HDFS? I was
>> thinking to use
>>
>> JavaPairRDD<String,PortableDataStream> pairRdd =
>> javaSparkContext.binaryFiles("/hdfs/path/to/binfile");
>> JavaRDD<PortableDataStream> javardd = pairRdd.values();
>>
>> I can see that PortableDataStream has method called toArray which can
>> convert into byte array I was thinking if I have JavaRDD<byte[]> can I
>> call
>> the following and get DataFrame
>>
>> DataFrame binDataFrame =
>> sqlContext.createDataFrame(javaBinRdd,Byte.class);
>>
>> Please guide I am new to Spark. I have my own custom format which is
>> binary
>> format and I was thinking if I can convert my custom format into DataFrame
>> using binary operations then I dont need to create my own custom Hadoop
>> format am I on right track? Will reading binary data into DataFrame scale?
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-create-DataFrame-from-a-binary-file-tp24179.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> For additional commands, e-mail: user-help@spark.apache.org
>>
>>
>

Re: How to create DataFrame from a binary file?

Posted by bo yang <bo...@gmail.com>.
You can create your own data schema (StructType in Spark) and use the
following method to create a data frame with it:

sqlContext.createDataFrame(yourRDD, structType);
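
For example, here is a rough sketch along those lines for raw text lines (the
field names, delimiter, and path are just for illustration, not taken from my
post):

import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Define the schema by hand instead of relying on reflection.
StructType structType = DataTypes.createStructType(Arrays.asList(
    DataTypes.createStructField("name", DataTypes.StringType, false),
    DataTypes.createStructField("age", DataTypes.IntegerType, true)));

// Parse each raw text line into a Row that matches the schema.
JavaRDD<Row> yourRDD = javaSparkContext.textFile("/hdfs/path/to/textfile")
    .map(line -> {
        String[] parts = line.split(",");
        return RowFactory.create(parts[0], Integer.parseInt(parts[1].trim()));
    });

DataFrame df = sqlContext.createDataFrame(yourRDD, structType);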

I wrote a post on how to do it. You can also get the sample code there:

Light-Weight Self-Service Data Query through Spark SQL:
https://www.linkedin.com/pulse/light-weight-self-service-data-query-through-spark-sql-bo-yang

Take a look, and feel free to let me know if you have any questions.

Best,
Bo



On Sat, Aug 8, 2015 at 1:42 PM, unk1102 <um...@gmail.com> wrote:

> Hi how do we create DataFrame from a binary file stored in HDFS? I was
> thinking to use
>
> JavaPairRDD<String,PortableDataStream> pairRdd =
> javaSparkContext.binaryFiles("/hdfs/path/to/binfile");
> JavaRDD<PortableDataStream> javardd = pairRdd.values();
>
> I can see that PortableDataStream has method called toArray which can
> convert into byte array I was thinking if I have JavaRDD<byte[]> can I call
> the following and get DataFrame
>
> DataFrame binDataFrame = sqlContext.createDataFrame(javaBinRdd,Byte.class);
>
> Please guide I am new to Spark. I have my own custom format which is binary
> format and I was thinking if I can convert my custom format into DataFrame
> using binary operations then I dont need to create my own custom Hadoop
> format am I on right track? Will reading binary data into DataFrame scale?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-create-DataFrame-from-a-binary-file-tp24179.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>