Posted to user@spark.apache.org by Michael Segel <ms...@hotmail.com> on 2018/01/18 15:32:54 UTC

Reading Hive RCFiles?

Hi, 

I’m trying to find out if there’s a simple way for Spark to be able to read an RCFile. 

I know I can create a table in Hive, then drop the files in to that directory and use a sql context to read the file from Hive, however I wanted to read the file directly. 

Not a lot of details to go on… even the Apache site’s links are broken. 
See :
https://cwiki.apache.org/confluence/display/Hive/RCFile

Then try to follow the Javadoc link. 


Any suggestions? 

Thx

-Mike


Re: Reading Hive RCFiles?

Posted by Prakash Joshi <pr...@gmail.com>.
If it's simply reading the files from their source in HDFS, there is the option
of sc.hadoopFile in the Spark API.
I'm not sure whether Spark SQL provides a direct method to read them.
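A rough sketch of that sc.hadoopFile approach via the Java API (untested here; it assumes spark-core and the Hive hive-exec jar are on the classpath, and the input path is hypothetical):

```java
import org.apache.hadoop.hive.ql.io.RCFileInputFormat;
import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadRCFile {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "rcfile-read");
        // RCFileInputFormat yields (LongWritable, BytesRefArrayWritable)
        // pairs: one value per row, with each column as a raw byte range.
        JavaPairRDD<LongWritable, BytesRefArrayWritable> rdd =
            sc.hadoopFile("hdfs:///tmp/data.rc",   // hypothetical path
                          RCFileInputFormat.class,
                          LongWritable.class,
                          BytesRefArrayWritable.class);
        System.out.println(rdd.count());
        sc.stop();
    }
}
```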




Re: Reading Hive RCFiles?

Posted by Michael Segel <ms...@hotmail.com>.
Just to follow up…

I was able to create an RDD from the file; however, diving into the RDD is a bit weird, and I’m working through it. My test file seems to be one block … 3K rows. So when I tried to get the first column of the first row, I ended up getting all of the rows for the first column, comma delimited. The other issue is converting numeric fields back from their byte encoding; I have the schema, so I can do that. (This is also an issue with RCFileCat, sorry if I messed that name up; things work great if you’re using strings only.)
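The behavior described above matches RCFile's column-major layout: within a row group the file stores all values of column 0, then all values of column 1, and so on, so naively grabbing a "first value" hands back a whole column for the group rather than one field. A toy illustration (plain Java, no Spark or Hive dependencies; the data is invented) of transposing such a column-major group back into rows:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ColumnarToRows {
    // Toy model of one RCFile row group: each outer entry holds one
    // column's values for every row in the group (column-major order).
    static List<String[]> toRows(String[][] columns) {
        int nRows = columns[0].length;
        List<String[]> rows = new ArrayList<>();
        for (int r = 0; r < nRows; r++) {
            String[] row = new String[columns.length];
            for (int c = 0; c < columns.length; c++) {
                row[c] = columns[c][r];   // pick row r out of each column
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        String[][] group = {
            {"1", "2", "3"},   // column 0 for rows 0..2
            {"a", "b", "c"}    // column 1 for rows 0..2
        };
        List<String[]> rows = toRows(group);
        System.out.println(Arrays.toString(rows.get(0))); // prints [1, a]
    }
}
```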

I guess this could be the start of a project (time permitting) to make older file formats as easy to read as Parquet and ORC files.

Will have to follow up in Dev.

Thanks everyone for the pointers.


On Jan 20, 2018, at 5:55 PM, Jörn Franke <jo...@gmail.com> wrote:

Forgot to add the mailing list

On 18. Jan 2018, at 18:55, Jörn Franke <jo...@gmail.com> wrote:

Well, you can use:
https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#hadoopRDD-org.apache.hadoop.mapred.JobConf-java.lang.Class-java.lang.Class-java.lang.Class-int-

with the following inputformat:
https://hive.apache.org/javadocs/r2.1.1/api/org/apache/hadoop/hive/ql/io/RCFileInputFormat.html

(Note: the Javadoc version does not matter; this has been possible for a long time.)

Writing works similarly, with a PairRDD and RCFileOutputFormat.
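The pointers above translate into something like the following untested sketch for pulling individual columns out of each row once you have the (LongWritable, BytesRefArrayWritable) pairs from hadoopRDD. BytesRefArrayWritable and BytesRefWritable come from the hive-exec jar; string columns can be decoded as shown, while numeric columns need the table schema, as noted elsewhere in this thread.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable;
import org.apache.hadoop.hive.serde2.columnar.BytesRefWritable;

public class DecodeRow {
    // Each BytesRefArrayWritable holds one row; column i is a byte range
    // that the caller decodes according to the table schema.
    static String stringColumn(BytesRefArrayWritable row, int i)
            throws java.io.IOException {
        BytesRefWritable ref = row.get(i);
        return new String(ref.getData(), ref.getStart(), ref.getLength(),
                          StandardCharsets.UTF_8);
    }
}
```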




Re: Reading Hive RCFiles?

Posted by Jörn Franke <jo...@gmail.com>.
Forgot to add the mailing list

> On 18. Jan 2018, at 18:55, Jörn Franke <jo...@gmail.com> wrote:
> 
> Welll you can use:
> https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#hadoopRDD-org.apache.hadoop.mapred.JobConf-java.lang.Class-java.lang.Class-java.lang.Class-int-
> 
> with the following inputformat:
> https://hive.apache.org/javadocs/r2.1.1/api/org/apache/hadoop/hive/ql/io/RCFileInputFormat.html
> 
> (note the version of the Javadoc does not matter it is already possible since a long time).
> 
> Writing is similarly with PairRDD and RCFileOutputFormat
> 

Re: Reading Hive RCFiles?

Posted by Michael Segel <ms...@hotmail.com>.
No idea on how that last line of garbage got in the message. 

