Posted to user@orc.apache.org by praveen reddy <pr...@gmail.com> on 2016/07/26 15:51:31 UTC

Need Help and Clarification

Hi All,

I am new to ORC (and to Hadoop as well). I am currently trying to use the
ORC format to save data on HDFS from an Apache Storm topology.
I receive JSON messages from Kafka, convert each message to a Java object,
add a few more properties, convert it to comma-delimited values, and then
save the data to HDFS.
After going through some online material, I am still unable to figure out
how to implement this and what the data will actually look like in ORC
format.

The data I need to store on HDFS arrives as a String (I can change it to any
Java object if needed). I want to save it along with header info that
describes what each attribute in the data is; the header needs to be saved
only once per file.

Now the question is: how can I create the header (similar to the header we
create in a txt or csv file)? In the tutorials I went through there is no
header, and a schema definition needs to be defined. If that is the case,
how can I save the schema definition in the file?

Should the stripeSize be the same as the block size of the HDFS file system?
What happens if the two are not equal?

I have to write to different ORC files based on the timestamp I get from the
JSON message. I will not know which file to write to until I parse the
JSON message.
The timestamp can be anything. Does closing and reopening the ORC writer for
each tuple (JSON message) take a lot of resources?
If I don't know which file I need to write to until I see the timestamp in
the JSON message, what other effective way is there to write to the files?
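
To make this last question concrete, here is a rough sketch of the kind of
alternative I have in mind: keep a small number of writers open, keyed by the
hour of the timestamp, and close only the least recently used one when the
limit is reached. Everything here is hypothetical and not part of the ORC API:
the WriterCache class, the hourly bucketing, the "-partN.orc" naming and the
opener function are all my own assumptions. I use Closeable as a stand-in for
the real writer, and I give a reopened bucket a new part number on the
assumption that a closed ORC file cannot be appended to.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: keeps at most maxOpen writers open, one per hourly bucket.
final class WriterCache<W extends Closeable> {
    private static final DateTimeFormatter HOUR =
        DateTimeFormatter.ofPattern("yyyy-MM-dd-HH").withZone(ZoneOffset.UTC);

    private final int maxOpen;
    private final Function<String, W> opener;  // creates a writer for a file name
    // accessOrder=true makes iteration start at the least recently used entry
    private final LinkedHashMap<String, W> open = new LinkedHashMap<>(16, 0.75f, true);
    private final Map<String, Integer> parts = new HashMap<>();

    WriterCache(int maxOpen, Function<String, W> opener) {
        this.maxOpen = maxOpen;
        this.opener = opener;
    }

    // Maps a message timestamp (epoch millis) to its hourly bucket, in UTC.
    static String bucketFor(long epochMillis) {
        return HOUR.format(Instant.ofEpochMilli(epochMillis));
    }

    // Returns the open writer for the timestamp's bucket, opening one if needed.
    W writerFor(long epochMillis) {
        String bucket = bucketFor(epochMillis);
        W writer = open.get(bucket);
        if (writer == null) {
            if (open.size() >= maxOpen) {
                evictLeastRecentlyUsed();
            }
            // Each (re)open of a bucket gets a fresh part number and file name.
            int part = parts.merge(bucket, 1, Integer::sum) - 1;
            writer = opener.apply(bucket + "-part" + part + ".orc");
            open.put(bucket, writer);
        }
        return writer;
    }

    private void evictLeastRecentlyUsed() {
        Iterator<Map.Entry<String, W>> it = open.entrySet().iterator();
        close(it.next().getValue());
        it.remove();
    }

    // Closes every open writer, e.g. when the topology shuts down.
    void closeAll() {
        for (W writer : open.values()) {
            close(writer);
        }
        open.clear();
    }

    int openCount() {
        return open.size();
    }

    private static void close(Closeable c) {
        try {
            c.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With something like this, each tuple would call writerFor(timestamp) and write
its row to the returned writer, so a writer is closed only when it is evicted
or at shutdown rather than after every message. I do not know whether this is
the recommended approach for ORC, which is part of what I am asking.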

I feel my understanding of the ORC format is wrong, but I am unable to move
forward, and I am not getting a clear picture of what the ORC format looks
like and how to save data in it.
We are not using Hive.

Can you please give me some input on how to move forward?