Posted to user@hive.apache.org by "Saurabh Bhatnagar (Business Intelligence)" <sa...@gmail.com> on 2013/09/29 19:35:23 UTC

Converting from textfile to sequencefile using Hive

Hi,

I have a lot of tweets saved as plain text. I created an external table on
top of them to access the data as a textfile table. I need to convert these
to SequenceFiles with each tweet as its own record. To do this, I created
another table stored as a SequenceFile, like so -

CREATE EXTERNAL TABLE tweetseq (
  tweet STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
STORED AS SEQUENCEFILE
LOCATION '/user/hdfs/tweetseq';
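
The conversion itself is just an insert from the original table, roughly
like this (the source table name "tweets" is illustrative, not my actual
table name):

INSERT OVERWRITE TABLE tweetseq
SELECT tweet
FROM tweets;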


Now when I insert into this table from my original tweets table, each line
gets its own record as expected. This is great. However, I don't have any
record ids. Short of writing my own UDF to generate them, are there any
obvious solutions I'm missing?
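
One thing I considered is generating a synthetic id in the query itself,
assuming Hive 0.11+ for windowing functions (table names are again
illustrative):

INSERT OVERWRITE TABLE tweetseq
SELECT CONCAT_WS('\t', CAST(row_number() OVER () AS STRING), tweet)
FROM tweets;

I'm not sure this is enough, though, since Hive writes the whole row into
the SequenceFile value; the id would not end up as the SequenceFile key.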

PS: I need the ids to be there because Mahout's seq2sparse expects them.
Without ids, it fails with -

java.lang.ClassCastException: org.apache.hadoop.io.BytesWritable cannot be cast to org.apache.hadoop.io.Text
    at org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper.map(SequenceFileTokenizerMapper.java:37)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)

Regards,
S