Posted to dev@hive.apache.org by Mike Sukmanowsky <mi...@gmail.com> on 2011/10/08 02:47:29 UTC

Custom InputFormat for Multiline Input File Hive/Hadoop

Hi all,

Sending this to core-user@hadoop.apache.org and dev@hive.apache.org.

I'm trying to process Omniture's data log files with Hadoop/Hive. The file
format is tab-delimited and mostly simple, but a field may contain
newlines and tabs, each escaped by a backslash (\\n and \\t). As a result
I've opted to write my own InputFormat that joins the escaped newlines into
a single record and converts the escaped tabs to spaces, so that Hive's
split on tabs still yields the right columns.
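
To make the intended behaviour concrete, here is a rough sketch of the
record-joining logic on its own, with no Hadoop dependencies. It is not the
code from the gists below. The class and method names are made up for
illustration, and it assumes a backslash is never itself escaped and that an
escaped newline or tab can simply become a single space. In an old-API
reader, something like nextRecord would presumably be called from
RecordReader.next(key, value); split boundaries are ignored here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
public class EscapedRecordJoiner {

    // Reads one logical record. A physical line ending in a backslash is
    // treated as continuing on the next line (assumption: the backslash
    // is never itself escaped).
    static String nextRecord(BufferedReader in) throws IOException {
        String line = in.readLine();
        if (line == null) {
            return null; // end of input
        }
        StringBuilder record = new StringBuilder();
        while (line != null && line.endsWith("\\")) {
            // Drop the trailing backslash and replace the escaped newline
            // with a space.
            record.append(line, 0, line.length() - 1).append(' ');
            line = in.readLine();
        }
        if (line != null) {
            record.append(line);
        }
        // Replace each escaped tab (backslash followed by a tab) with a
        // space so that only real field separators remain.
        return record.toString().replace("\\\t", " ");
    }

    // Convenience wrapper: splits a whole string into logical records.
    static List<String> readAll(String input) {
        List<String> records = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new StringReader(input))) {
            String record;
            while ((record = nextRecord(in)) != null) {
                records.add(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        // Two logical records; the first spans two physical lines and
        // contains one escaped tab.
        String input = "a\tb\\\nc\td\\\te\nx\ty\n";
        for (String record : readAll(input)) {
            System.out.println(record.split("\t").length + " fields");
        }
    }
}
```

Each joined record can then be split on tabs in the usual way, which is
what Hive does with the row once the reader hands it over.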

I've found a fairly good reference for doing this using the newer
InputFormat API at http://blog.rguha.net/?p=293 but unfortunately my version
of Hive (0.7.0) still uses the old InputFormat API.

I haven't been able to find many tutorials on writing a custom InputFormat
using the older API, so I'm looking for some guidance as to what may be
wrong with the following two classes:

https://gist.github.com/3141e9d27d4e07f5f9ed
https://gist.github.com/79fdab227950a0776616

SELECT statements in Hive currently return nothing, and my other
variations returned nothing but NULL values.

This issue is also available on StackOverflow at
http://stackoverflow.com/questions/7692994/custom-inputformat-with-hive.

If there's a resource someone can point me to that'd also be great.

Many thanks in advance,
Mike
