Posted to user@hive.apache.org by Kiwon Lee <ki...@gmail.com> on 2012/08/20 17:33:03 UTC

Hadoop doesn't split gzip-compressed files, but my file seems to be split (^D^H)

Hi,

I have a 20 GB gzip-compressed log file on HDFS.
Because the log format is complex, I wrote a custom SerDe to parse it.
But while parsing the log file, a parsing exception occurred.
Instead of a complete line, the parser reads a line cut off by ^D^H:

127.0.0.1 [2012-08-20] "ABCDEFG" "JSKEJFKDJKFD"
127.0.0.1 [2012-08-20] "ABCDEFG" "JSKEJFKDJKFD"
127.0.0.1 [2012-08-20] "ABCDEFG" "JSKEJFKDJKFD"
127.0.0.1 [2012-08-20] "ABCDEFG" "JSKEJFKDJKFD"
127.0.0.1 [2012-08-20] "ABCDE *^D^H*

A small file (about 40 MB) does not cause the parsing error.
I read that Hadoop doesn't split gzip-compressed files, but this one seems to
be split.
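
One way to narrow this down would be to check, outside Hadoop, whether the ^D^H bytes are already present in the decompressed data. The following is only a minimal sketch: it assumes that ^D^H means the two control bytes 0x04 0x08, and the function name find_marker_lines is made up for illustration. It reads the gzip file as a single stream and reports the lines that contain those bytes.

```python
import gzip

# Assumption: "^D^H" in the parser output is the two control bytes 0x04 0x08.
MARKER = b"\x04\x08"


def find_marker_lines(path, marker=MARKER, limit=10):
    """Return (line_number, raw_line) for the first `limit` lines containing `marker`.

    The file is decompressed as one continuous stream, so no splitting is involved.
    """
    hits = []
    with gzip.open(path, "rb") as stream:
        # Iterating a binary stream yields lines terminated by b"\n".
        for line_number, line in enumerate(stream, start=1):
            if marker in line:
                hits.append((line_number, line))
                if len(hits) >= limit:
                    break
    return hits
```

If this finds such lines in a local copy of the file (fetched, for example, with hadoop fs -get), the bytes are in the log data itself and splitting is probably not the cause. If it finds none, the problem is more likely in how the input is read.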

Am I doing anything wrong?
Please help me.