Posted to common-user@hadoop.apache.org by Amit Simgh <am...@cse.iitb.ac.in> on 2008/09/06 21:21:26 UTC

Hadoop custom readers and writers

Hi,

I have thousands of webpages, each represented as a serialized tree object,
compressed together with ZLIB into files whose sizes vary from 2.5 GB to 4.5 GB.
I have to do some heavy text processing on these pages.

What is the best way to read/access these pages?

Method 1
***************
1) Write a custom splitter that
    1. uncompresses the file (2.5 GB to 4 GB) and then parses it (takes
around 10 minutes)
    2. splits the binary data into 10-20 parts
2) Implement specific readers to read a page and present it to the mapper
(rough sketch below)
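Roughly the shape I have in mind for Method 1, as a sketch only: PageInputFormat,
PageRecordReader and the byte-blob page value are just made-up names, the only
real pieces are the old mapred interfaces (FileInputFormat, RecordReader, etc.):

    // Sketch only: a custom InputFormat for Method 1 (old mapred API).
    // PageInputFormat and PageRecordReader are hypothetical names.
    import java.io.IOException;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapred.*;

    public class PageInputFormat extends FileInputFormat<LongWritable, BytesWritable> {

      // Step 1: decide how the compressed file is carved into splits.
      // The ZLIB stream cannot be cut at arbitrary byte offsets, so the idea
      // would be to uncompress/parse once, record page offsets, and emit one
      // split per region (10-20 regions per file).
      public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
        // ... build and return one split per pre-computed region ...
        return super.getSplits(job, numSplits); // placeholder
      }

      // Step 2: a reader that hands one page at a time to the mapper.
      public RecordReader<LongWritable, BytesWritable> getRecordReader(
          InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new PageRecordReader((FileSplit) split, job);
      }

      public static class PageRecordReader
          implements RecordReader<LongWritable, BytesWritable> {
        private long pos = 0;

        PageRecordReader(FileSplit split, JobConf job) throws IOException {
          // open the file, skip to this split's region, wrap it in an
          // InflaterInputStream, etc.
        }

        public boolean next(LongWritable key, BytesWritable value) throws IOException {
          // deserialize the next page tree into 'value';
          // return false at the end of the region
          return false; // placeholder
        }

        public LongWritable createKey()    { return new LongWritable(); }
        public BytesWritable createValue() { return new BytesWritable(); }
        public long getPos()               { return pos; }
        public float getProgress()         { return 0.0f; }
        public void close()                { }
      }
    }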

OR.

Method 2
***************
Read the entire file without splitting: one map task per file.
Implement specific readers to read a page and present it to the mapper
(see the isSplitable sketch below).
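For Method 2, as I understand it, the only InputFormat change is to stop Hadoop
from splitting the file at all; a sketch (same hypothetical naming as above, the
reader itself would be the same PageRecordReader):

    // Sketch only: Method 2, one map task per whole file.
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapred.FileInputFormat;

    public abstract class WholeFilePageInputFormat
        extends FileInputFormat<LongWritable, BytesWritable> {

      // Returning false makes each input file a single split, so one mapper
      // streams the whole 2.5-4.5 GB file through the page reader.
      protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
      }
    }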

Slight detour:
I was browsing through the code in FileInputFormat and TextInputFormat. In the
getSplits method the file is broken at arbitrary byte boundaries (my rough
reading of that logic is sketched below).
So in the case of TextInputFormat, what happens if the last line a mapper sees
is truncated (an incomplete byte sequence)?
Can someone explain and give pointers to where in the code this is handled?
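For reference, this is what I mean by "arbitrary byte boundaries"; it is my own
paraphrase of the split computation, not the real Hadoop source, and SplitSketch
and all the parameters are just illustrative names:

    // My rough reading of FileInputFormat.getSplits (paraphrased sketch):
    // the file is carved into fixed-size byte ranges with no knowledge of
    // where a text line (or any record) starts or ends.
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileSplit;

    public class SplitSketch {
      static List<FileSplit> carve(Path path, long fileLength,
                                   long totalSize, int numSplits,
                                   long minSplitSize, long blockSize) {
        long goalSize  = totalSize / numSplits;
        long splitSize = Math.max(minSplitSize, Math.min(goalSize, blockSize));
        List<FileSplit> splits = new ArrayList<FileSplit>();
        for (long start = 0; start < fileLength; start += splitSize) {
          long length = Math.min(splitSize, fileLength - start);
          // a split can begin or end in the middle of a line
          splits.add(new FileSplit(path, start, length, new String[0]));
        }
        return splits;
      }
    }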

I also saw classes like Records. What are these used for?