Posted to common-dev@hadoop.apache.org by Alejandro Abdelnur <tu...@gmail.com> on 2008/04/17 19:37:28 UTC

finding the # of records in a SequenceFile

In our applications we usually need to know the number of records in a
SequenceFile, or in all the SequenceFiles in a directory. The job
output counters cannot be used, as the files are sometimes uploaded
directly to HDFS or moved between directories after the job runs.
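
For context, without a stored count the only option today is a full
scan of each file, along the lines of this sketch (standard
SequenceFile.Reader usage; the helper class name is just for
illustration):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    public class NaiveRecordCount {
      // Counts records by scanning the whole file -- exactly the
      // cost a stored counter would avoid.
      public static long count(FileSystem fs, Path file, Configuration conf)
          throws IOException {
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
        try {
          Writable key = (Writable)
              ReflectionUtils.newInstance(reader.getKeyClass(), conf);
          long n = 0;
          // next(key) skips deserializing the value.
          while (reader.next(key)) {
            n++;
          }
          return n;
        } finally {
          reader.close();
        }
      }
    }

This costs a read of every byte in the file per query, which is what
we want to avoid.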

We wrote a SequenceFileCounterOutputFormat that extends
SequenceFileOutputFormat, wrapping the returned RecordWriter with a
proxy that keeps a count of the written records and, on writer close,
writes the count to a '_FILENAME.counter' file. A couple of static
methods allow retrieving the count for a single file or for all the
files in a directory, reading the counter files and adding them up in
the latter case.
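
In rough terms it looks like the sketch below (written against the
classic org.apache.hadoop.mapred API; the counter-file naming and the
path handling are illustrative, not the exact code we have, and exact
signatures differ across Hadoop versions):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordWriter;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;
    import org.apache.hadoop.util.Progressable;

    public class SequenceFileCounterOutputFormat<K, V>
        extends SequenceFileOutputFormat<K, V> {

      public RecordWriter<K, V> getRecordWriter(FileSystem ignored,
          JobConf job, String name, Progressable progress)
          throws IOException {
        final RecordWriter<K, V> real =
            super.getRecordWriter(ignored, job, name, progress);
        // Side file holding the record count, e.g. _part-00000.counter
        final Path counterFile =
            FileOutputFormat.getTaskOutputPath(job, "_" + name + ".counter");
        final FileSystem fs = counterFile.getFileSystem(job);
        return new RecordWriter<K, V>() {
          private long count = 0;

          public void write(K key, V value) throws IOException {
            real.write(key, value);  // delegate, then count
            count++;
          }

          public void close(Reporter reporter) throws IOException {
            real.close(reporter);
            // Persist the count next to the output file.
            FSDataOutputStream out = fs.create(counterFile);
            out.writeBytes(Long.toString(count) + "\n");
            out.close();
          }
        };
      }

      // Reads the count back from a single counter file.
      public static long getCount(FileSystem fs, Path counterFile)
          throws IOException {
        BufferedReader in = new BufferedReader(
            new InputStreamReader(fs.open(counterFile)));
        try {
          return Long.parseLong(in.readLine().trim());
        } finally {
          in.close();
        }
      }

      // Sums the counts over all counter files in a directory.
      public static long getCountForDir(FileSystem fs, Path dir)
          throws IOException {
        long total = 0;
        for (FileStatus status : fs.listStatus(dir)) {
          if (status.getPath().getName().endsWith(".counter")) {
            total += getCount(fs, status.getPath());
          }
        }
        return total;
      }
    }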

Does anybody else have such requirements?

My concern with this approach is that it needs an extra file per
output file just to keep the counter.

A way of addressing my concern would be to modify SequenceFile so
that the counter is written at the very end of the file, after the
sync of the last record (a special sync marker could be used to
differentiate EOF from EOR). A new method in SequenceFile would then
position at the end of the file and read the counter.
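
To make the read side concrete, it could be as cheap as a seek plus a
fixed-width read, roughly like this (purely hypothetical: it assumes
the count is appended as a trailing 8-byte long, and a real
implementation inside SequenceFile would also verify the special sync
marker to tell EOF from EOR):

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TrailerCount {
      // Hypothetical reader for a record count stored as the last
      // 8 bytes of the file, after the special sync marker.
      public static long readTrailingCount(FileSystem fs, Path file)
          throws IOException {
        long len = fs.getFileStatus(file).getLen();
        FSDataInputStream in = fs.open(file);
        try {
          in.seek(len - 8);   // jump straight to the counter
          return in.readLong();
        } finally {
          in.close();
        }
      }
    }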

Thoughts?

Thxs.

A
PS: if the idea of modifying the SequenceFile to support this feature
does not fly, we can still contribute our
SequenceFileCounterOutputFormat to mapred.lib.