Posted to common-dev@hadoop.apache.org by Zhijun Dai <da...@gmail.com> on 2008/05/22 11:15:44 UTC

inverted index for crawled webdata

Hello Friend,

I have a question about how to write a MapReduce job that builds an inverted
index for crawled web data.

My problem is this: if I store each page in its own file, the file-id is easy
to get, but I am afraid that storing billions of crawled pages this way will
be a problem for the Hadoop storage system.

If I instead store all pages in one big file, how do I get the file-id of
each page during the MapReduce job?
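
To make the second layout concrete, here is a minimal sketch in plain Python
rather than the Hadoop API. It assumes each line of the big file holds one
page, prefixed by its own id and a tab, so the id travels with the record and
no per-page file is needed. The record format and the names map_page,
reduce_postings, and build_index are hypothetical, for illustration only.

```python
from collections import defaultdict


def map_page(record):
    """Map step: one record is 'page_id<TAB>page text' (assumed format).

    Emits one (term, page_id) pair per distinct term on the page.
    """
    page_id, _, text = record.partition("\t")
    for term in set(text.lower().split()):
        yield term, page_id


def reduce_postings(pairs):
    """Reduce step: group page ids by term into sorted posting lists."""
    index = defaultdict(set)
    for term, page_id in pairs:
        index[term].add(page_id)
    return {term: sorted(ids) for term, ids in index.items()}


def build_index(records):
    """Run the map step over every record, then the reduce step."""
    pairs = (pair for record in records for pair in map_page(record))
    return reduce_postings(pairs)


if __name__ == "__main__":
    # Stand-in for the big file: one page per line, id first.
    big_file = [
        "page-1\thadoop stores big files",
        "page-2\tbig files suit hadoop",
    ]
    for term, ids in sorted(build_index(big_file).items()):
        print(term, ids)
```

In this sketch the id is read from the record itself, so it does not matter
which file or split a page came from.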

Thanks in advance!

Regards

Zhijun