Posted to common-dev@hadoop.apache.org by Zhijun Dai <da...@gmail.com> on 2008/05/22 11:15:44 UTC
inverted index for crawled webdata
Hello Friend,
I have a question about how to write a MapReduce job that builds an inverted
index for crawled web data.
My problem is this: if I store each page in its own file, the file-id is easy
to get, but I am afraid that billions of crawled pages stored as separate
files will be a problem for the Hadoop storage system.
If I store all pages in one big file, then how do I get the file-id for each
page during the MapReduce job?
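
To make the second option concrete, here is a minimal sketch in plain Python
(not the Hadoop API) of what I have in mind. It assumes, hypothetically, that
the big file holds one page per line in the form page_id<TAB>page text, so the
ID travels inside each record. The function names and the record layout are
made up for illustration only.

```python
import re
from collections import defaultdict


def map_page(record):
    """Map step: emit (term, page_id) for each distinct term in one page.

    Assumed record layout (hypothetical): 'page_id<TAB>page text'.
    """
    page_id, _, text = record.partition("\t")
    for term in set(re.findall(r"[a-z0-9]+", text.lower())):
        yield term, page_id


def reduce_term(term, page_ids):
    """Reduce step: collapse all page IDs for one term into a sorted posting list."""
    return term, sorted(set(page_ids))


def build_inverted_index(records):
    """Simulate the shuffle locally: group map output by term, then reduce."""
    grouped = defaultdict(list)
    for record in records:
        for term, page_id in map_page(record):
            grouped[term].append(page_id)
    return dict(reduce_term(term, ids) for term, ids in grouped.items())


if __name__ == "__main__":
    pages = [
        "p1\thadoop stores big files",
        "p2\tbig index on hadoop",
    ]
    for term, postings in sorted(build_inverted_index(pages).items()):
        print(term, postings)
```

With this layout the mapper never needs a file name, because each record
already carries its own ID. Whether that is the right way to do it in Hadoop
is exactly what I am unsure about.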
Thanks in advance!
Regards
Zhijun