Posted to user@nutch.apache.org by Marcin Okraszewski <ok...@o2.pl> on 2007/05/25 11:50:13 UTC

How to create new file in segment?

I need to create a new file in a segment during parsing. Could anyone help me with how to do this?

I see two issues:
1. How do I get the location of the segment currently being processed?
2. I suppose I need to use Hadoop to write the file, but I don't know how to use it.

Some background on the problem, in case there is a better solution: I need to filter pages based on their content, so I cannot use a URLFilter. Furthermore, I still need to follow the links from the pages that get filtered out.

It seems that a solution might be to write a file with the URLs of the pages to drop (those that didn't match my criteria). Then I would apply a URL filter when merging segments (mergeseg). The URL filter would read the file created during parsing and drop every URL listed in it.
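
The filter half of this idea could look roughly like the sketch below. It is plain Java with no Nutch or Hadoop dependency, so it only illustrates the logic. The class name DropListFilter and the one-URL-per-line file format are assumptions made for illustration. The filter method follows the convention of Nutch's URLFilter interface (return the URL to keep it, null to drop it); a real plugin would implement that interface and would probably read the file through Hadoop rather than java.nio.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: loads a list of URLs to drop and filters against it.
class DropListFilter {
    // URLs that were marked for dropping during parsing.
    private final Set<String> dropped = new HashSet<>();

    // Reads the drop list; assumed format is one URL per line, blank lines ignored.
    DropListFilter(Path dropListFile) throws IOException {
        for (String line : Files.readAllLines(dropListFile)) {
            String url = line.trim();
            if (!url.isEmpty()) {
                dropped.add(url);
            }
        }
    }

    // Returns the URL unchanged if it should be kept, or null if it is on the drop list.
    String filter(String url) {
        return dropped.contains(url) ? null : url;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the file that would be written during parsing.
        Path file = Files.createTempFile("drop-list", ".txt");
        Files.write(file, Arrays.asList("http://example.com/unwanted"));

        DropListFilter filter = new DropListFilter(file);
        System.out.println(filter.filter("http://example.com/unwanted"));
        System.out.println(filter.filter("http://example.com/wanted"));

        Files.delete(file);
    }
}
```

The match here is an exact string comparison, so the URLs written during parsing would have to be in the same normalized form as the ones seen at merge time.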

Thanks for any help.
Marcin