Posted to dev@nutch.apache.org by Piotr Kosiorowski <pk...@gmail.com> on 2005/05/04 22:03:53 UTC

Removing unwanted sites/urls from an index

Hi all,

While working on a domain-specific search engine I faced a problem where a 
sophisticated automatic classification algorithm was unable to 
categorize a page correctly. Additionally, the page could be considered 
offensive by many people, so there was strong pressure to remove it 
from the index. Sometimes, after a quick look at the data, one can identify 
"spammer" sites - sites that contain many pages that should not be a part of 
the index despite being domain related. I had a look at PruneIndexTool 
but could not find a way to use it the way I wanted. Besides, writing 
a small program with exactly the functionality described above is quite easy.
So some time ago I wrote a small utility program that takes a segment and a 
simple text file with URLs or host names as parameters and removes the given 
URLs or sites from the Lucene index in this segment. It works in a very 
simple way:
1) load all URL or site information into a HashSet in memory;
2) iterate over all Lucene documents, removing unwanted ones.

It has a drawback: the unwanted site/URL information must fit into JVM 
memory. Usually that is not a big problem, as one typically wants to remove 
special cases found manually. If it does become a problem, one can split the 
URL file into several files and process them one by one.
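The two steps above can be sketched as follows. This is a minimal, self-contained illustration of the matching logic only: the Lucene-specific parts (opening the segment's index, deleting a document by its number) are left out, documents are represented by plain URL strings, and the class and method names are mine, not the actual tool's.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the two-step removal: (1) load the URL/host file into a
// HashSet, (2) test each indexed document's URL against it. In the real
// tool, a positive test would translate into deleting that Lucene document.
public class PruneByUrlList {

    // Step 1: load all URL or host entries into a HashSet in memory.
    static Set<String> loadEntries(List<String> lines) {
        Set<String> entries = new HashSet<String>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                entries.add(trimmed);
            }
        }
        return entries;
    }

    // Step 2 (per document): remove when either the full URL or its
    // host appears in the set.
    static boolean shouldRemove(String url, Set<String> unwanted) {
        if (unwanted.contains(url)) {
            return true;
        }
        String host = hostOf(url);
        return host != null && unwanted.contains(host);
    }

    static String hostOf(String url) {
        try {
            return new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null;
        }
    }
}
```

Since the set lookup is O(1), the cost is one pass over the index, which is what makes the "must fit in JVM memory" trade-off acceptable for hand-collected URL lists.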
	
If anyone would find such an addition useful and worth committing to the 
nutch SVN, I can update my source code (ASF license / package names / remove 
JDK 1.5 dependencies) and send it to the list.
Regards,
Piotr

Re: Removing unwanted sites/urls from an index

Posted by Piotr Kosiorowski <pk...@gmail.com>.
Hello Andrzej,
As I wrote, I had a look at PruneIndexTool before writing my own tool. As I 
understand it, PruneIndexTool removes data from the index based on a Lucene 
query. For me it was not obvious how to construct a Lucene query that removes 
www.abc.com but not www.def.abc.com, www.def.com/abc, or 
www.aabc.com/def, taking into account all the strange things the Nutch 
tokenizer does with such addresses.
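The matching problem can be made concrete with plain Java (no Nutch or Lucene APIs, so the names here are illustrative): a token-overlap test, roughly what a tokenized query can degenerate into once the address is split on punctuation, matches several of the addresses above, while comparing the parsed host exactly matches only the intended one.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.Set;

// Contrast between token-overlap matching (query-like, over-broad) and
// exact host comparison (what the URL-list tool effectively does).
public class HostMatch {

    // Split a URL or host into lowercase alphanumeric tokens, roughly
    // mimicking how a tokenizer fragments an address.
    static Set<String> tokens(String s) {
        Set<String> out = new HashSet<String>();
        for (String t : s.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Matches whenever the site's tokens all occur somewhere in the URL.
    static boolean tokenMatch(String url, String site) {
        return tokens(url).containsAll(tokens(site));
    }

    // Matches only when the parsed host is exactly the given site.
    static boolean exactHostMatch(String url, String site) {
        try {
            return site.equals(new URI(url).getHost());
        } catch (URISyntaxException e) {
            return false;
        }
    }
}
```

With `www.abc.com` as the target, the token test also hits `www.def.abc.com` and `www.def.com/abc` (all three tokens occur in each), while the exact-host test hits none of the unwanted addresses.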

I also looked at the possibility of integration, but found that your 
tool does not go through all documents - it does a full-text search instead.
So the control flow is very different in the two cases, but of course we can 
try to create a common facade for both tools.

I think the two tools serve slightly different purposes. One removes
all documents found by a given Lucene query - excellent for removing 
all pages containing a given bad word or phrase (let's take "mortgage 
refinancing" as an example). In the second case one wants to remove
some specific pages or sites from the index. I agree your tool is more 
general, and the same result can probably be achieved using phrase queries
and conversion of queries through NutchDocumentAnalyzer, but I think the
use case for the simpler tool is quite common (especially for 
domain-restricted search engines), and for Nutch users who are perhaps not 
Lucene experts such a tool might be of some value.
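The difference in selection criterion can be sketched without Lucene at all. PruneIndexTool runs a real Lucene query against the index; below, the "query" is just a phrase test over an in-memory map of documents, purely to show content-based selection as opposed to URL-based selection. All names and data are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Query-style pruning selects documents by what they CONTAIN, not by
// where they live. Each key is a document id, each value its text; the
// returned ids are the documents a pruning pass would delete.
public class PruneByQuery {

    static List<String> matchingDocs(Map<String, String> docs, String phrase) {
        List<String> hits = new ArrayList<String>();
        String needle = phrase.toLowerCase();
        for (Map.Entry<String, String> e : docs.entrySet()) {
            if (e.getValue().toLowerCase().contains(needle)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}
```

A bad-phrase sweep like this catches spam pages wherever they are hosted, while the URL-list tool catches everything a known-bad host serves regardless of wording - which is why the two approaches complement rather than replace each other.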

I am not sure whether it should be added to Nutch - that is why I wrote an 
email instead of starting the port before receiving comments.
I hope I have explained my thinking behind reinventing the wheel :)
I am not planning to do anything on it right now - if nobody else 
finds it useful, I can live with using it on my own. I am just going one 
by one through the features I have implemented and checking whether they 
might be of some interest to the Nutch community. We benefited a lot just by 
using Nutch, so giving back my small fixes and tools is our small attempt to 
help others and push the whole thing forward.

Regards
Piotr



Andrzej Bialecki wrote:
> Piotr Kosiorowski wrote:
> 
>> If anyone would find such an addition useful and worth committing to 
>> the nutch SVN, I can update my source code (ASF license / package names / 
>> remove JDK 1.5 dependencies) and send it to the list.
> 
> 
> You're a bit late ;-) Please take a look at PruneIndexTool. If you think 
> your tool solves this or that in a better way, no problem - we can merge 
> these two.
> 
> 


Re: Removing unwanted sites/urls from an index

Posted by Andrzej Bialecki <ab...@getopt.org>.
Piotr Kosiorowski wrote:
> If anyone would find such an addition useful and worth committing to the 
> nutch SVN, I can update my source code (ASF license / package names / 
> remove JDK 1.5 dependencies) and send it to the list.

You're a bit late ;-) Please take a look at PruneIndexTool. If you think 
your tool solves this or that in a better way, no problem - we can merge 
these two.


-- 
Best regards,
Andrzej Bialecki
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com