Posted to dev@nutch.apache.org by Piotr Kosiorowski <pk...@gmail.com> on 2005/05/04 22:03:53 UTC
Removing unwanted sites/urls from an index
Hi all,
While working on a domain-specific search engine I ran into a case where
a sophisticated automatic classification algorithm was unable to
categorize a page correctly. The page could also be considered
offensive by many people, so there was strong pressure to remove it
from the index. Sometimes, after a quick look at the data, one can also
identify "spammer" sites that contain many pages that should not be
part of the index despite being domain related. I had a look at
PruneIndexTool but could not find a way to use it the way I wanted. In
addition, writing a small program with exactly the functionality
described above is quite easy.
So some time ago I wrote a small utility program that takes a segment
and a simple text file with URLs or host names as parameters and removes
the given URLs or sites from the Lucene index in that segment. It works
in a very simple way:
1) load all URL or site information into a HashSet in memory
2) iterate over all Lucene documents, removing the unwanted ones.
It has a drawback: the unwanted site/URL information must fit into JVM
memory. Usually this is not a big problem, as one wants to remove special
cases found manually. If it becomes a problem, one can split the URL file
into several files and process them one by one.
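The two steps above can be sketched roughly as follows. This is only an
illustration, not the utility itself: the class and method names are
hypothetical, the Lucene index is replaced by a plain in-memory list, and
it assumes each document exposes its URL and host name as plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; the class and method names are hypothetical
// and the Lucene index itself is replaced by a plain in-memory list.
class BlocklistPassSketch {

    // Stand-in for one indexed document: its full URL and its host name.
    static final class IndexedPage {
        final String url;
        final String host;

        IndexedPage(String url, String host) {
            this.url = url;
            this.host = host;
        }
    }

    // Step 1: put every non-empty line (a URL or a host name) into a HashSet.
    static Set<String> loadBlocklist(List<String> lines) {
        Set<String> entries = new HashSet<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                entries.add(trimmed);
            }
        }
        return entries;
    }

    // Step 2: visit every document once and collect the positions of the
    // unwanted ones. A real tool would delete each hit from the index here.
    static List<Integer> findUnwanted(List<IndexedPage> pages, Set<String> blocklist) {
        List<Integer> unwanted = new ArrayList<>();
        for (int docId = 0; docId < pages.size(); docId++) {
            IndexedPage page = pages.get(docId);
            if (blocklist.contains(page.url) || blocklist.contains(page.host)) {
                unwanted.add(docId);
            }
        }
        return unwanted;
    }

    public static void main(String[] args) {
        Set<String> blocklist = loadBlocklist(Arrays.asList(
                "www.spam.example", "", "http://www.ok.example/bad.html"));
        List<IndexedPage> pages = Arrays.asList(
                new IndexedPage("http://www.ok.example/good.html", "www.ok.example"),
                new IndexedPage("http://www.ok.example/bad.html", "www.ok.example"),
                new IndexedPage("http://www.spam.example/a.html", "www.spam.example"));
        System.out.println(findUnwanted(pages, blocklist)); // prints [1, 2]
    }
}
```

If the list does not fit into memory, the same pass can simply be run
once per chunk of the URL file, since every pass is independent of the
others.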
If anyone finds such an addition useful and worth committing to Nutch
SVN, I can update my source code (ASF license, package names, removal of
JDK 1.5 dependencies) and send it to the list.
Regards,
Piotr
Re: Removing unwanted sites/urls from an index
Posted by Piotr Kosiorowski <pk...@gmail.com>.
Hello Andrzej,
As I wrote, I had a look at PruneIndexTool before writing my own tool. As
I understand it, PruneIndexTool removes data from the index based on a
Lucene query. For me it was not obvious how to construct a Lucene query
that removes www.abc.com but not www.def.abc.com, www.def.com/abc or
www.aabc.com/def, taking into account all the strange things the Nutch
tokenizer does with such addresses.
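To make the distinction concrete, here is a small sketch of exact host
matching on the untokenized URL. It is an assumption about how such a
check could look, not a description of either tool: the class and method
names are hypothetical, and the addresses are given an http:// scheme so
that java.net.URI can extract the host.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only; names are hypothetical.
class ExactHostMatchSketch {

    // Returns the lower-cased host of a URL, or null if it cannot be parsed
    // or has no host part (for example when the scheme is missing).
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    // A URL is blocked only when its whole host equals a blocked host;
    // subdomains, path segments and look-alike names do not match.
    static boolean isBlocked(String url, Set<String> blockedHosts) {
        String host = hostOf(url);
        return host != null && blockedHosts.contains(host);
    }

    public static void main(String[] args) {
        Set<String> blocked = new HashSet<>(Arrays.asList("www.abc.com"));
        String[] urls = {
            "http://www.abc.com/index.html",
            "http://www.def.abc.com/",
            "http://www.def.com/abc",
            "http://www.aabc.com/def"
        };
        for (String url : urls) {
            System.out.println(url + " -> " + isBlocked(url, blocked));
        }
    }
}
```

Only the first address is reported as blocked. Because the comparison is
done on the whole host string, no tokenizer is involved at any point.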
I also looked at the possibility of integration but found out that your
tool does not go through all documents; it does a full-text search
instead. So the control flow is very different in the two cases, but of
course we can try to create a common facade for both tools.
I think the two tools serve slightly different purposes. One removes
all documents found by a given Lucene query, which is excellent for
removing all pages containing a given bad word or phrase (let's take
"mortgage refinancing" as an example). In the second case one wants to
remove some specific pages or sites from an index. I agree your tool is
more general, and the same result can probably be achieved using some
phrase queries and conversion of queries using NutchDocumentAnalyzer,
but I think the use case for a simpler tool is quite common (especially
for domain-restricted search engines), and for Nutch users (who may not
be Lucene experts) such a tool might be of some value.
I am not sure if it should be added to Nutch, which is why I wrote an
email instead of starting to port it before receiving comments.
I hope I explained my thinking behind reinventing the wheel :)
I am not planning to do anything with it right now; if nobody else
finds it useful I can live with using it on my own. I am just going one
by one through the features I have implemented and checking whether they
might be of some interest to the Nutch community. We have benefited a lot
just by using Nutch, so giving back my small fixes and tools is our small
attempt to help others and push the whole thing forward.
Regards
Piotr
Andrzej Bialecki wrote:
> Piotr Kosiorowski wrote:
>
>> If anyone would find such addition useful and worth committing to
>> nutch SVN I can update my source code (ASF license/ package names/
>> remove JDK 1.5 dependencies) and send it to the list.
>
>
> You're a bit late ;-) Please take a look at PruneIndexTool. If you think
> your tool solves this or that in a better way, no problem - we can merge
> these two.
>
>
Re: Removing unwanted sites/urls from an index
Posted by Andrzej Bialecki <ab...@getopt.org>.
Piotr Kosiorowski wrote:
> If anyone would find such addition useful and worth committing to nutch
> SVN I can update my source code (ASF license/ package names/ remove JDK
> 1.5 dependencies) and send it to the list.
You're a bit late ;-) Please take a look at PruneIndexTool. If you think
your tool solves this or that in a better way, no problem - we can merge
these two.
--
Best regards,
Andrzej Bialecki
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com