Posted to user@nutch.apache.org by Jeroen van Vianen <je...@vanvianen.nl> on 2010/08/17 13:04:21 UTC

Removing URLs from index

Hi,

I happen to have accumulated a lot of URLs in my index with the 
following layout:

http://www.company.com/directory1;if(T.getElementsByClassName(
http://www.company.com/directory2;this.bottomContainer.appendChild(u);break;case

There seem to be errors in the discovery of links from one page to the 
next. I have now excluded URLs with a ';' in regex-urlfilter.txt.
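For the archives, such an exclusion can be a one-line deny rule; a sketch of the relevant regex-urlfilter.txt fragment (the stock file already denies characters like '?' with a rule of this shape):

```
# skip any URL containing a semicolon
-[;]
```

Rules prefixed with '-' reject any URL the regex matches anywhere; note this also rejects legitimate ';jsessionid=...' style URLs, which may or may not be acceptable for a given site.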

My question now is, how do I remove these documents from the index?

Regards,


Jeroen

Re: Removing URLs from index

Posted by Markus Jelsma <ma...@buyways.nl>.
On Tuesday 17 August 2010 13:47:32 Jeroen van Vianen wrote:
> 
> Yes. I have lots of similar results because of these URLs occurring many
> times for the same original URL.

You can use deduplication [1]. It generates signatures for exact or 
near-duplicate content, depending on configuration, and can then 
optionally overwrite (delete) the duplicates.

[1]: http://wiki.apache.org/solr/Deduplication
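For completeness, a sketch of what [1] describes, as an update processor chain in solrconfig.xml (the field names are illustrative, not from any particular schema, and the signature field must also exist in schema.xml):

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- field (declared in schema.xml) that stores the computed signature -->
    <str name="signatureField">signature</str>
    <!-- true = a newly indexed duplicate overwrites (deletes) the older one -->
    <bool name="overwriteDupes">true</bool>
    <!-- fields fed into the signature; illustrative names -->
    <str name="fields">title,content</str>
    <!-- Lookup3Signature for exact content; TextProfileSignature for near duplicates -->
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

The chain then has to be referenced from the update request handler so it actually runs at index time.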

> 
> Thanks and best regards,
> 
> 
> Jeroen
> 

Markus Jelsma - Technical Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Removing URLs from index

Posted by Jeroen van Vianen <je...@vanvianen.nl>.
On 17-8-2010 13:35, Alex McLintock wrote:
>> I happen to have accumulated a lot of URLs in my index with the following
>> layout:
>>
>> http://www.company.com/directory1;if(T.getElementsByClassName(
>> http://www.company.com/directory2;this.bottomContainer.appendChild(u);break;case
>
> Hmmm,
>
> This may be thinking out loud rather than helpful:
>
> I thought ";" was supposed to introduce a session id. I wonder if we
> can or should be ignoring everything after the ";" character.

Maybe we should. I'm unsure why these JS fragments were added to the 
URLs to crawl in the first place. The problem is that the web server 
happily serves URLs with the above structure and generates proper 
content, probably because the JS fragment is an invalid session id and 
the web server automatically creates a new session.

> I've recently seen cases where something which looked like a URL
> appeared in some Javascript and Nutch identified it as something to
> crawl. I don't know whether there is an easy fix.
>
>
>> There seem to be errors in the discovery of links from one page to the next.
>> I have now excluded URLs with a ';' in regex-urlfilter.txt.
>>
>> My question now is, how do I remove these documents from the index?
>
>
> Not sure. I suppose you could add in a plugin of your own which gets
> used when you extract the index - but I guess that would be too much
> trouble for you.
>
> May I ask why you want them removed from the index? Is it because you
> don't want users seeing them?

Yes. I have lots of similar results because of these URLs occurring many 
times for the same original URL.

Thanks and best regards,


Jeroen

Re: Removing URLs from index

Posted by Alex McLintock <al...@gmail.com>.
On 17 August 2010 12:04, Jeroen van Vianen <je...@vanvianen.nl> wrote:
> Hi,
>
> I happen to have accumulated a lot of URLs in my index with the following
> layout:
>
> http://www.company.com/directory1;if(T.getElementsByClassName(
> http://www.company.com/directory2;this.bottomContainer.appendChild(u);break;case

Hmmm,

This may be thinking out loud rather than helpful:

I thought ";" was supposed to introduce a session id. I wonder if we
can or should be ignoring everything after the ";" character.
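For what it's worth, the idea is easy to sketch outside Nutch; a minimal, hypothetical normalizer (the function name is mine) that keeps everything before the first ';':

```python
def strip_semicolon_suffix(url: str) -> str:
    """Return the URL with everything from the first ';' onward removed.

    A ';' in a URL path introduces matrix parameters (commonly a
    ';jsessionid=...' session id), so this also drops legitimate
    parameters -- acceptable here, where the suffixes are JS garbage.
    """
    return url.split(";", 1)[0]


print(strip_semicolon_suffix(
    "http://www.company.com/directory1;if(T.getElementsByClassName("))
# http://www.company.com/directory1
```

Inside Nutch itself, the regex URL normalizer (conf/regex-normalize.xml) looks like the natural place for such a rule, with a pattern along the lines of `;.*$` and an empty substitution.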

I've recently seen cases where something which looked like a URL
appeared in some Javascript and Nutch identified it as something to
crawl. I don't know whether there is an easy fix.


> There seem to be errors in the discovery of links from one page to the next.
> I have now excluded URLs with a ';' in regex-urlfilter.txt.
>
> My question now is, how do I remove these documents from the index?


Not sure. I suppose you could add in a plugin of your own which gets
used when you extract the index - but I guess that would be too much
trouble for you.

May I ask why you want them removed from the index? Is it because you
don't want users seeing them?

Alex
> Regards,
>
>
> Jeroen
>

Re: Removing URLs from index

Posted by Jeroen van Vianen <je...@vanvianen.nl>.
On 17-8-2010 13:35, Markus Jelsma wrote:
> I assume it's about your Solr index again (for which you should mail the
> Solr mailing list). It features deleteById and deleteByQuery methods, but in
> your case it's going to be rather hard. Your URL field is, using the stock
> schema, analyzed and has a tokenizer that strips characters such as your
> semicolon. Perhaps you can find a common trait amongst your bogus URLs that
> can be queried. If not, you must do it manually.

That's too bad, as I'm unsure which URLs to look for. I think I'll just 
remove the entire domain name and crawl it again.

> But if you reindex from Nutch, the already fetched and parsed pages will
> reappear in your Solr index. Removing data from Nutch is really hard, but
> because of your URL filter the generate command will no longer add those
> URLs to the fetch queue. The pages are, however, still in the segments.

Clear.

Thanks,


Jeroen

Re: Removing URLs from index

Posted by Markus Jelsma <ma...@buyways.nl>.
Hi,

I assume it's about your Solr index again (for which you should mail the 
Solr mailing list). It features deleteById and deleteByQuery methods, but in 
your case it's going to be rather hard. Your URL field is, using the stock 
schema, analyzed and has a tokenizer that strips characters such as your 
semicolon. Perhaps you can find a common trait amongst your bogus URLs that 
can be queried. If not, you must do it manually.
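For reference, both methods boil down to plain XML update messages POSTed to Solr's /update handler; a sketch, assuming (as in a stock Nutch setup) that the document id is the raw URL and that a string-typed host field exists:

```xml
<!-- delete a single document by id (the id being the full, unanalyzed URL) -->
<delete><id>http://www.company.com/directory1;if(T.getElementsByClassName(</id></delete>

<!-- or delete by query, if a queryable common trait exists -->
<delete><query>host:www.company.com</query></delete>

<!-- followed by a commit to make the deletes visible -->
<commit/>
```

The deleteByQuery route only works on fields whose analysis preserves what you query for, which is exactly the problem with the analyzed URL field above.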

But if you reindex from Nutch, the already fetched and parsed pages will 
reappear in your Solr index. Removing data from Nutch is really hard, but 
because of your URL filter the generate command will no longer add those 
URLs to the fetch queue. The pages are, however, still in the segments.

Cheers,

On Tuesday 17 August 2010 13:04:21 Jeroen van Vianen wrote:
> Hi,
> 
> I happen to have accumulated a lot of URLs in my index with the
> following layout:
> 
> http://www.company.com/directory1;if(T.getElementsByClassName(
> http://www.company.com/directory2;this.bottomContainer.appendChild(u);break;case
> 
> There seem to be errors in the discovery of links from one page to the
> next. I have now excluded URLs with a ';' in regex-urlfilter.txt.
> 
> My question now is, how do I remove these documents from the index?
> 
> Regards,
> 
> 
> Jeroen
> 

Markus Jelsma - Technical Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350