Posted to solr-user@lucene.apache.org by Eric Martin <er...@makethembite.com> on 2010/11/05 21:33:05 UTC

Removing irrelevant URLS

Hi,

 

I have 100k URLs in my index. I specifically crawled sites relating to law.
However, during my initial crawls I didn't specify urlfilters, so I am stuck
with extrinsic and often irrelevant URLs like twitter, etc.

Is there some way in Solr that I can run periodic URL cleanings to remove
URLs and search string results? Or should I just dump my index and rebuild
using the filter?

I have looked on the Solr wiki and came across some candidates that look
like what I am trying to accomplish, but I am not sure. If anyone knows
where I should be looking, I would appreciate it.

 

Eric


RE: Removing irrelevant URLS

Posted by Eric Martin <er...@makethembite.com>.
OK, thanks. I am using Nutch and trying, so far unsuccessfully, to figure
out how to use urlfilters. I just thought there might be a way I could save
some trouble this way. Thanks!

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: Sunday, November 07, 2010 8:46 AM
To: solr-user@lucene.apache.org
Subject: Re: Removing irrelevant URLS

You can always do a delete-by-query, but that presupposes you can form
a query that would remove only those documents with URLs you want
removed... Assuming you do this, an optimize would then physically
remove the documents from your index (delete by query just marks
the docs as deleted).

Solr has nothing specifically for URLs; it's a search engine rather than a
web crawling app....

Best
Erick



Re: Removing irrelevant URLS

Posted by Erick Erickson <er...@gmail.com>.
You can always do a delete-by-query, but that presupposes you can form
a query that would remove only those documents with URLs you want
removed... Assuming you do this, an optimize would then physically
remove the documents from your index (delete by query just marks
the docs as deleted).
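
[Editor's note] The delete-by-query and optimize steps described above can be
sketched roughly as follows. This is only an illustration, not something from
the thread: it assumes Solr's XML update handler is reachable at
http://localhost:8983/solr/update and that the index has a field named "host"
holding each document's host name. The field name, the URL, and the helper
names (delete_by_query_xml, post_xml, purge_hosts) are all hypothetical and
would need to match your own schema and setup.

```python
import urllib.request
from xml.sax.saxutils import escape

# Assumption: location of the Solr XML update handler; adjust for your install.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def delete_by_query_xml(field, hosts):
    """Build a <delete> message whose query matches any of the given hosts."""
    clauses = " OR ".join(
        '{}:"{}"'.format(field, host.replace('"', '\\"')) for host in hosts
    )
    # Escape &, < and > so the query is safe inside the XML element.
    return "<delete><query>{}</query></delete>".format(escape(clauses))


def post_xml(body, url=SOLR_UPDATE_URL):
    """POST one XML update message to Solr and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def purge_hosts(hosts, field="host", send=post_xml):
    """Mark matching docs as deleted, commit, then optimize to drop them."""
    messages = [delete_by_query_xml(field, hosts), "<commit/>", "<optimize/>"]
    for message in messages:
        send(message)
    return messages


if __name__ == "__main__":
    # Dry run: print the messages instead of sending them to a server.
    purge_hosts(["twitter.com"], send=print)
```

Running the script as-is only prints the three messages, so the query can be
checked (for example, by running it as a normal search first) before anything
is deleted. Calling purge_hosts without the send argument would post them to
the assumed update URL.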

Solr has nothing specifically for URLs; it's a search engine rather than a
web crawling app....

Best
Erick
