Posted to solr-user@lucene.apache.org by Markus Jelsma <ma...@openindex.io> on 2013/11/01 17:35:07 UTC

RE: Exclude urls without 'www' from Nutch 1.7 crawl

Hi - Use the domain-urlfilter for host, domain and TLD filtering.

Also, please ask questions on the Nutch list, you're on Solr now :)
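
For illustration only, here is a minimal Python sketch of the host-level filtering idea. It is not Nutch code: Nutch's filters are Java plugins driven by configuration files. The sketch keeps a URL only when its host is on an allow-list, which is roughly the effect one would expect from listing www.mywebsite.com as the only permitted host. The names `ALLOWED_HOSTS` and `accept_url` are hypothetical, and the `http://` scheme on the sample URLs is an assumption.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list: only the 'www' host of the example site.
ALLOWED_HOSTS = {"www.mywebsite.com"}


def accept_url(url, allowed_hosts=ALLOWED_HOSTS):
    """Return the URL unchanged if its host is allowed, otherwise None."""
    host = urlsplit(url).hostname  # lowercased host, or None if missing
    return url if host in allowed_hosts else None


# The four URLs from the original question (scheme assumed).
urls = [
    "http://www.mywebsite.com/",
    "http://mywebsite.com/",
    "http://www.mywebsite.com/page1",
    "http://mywebsite.com/page1",
]

# Keep only URLs whose host is exactly www.mywebsite.com.
kept = [u for u in urls if accept_url(u) is not None]
print(kept)
```

Matching on the parsed host rather than on the raw URL string avoids accidentally accepting a URL that merely contains "www.mywebsite.com" somewhere in its path or query.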
 
 
-----Original message-----
> From:Reyes, Mark <Ma...@bpiedu.com>
> Sent: Friday 1st November 2013 17:24
> To: solr-user@lucene.apache.org
> Subject: Exclude urls without 'www' from Nutch 1.7 crawl
> 
> I'm currently using Nutch 1.7 to crawl my domain. My issue is specific to URLs being indexed as www vs. non-www.
> 
> Specifically, after running the crawl, indexing to Solr 4.5 and validating the results on the front end with AJAX Solr, the search results page lists pages under both the 'www' and the bare (non-'www') hostnames, such as:
> 
> www.mywebsite.com
> mywebsite.com
> www.mywebsite.com/page1
> mywebsite.com/page1
> 
> My understanding is that the URL filtering (regex-urlfilter.txt) needs modification. Are there any regex/Nutch experts who could suggest a solution?
> 
> Here is the code on Pastebin,
> http://pastebin.com/Cp6vUxPR
> 
> Also on Stack Overflow,
> http://stackoverflow.com/questions/19731904/exclude-urls-without-www-from-nutch-1-7-crawl
> 
> Thank you,
> Mark
> 
> 
> IMPORTANT NOTICE: This e-mail message is intended to be received only by persons entitled to receive the confidential information it may contain. E-mail messages sent from Bridgepoint Education may contain information that is confidential and may be legally privileged. Please do not read, copy, forward or store this message unless you are an intended recipient of it. If you received this transmission in error, please notify the sender by reply e-mail and delete the message and any attachments.

Re: Exclude urls without 'www' from Nutch 1.7 crawl

Posted by "Reyes, Mark" <Ma...@bpiedu.com>.

Noted, and will do (that goes both for the suggestions and for moving this
to the Nutch list instead).

Thanks all,
Mark






Re: Exclude urls without 'www' from Nutch 1.7 crawl

Posted by Furkan KAMACI <fu...@gmail.com>.

As Markus pointed out, Nutch has a feature for this kind of situation. This
is the Solr list, but one more thing for you: www.mywebsite.com and
mywebsite.com may point to "different" pages.
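
One hedged way to act on that caveat before filtering out a host is to check whether the two variants really serve the same content. The sketch below compares digests of two response bodies. The helper names `content_digest` and `same_page` are hypothetical, the sample bodies are placeholders rather than real fetches, and an exact byte comparison is only an assumption about what "the same page" should mean.

```python
import hashlib


def content_digest(body: bytes) -> str:
    """Return a hex SHA-256 digest of a response body."""
    return hashlib.sha256(body).hexdigest()


def same_page(body_a: bytes, body_b: bytes) -> bool:
    """True when both bodies are byte-for-byte identical."""
    return content_digest(body_a) == content_digest(body_b)


# Placeholder bodies standing in for the responses from
# www.mywebsite.com/page1 and mywebsite.com/page1.
www_body = b"<html><body>Page 1</body></html>"
bare_body = b"<html><body>Page 1</body></html>"
other_body = b"<html><body>Something else</body></html>"

print(same_page(www_body, bare_body))
print(same_page(www_body, other_body))
```

If the bodies differ, dropping the non-www host would lose pages rather than just remove duplicates.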

