Posted to user@nutch.apache.org by Vertical Search <ve...@gmail.com> on 2006/03/10 05:58:09 UTC

URL containing "?", "&" and "="

Okay, I have noticed that for URLs containing "?", "&" and "=" I cannot
crawl.
I have tried all combinations of modifying crawl-urlfilter.txt and
# skip URLs containing certain characters as probable queries, etc.
+[?*!@=]

But in vain. I have hit a road block.. that is terrible.. :(
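
For context, crawl-urlfilter.txt is a list of regular-expression rules, one per line: a leading "+" accepts matching URLs, "-" rejects them, "#" marks a comment, and the first rule that matches a URL decides its fate. Assuming the stock file shipped with Nutch at the time, the section being edited looks roughly like this:

    # skip URLs containing certain characters as probable queries, etc.
    -[?*!@=]

    # accept hosts in MY.DOMAIN.NAME
    +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

    # skip everything else
    -.

Changing the first rule to +[?*!@=] accepts, rather than rejects, any URL containing one of those characters; and because the first match wins, it also bypasses the MY.DOMAIN.NAME check below, which is the problem pointed out later in the thread.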

Re: URL containing "?", "&" and "="

Posted by Vertical Search <ve...@gmail.com>.
The URL is
search_results.html?country1=USA&search_type_form=quick&updated_since=sixtydays&basicsearch=0&advancedsearch=0&keywords_all=motel&search=Search&metro_area=1&kw=motel

I am using the nightly build from 8th March..

Thanks
Sudhi

On 3/9/06, Vertical Search <ve...@gmail.com> wrote:
>
>  Okay, I have noticed that for URLs containing "?", "&" and "=" I cannot
> crawl.
> I have tried all combinations of modifying crawl-urlfilter.txt and
> # skip URLs containing certain characters as probable queries, etc.
> +[?*!@=]
>
> But in vain. I have hit a road block.. that is terrible.. :(
>
>
>

Re: URL containing "?", "&" and "="

Posted by Marko Bauhardt <mb...@media-style.com>.
Do you crawl the intranet or do you crawl the web? If you crawl the  
web then you must edit the urlfilter-regex.txt and not the
crawl-urlfilter.txt.
In your first mail you said you get an exception like  
"org.apache.nutch.net.URLFilter not found". Does the exception still  
occur?


Marko
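
A note on the two files, since the names are close: the regex filter plugin (urlfilter-regex) reads its rules from the file named by the urlfilter.regex.file property, which points at regex-urlfilter.txt in the stock conf directory, while the one-shot crawl command overrides that property (via crawl-tool.xml in the stock configuration) so that crawl-urlfilter.txt is used instead. This reflects the stock 0.7/0.8-era setup and is worth verifying against your own nutch-default.xml. For a whole-web crawl, the same query-skipping rule would be commented out in regex-urlfilter.txt:

    # skip URLs containing certain characters as probable queries, etc.
    # -[?*!@=]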


Re: URL containing "?", "&" and "="

Posted by Vertical Search <ve...@gmail.com>.
Yes. I did comment out the line as Marko suggested,
#[?*!@=], in crawl-urlfilter.txt.
But it still did not fetch the URLs. Is this the only thing, or should I escape
the characters in the urls file list?

Thanks



On 3/10/06, Richard Braman <rb...@bramantax.com> wrote:
>
> Woa!
>
> If you want to include all urls, don't do +, as that will make all urls
> with ?&= get fetched, ignoring all of your other filters
>
> just comment the line out.
>
> -----Original Message-----
> From: Vertical Search [mailto:vertical.searchh@gmail.com]
> Sent: Friday, March 10, 2006 8:27 AM
> To: nutch-user
> Subject: Re: URL containing "?", "&" and "="
>
>
> Mark,
> I did follow your advice. I modified the following line in
> crawl-urlfilter.txt. But no difference. Should I escape the characters
> in the urls folder?
>
> Thanks
>
>
>
> On 3/9/06, Vertical Search <ve...@gmail.com> wrote:
> >
> >  Okay, I have noticed that for URLs containing "?", "&" and "=" I
> > cannot crawl. I have tried all combinations of modifying
> > crawl-urlfilter.txt and # skip URLs containing certain characters as
> > probable queries, etc.
> > +[?*!@=]
> >
> > But in vain. I have hit a road block.. that is terrible.. :(
> >
> >
> >
>
>

RE: URL containing "?", "&" and "="

Posted by Richard Braman <rb...@bramantax.com>.
Just to be clear, what Marko said,
#[?*!@=]
is correct.
Comment the line out.

-----Original Message-----
From: Richard Braman [mailto:rbraman@bramantax.com] 
Sent: Friday, March 10, 2006 8:50 AM
To: nutch-user@lucene.apache.org
Subject: RE: URL containing "?", "&" and "="


Woa!

If you want to include all urls, don't do +, as that will make all urls
with ?&= get fetched, ignoring all of your other filters

just comment the line out.

-----Original Message-----
From: Vertical Search [mailto:vertical.searchh@gmail.com] 
Sent: Friday, March 10, 2006 8:27 AM
To: nutch-user
Subject: Re: URL containing "?", "&" and "="


Mark,
I did follow your advice. I modified the following line in
crawl-urlfilter.txt. But no difference. Should I escape the characters
in the urls folder?

Thanks



On 3/9/06, Vertical Search <ve...@gmail.com> wrote:
>
>  Okay, I have noticed that for URLs containing "?", "&" and "=" I
> cannot crawl. I have tried all combinations of modifying 
> crawl-urlfilter.txt and # skip URLs containing certain characters as 
> probable queries, etc.
> +[?*!@=]
>
> But in vain. I have hit a road block.. that is terrible.. :(
>
>
>


RE: URL containing "?", "&" and "="

Posted by Richard Braman <rb...@bramantax.com>.
Woa!

If you want to include all urls, don't do +, as that will make all urls
with ?&= get fetched, ignoring all of your other filters

just comment the line out.

-----Original Message-----
From: Vertical Search [mailto:vertical.searchh@gmail.com] 
Sent: Friday, March 10, 2006 8:27 AM
To: nutch-user
Subject: Re: URL containing "?", "&" and "="


Mark,
I did follow your advice. I modified the following line in
crawl-urlfilter.txt. But no difference. Should I escape the characters
in the urls folder?

Thanks



On 3/9/06, Vertical Search <ve...@gmail.com> wrote:
>
>  Okay, I have noticed that for URLs containing "?", "&" and "=" I 
> cannot crawl. I have tried all combinations of modifying 
> crawl-urlfilter.txt and # skip URLs containing certain characters as 
> probable queries, etc.
> +[?*!@=]
>
> But in vain. I have hit a road block.. that is terrible.. :(
>
>
>


Re: URL containing "?", "&" and "="

Posted by Vertical Search <ve...@gmail.com>.
Mark,
I did follow your advice. I modified the following line in
crawl-urlfilter.txt.
But no difference.
Should I escape the characters in the urls folder?

Thanks



On 3/9/06, Vertical Search <ve...@gmail.com> wrote:
>
>  Okay, I have noticed that for URLs containing "?", "&" and "=" I cannot
> crawl.
> I have tried all combinations of modifying crawl-urlfilter.txt and
> # skip URLs containing certain characters as probable queries, etc.
> +[?*!@=]
>
> But in vain. I have hit a road block.. that is terrible.. :(
>
>
>

crawling etiquette

Posted by Howie Wang <ho...@hotmail.com>.
I was wondering what others are setting the max number of fetches
per host to. I'm currently doing between 500 and 1000. Do you not set
this at all and just set a timeout between fetches to the same host?

Howie
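
There is no single right number here, but the knobs usually involved are the fetcher politeness settings in nutch-site.xml. A minimal sketch, assuming a 0.8-era build; the property names (generate.max.per.host in particular) should be checked against nutch-default.xml before relying on them:

    <!-- nutch-site.xml: per-host politeness limits (sketch; verify property names) -->
    <property>
      <name>fetcher.server.delay</name>
      <value>5.0</value>  <!-- seconds to wait between requests to the same host -->
    </property>
    <property>
      <name>fetcher.threads.per.host</name>
      <value>1</value>  <!-- one fetcher thread per host keeps the load polite -->
    </property>
    <property>
      <name>generate.max.per.host</name>
      <value>1000</value>  <!-- cap on URLs per host in a fetchlist, in Howie's 500-1000 range -->
    </property>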



Re: URL containing "?", "&" and "="

Posted by Vertical Search <ve...@gmail.com>.
First of all, thank you Richard and Marko.
I am able to move forward.
Now, I have to make sure I don't parse unnecessary URLs in a given page.
Typically sites are organized such that there is a common look and feel
looping back to home and things like that..
I want to just ignore some URLs which are not relevant to my crawl and only
crawl those with a specific pattern.
Can I use the whitelist urlfilter for this purpose.. Can someone help me
understand how it works.. I know how a plugin works. But I need to know
how it actually works..

Thanks



On 3/9/06, Vertical Search <ve...@gmail.com> wrote:
>
> Okay, I have noticed that for URLs containing "?", "&" and "=" I cannot
> crawl.
> I have tried all combinations of modifying crawl-urlfilter.txt and
> # skip URLs containing certain characters as probable queries, etc.
> +[?*!@=]
>
> But in vain. I have hit a road block.. that is terrible.. :(
>
>
>

Re: URL containing "?", "&" and "="

Posted by Marko Bauhardt <mb...@media-style.com>.
On 10.03.2006 at 05:58, Vertical Search wrote:

> Okay, I have noticed that for URLs containing "?", "&" and "=" I  
> cannot
> crawl.
> I have tried all combinations of modifying crawl-urlfilter.txt and
> # skip URLs containing certain characters as probable queries, etc.
> +[?*!@=]


Try #[?*!@=] instead of +[?*!@=].

Marko




Re: URL containing "?", "&" and "="

Posted by Vertical Search <ve...@gmail.com>.
Thanks Marko.  The "URLFilter not found" error was occurring when I tried to run the crawl
command from Eclipse in a debug environment.

When I run from the command line (Cygwin), I don't get the error. Maybe I am missing
something.. I will get it fixed.

Now, coming back to crawling the intranet and the internet: I just tried crawling the
intranet by modifying crawl-urlfilter.txt.
It seems to be working..
For the internet I have yet to try, but I will have to do it from my home computer..

The URL I am trying to fetch is as follows:
search_results.html?country1=USA&search_type_form=quick&updated_since=sixtydays&basicsearch=0&advancedsearch=0&keywords_all=motel&search=Search&metro_area=1&kw=motel

Should I be changing anything in the urls file list?

Thanks
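
One detail worth checking in the urls file (the seed list handed to the crawl command): Nutch expects fully qualified URLs there, including the scheme and host, so a bare path like search_results.html?... cannot be fetched on its own. A sketch of a seed entry, with a hypothetical host standing in for the real one and most of the query string trimmed for brevity:

    http://www.example.com/search_results.html?country1=USA&search_type_form=quick&keywords_all=motel

No escaping of "?", "&" or "=" is needed in the seed file itself; those characters only matter to the URL-filter rules discussed above.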


Do you crawl the intranet or do you crawl the web? If you crawl the
web then you must edit the urlfilter-regex.txt and not the
crawl-urlfilter.txt.
In your first mail you said you get an exception like
"org.apache.nutch.net.URLFilter not found". Does the exception still
occur?


Marko



On 3/9/06, Vertical Search <ve...@gmail.com> wrote:
>
>  Okay, I have noticed that for URLs containing "?", "&" and "=" I cannot
> crawl.
> I have tried all combinations of modifying crawl-urlfilter.txt and
> # skip URLs containing certain characters as probable queries, etc.
> +[?*!@=]
>
> But in vain. I have hit a road block.. that is terrible.. :(
>
>
>