Posted to user@nutch.apache.org by Vertical Search <ve...@gmail.com> on 2006/03/10 05:58:09 UTC
URL containing "?", "&" and "="
Okay, I have noticed that I cannot crawl URLs containing "?", "&" and "=".
I have tried all combinations of modifying crawl-urlfilter.txt, including the
line under
# skip URLs containing certain characters as probable queries, etc.
+[?*!@=]
But in vain. I have hit a roadblock, which is terrible. :(
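For what it's worth, the behavior can be pictured with a small Python simulation of how a Nutch-style regex URL filter evaluates its rules: rules are tried top to bottom, the first rule that matches decides, "+" accepts and "-" rejects. The filter_url helper and the two rule sets below are illustrative sketches, not the actual plugin code or the full default file:

```python
import re

def filter_url(url, rules):
    """Apply Nutch-style regex filter rules: the first rule whose
    pattern is found in the URL decides; '+' accepts, '-' rejects.
    A URL matching no rule at all is rejected."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False

url = ("search_results.html?country1=USA"
       "&search_type_form=quick&keywords_all=motel")

# A ruleset in the spirit of the stock file: skip probable query
# URLs, then accept everything else.
skip_queries = [("-", r"[?*!@=]"), ("+", r".")]
# The same ruleset with the query-character rule commented out (removed).
allow_queries = [("+", r".")]

print(filter_url(url, skip_queries))   # False: query URL is rejected
print(filter_url(url, allow_queries))  # True: query URL now passes
```

This also shows why changing the rule to +[?*!@=] is too blunt: it accepts any URL containing those characters immediately, before any later rule can reject it.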
Re: URL containing "?", "&" and "="
Posted by Vertical Search <ve...@gmail.com>.
The URL is
search_results.html?country1=USA&search_type_form=quick&updated_since=sixtydays&basicsearch=0&advancedsearch=0&keywords_all=motel&search=Search&metro_area=1&kw=motel
I am using the nightly build from 8th March.
Thanks
Sudhi
Re: URL containing "?", "&" and "="
Posted by Marko Bauhardt <mb...@media-style.com>.
Do you crawl the intranet or do you crawl the web? If you crawl the
web, then you must edit regex-urlfilter.txt and not crawl-urlfilter.txt.
In your first mail you said you got an exception like
"org.apache.nutch.net.URLFilter not found". Does the exception still
occur?
Marko
Re: URL containing "?", "&" and "="
Posted by Vertical Search <ve...@gmail.com>.
Yes, I did comment the line out as Marko suggested:
#[?*!@=] in crawl-urlfilter.txt.
But it still did not fetch the URLs. Is this the only thing, or should I
escape the characters in the URL file list?
Thanks
On 3/10/06, Richard Braman <rb...@bramantax.com> wrote:
>
> Whoa!
>
> If you want to include all URLs, don't use +, as that will make all URLs
> with ?&= get fetched, ignoring all of your other filters.
>
> just comment the line out.
RE: URL containing "?", "&" and "="
Posted by Richard Braman <rb...@bramantax.com>.
Just to be clear, what Marko said
#[?*!@=]
is correct: comment the line out.
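For reference, the relevant fragment of crawl-urlfilter.txt would then look something like this (the trailing accept rule is illustrative; the stock file contains more rules):

```text
# skip URLs containing certain characters as probable queries, etc.
# (rule below commented out so query URLs fall through to later rules)
#[?*!@=]

# accept anything else
+.
```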
RE: URL containing "?", "&" and "="
Posted by Richard Braman <rb...@bramantax.com>.
Whoa!
If you want to include all URLs, don't use +, as that will make all URLs
with ?&= get fetched, ignoring all of your other filters.
Just comment the line out.
Re: URL containing "?", "&" and "="
Posted by Vertical Search <ve...@gmail.com>.
Marko,
I did follow your advice and modified that line in crawl-urlfilter.txt,
but it made no difference.
Should I escape the characters in the urls folder?
Thanks
crawling etiquette
Posted by Howie Wang <ho...@hotmail.com>.
I was wondering what others are setting the max number of fetches
per host to. I'm currently doing between 500 and 1000. Do you not set
this at all, and just set a timeout between fetches to the same host?
Howie
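Two knobs commonly tuned for this live in nutch-site.xml. Treat the property names below as assumptions to verify against your own nutch-default.xml, since they vary between Nutch versions:

```xml
<!-- illustrative nutch-site.xml overrides; check exact names in nutch-default.xml -->
<configuration>
  <property>
    <name>generate.max.per.host</name>
    <value>1000</value>
    <description>Cap on URLs per host in a single fetchlist.</description>
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
    <description>Seconds to wait between requests to the same host.</description>
  </property>
</configuration>
```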
Re: URL containing "?", "&" and "="
Posted by Vertical Search <ve...@gmail.com>.
First of all, thank you Richard and Marko.
I am able to move forward.
Now I have to make sure I don't parse unnecessary URLs on a given page.
Typically sites are organized such that there is a common look and feel,
with links looping back to home and things like that.
I want to ignore URLs that are not relevant to my crawl and only
crawl those with a specific pattern.
Can I use the whitelist URL filter for this purpose? Can someone help me
understand how it works? I know how a plugin works in general, but I need
to know how this one actually works.
Thanks
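One way to picture a whitelist-style filter: reject by default, and accept only URLs matching your patterns. A rough Python sketch of that logic follows; the real thing in Nutch is a URLFilter plugin whose filter(String) method returns the URL to keep it or null to drop it, and the names and patterns here are purely illustrative:

```python
import re

def whitelist_filter(url, allowed_patterns):
    """Keep a URL only if it matches at least one allowed pattern.
    Returning None drops the URL, mirroring how a Nutch URLFilter's
    filter(String) returns null to reject."""
    for pattern in allowed_patterns:
        if re.search(pattern, url):
            return url
    return None

# Only follow search-result pages; ignore common look-and-feel links.
allowed = [r"search_results\.html\?"]

print(whitelist_filter("search_results.html?kw=motel", allowed))
print(whitelist_filter("about_us.html", allowed))  # None: dropped
```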
Re: URL containing "?", "&" and "="
Posted by Marko Bauhardt <mb...@media-style.com>.
On 10.03.2006 at 05:58, Vertical Search wrote:
> Okay, I have noticed that for URLs containing "?", "&" and "=" I
> cannot
> crawl.
> I have tried all combinations of modifying crawl-urlfilter.txt and
> # skip URLs containing certain characters as probable queries, etc.
> +[?*!@=]
Try #[?*!@=] instead of +[?*!@=].
Marko
Re: URL containing "?", "&" and "="
Posted by Vertical Search <ve...@gmail.com>.
Thanks, Marko. The "URLFilter not found" error was occurring when I tried
to run the crawl command from Eclipse in a debug environment.
When I run it from the command line (Cygwin), I don't get the error. Maybe
I am missing something; I will get it fixed.
Now, coming back to crawling the intranet and the internet: I just tried
crawling the intranet by modifying crawl-urlfilter.txt, and it seems to
be working.
For the internet I have yet to try it, but I will have to do that from my
home computer.
The URL I am trying to fetch is as follows:
search_results.html?country1=USA&search_type_form=quick&updated_since=sixtydays&basicsearch=0&advancedsearch=0&keywords_all=motel&search=Search&metro_area=1&kw=motel
Should I be changing anything in the urls file list?
Thanks