Posted to user@nutch.apache.org by Damian Florczyk <th...@gentoo.org> on 2007/03/23 14:20:32 UTC
Nutch and GET
Hi there,
Can Nutch index dynamic pages with multiple GET parameters in the request?
--
Damian Florczyk aka thunder
Gentoo Developer, Gentoo/NetBSD Development Lead
Re: Nutch and GET
Posted by Damian Florczyk <th...@gentoo.org>.
Sami Siren wrote:
> Damian Florczyk wrote:
>> Can Nutch index dynamic pages with multiple GET parameters in the request?
>
> Have you allowed them in the URL filter configuration? By default, the regex
> urlfilter filters those out:
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> --
> Sami Siren
Yes, I did that. However, I've crawled another application that has
shorter GET parameters, and everything works fine there. Strange; maybe long
URLs are not crawled properly, or something like that?
Re: Nutch and GET
Posted by Sami Siren <ss...@gmail.com>.
Damian Florczyk wrote:
> Hi there,
>
> Can Nutch index dynamic pages with multiple GET parameters in the request?
>
Have you allowed them in the URL filter configuration? By default, the regex
urlfilter filters those out:
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
--
Sami Siren
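
To illustrate the point about the default rule, here is a minimal Python sketch of how a regex URL filter of this kind might treat a URL with query parameters. It assumes that rules are tried in order, that the first pattern found anywhere in the URL decides, and that a URL matching no rule is rejected. The names `parse_rules` and `accepts`, the sample rule sets, and the sample URL are made up for illustration; this is not Nutch's actual code.

```python
import re

def parse_rules(text):
    """Parse filter lines of the form '+regex' or '-regex'; skip blanks and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Assumed semantics: the first rule whose pattern is found in the URL decides."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # assumed: no rule matched, so the URL is rejected

# Hypothetical rule set with the default "skip queries" line active.
DEFAULT_RULES = """
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
+.
"""

# Same rule set with that line commented out.
RELAXED_RULES = """
# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]
+.
"""

url = "http://example.com/app?a=1&b=2"  # hypothetical URL with two GET parameters
print(accepts(parse_rules(DEFAULT_RULES), url))
print(accepts(parse_rules(RELAXED_RULES), url))
```

With the rule active, the `?` and `=` characters in the URL trigger the exclusion; with the rule commented out, the catch-all `+.` line accepts the URL.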
Re: Nutch and GET
Posted by Damian Florczyk <th...@gentoo.org>.
Ravi Chintakunta wrote:
> Hi,
>
> That shouldn't be an issue.
>
> Are you sure that this line
>
> -[?*!@=]
>
> is commented out in the crawl-urlfilter.txt file?
>
> - Ravi Chintakunta
Yes, I'm sure. The URLs are more than 100 characters long.
--
Damian Florczyk aka thunder
Gentoo Developer, Gentoo/NetBSD Development Lead
Re: Nutch and GET
Posted by Ravi Chintakunta <ra...@gmail.com>.
Hi,
That shouldn't be an issue.
Are you sure that this line
-[?*!@=]
is commented out in the crawl-urlfilter.txt file?
- Ravi Chintakunta
Re: Nutch and GET
Posted by Damian Florczyk <th...@gentoo.org>.
Ravi Chintakunta wrote:
> Try this:
>
> db.max.anchor.length
>
> - Ravi Chintakunta
Well, that doesn't resolve my problem. But my problem may be connected with
the URL I'm trying to index. It looks like
http://some.example/dir/pre?sth=aa&sth2=bb
Maybe this "pre" (which doesn't have any extension) couldn't be associated with
any parser?
--
Damian Florczyk aka thunder
Gentoo Developer, Gentoo/NetBSD Development Lead
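
As a small aid for reasoning about the URL shape described in this message, the following Python sketch splits the example URL into its path and GET parameters and shows that the last path segment carries no file extension. It uses only the standard library. Whether a missing extension actually affects parser selection in Nutch is not established in this thread, so treat that part as an open question.

```python
from posixpath import splitext
from urllib.parse import urlsplit, parse_qs

url = "http://some.example/dir/pre?sth=aa&sth2=bb"

parts = urlsplit(url)
params = parse_qs(parts.query)        # maps each GET parameter to a list of values
extension = splitext(parts.path)[1]   # empty string when the path has no extension

print(parts.path)        # path component, without the query string
print(params)            # parsed GET parameters
print(repr(extension))   # extension of the last path segment, if any
```

Both parameters are part of the query string rather than the path, so the path is just `/dir/pre` and no extension is available to hint at a content type.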
Re: Nutch and GET
Posted by Ravi Chintakunta <ra...@gmail.com>.
Try this:
db.max.anchor.length
- Ravi Chintakunta
Re: Nutch and GET
Posted by Damian Florczyk <th...@gentoo.org>.
Ravi Chintakunta wrote:
> Yes, Nutch can crawl and index any URLs that it can access.
>
> You may have to tweak the max. URL length in the configuration.
>
> - Ravi Chintakunta
Which parameter is it? I cannot find it in nutch-default.xml.
--
Damian Florczyk aka thunder
Gentoo Developer, Gentoo/NetBSD Development Lead
Re: Nutch and GET
Posted by Ravi Chintakunta <ra...@gmail.com>.
Yes, Nutch can crawl and index any URLs that it can access.
You may have to tweak the max. URL length in the configuration.
- Ravi Chintakunta