Posted to user@nutch.apache.org by Damian Florczyk <th...@gentoo.org> on 2007/03/23 14:20:32 UTC

Nutch and GET

Hi there,

Can Nutch index dynamic pages with multiple GET parameters in the request?

--
Damian Florczyk aka thunder
Gentoo Developer, Gentoo/NetBSD Development Lead

Re: Nutch and GET

Posted by Damian Florczyk <th...@gentoo.org>.
Sami Siren wrote:
> Damian Florczyk wrote:
>> Hi there,
>>
>> Can Nutch index dynamic pages with multiple GET parameters in the request?
>
> Have you allowed them in the URL filter configuration? By default, the regex
> urlfilter filters those out:
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> --
>  Sami Siren


Yes, I did that. I have also crawled another application that uses
shorter GET parameters, and everything works fine there. Strange. Maybe
long URLs are not crawled properly, or something like that?

Re: Nutch and GET

Posted by Sami Siren <ss...@gmail.com>.
Damian Florczyk wrote:
> Hi there,
> 
> Can Nutch index dynamic pages with multiple GET parameters in the request?
> 

Have you allowed them in the URL filter configuration? By default, the regex
urlfilter filters those out:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]



--
 Sami Siren
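
To illustrate why the default rule quoted above drops URLs with GET parameters, here is a minimal Python sketch. It assumes the filter applies the pattern after the leading "-" as a partial (find-style) match against the URL; the function name is hypothetical and this is not Nutch's own code.

```python
import re

# The character class from the default rule, without its leading "-"
# (the "-" marks the rule as a reject rule in the filter file).
QUERY_CHARS = re.compile(r"[?*!@=]")


def rejected_by_default_rule(url: str) -> bool:
    """Return True if the URL contains any character the rule skips.

    Assumption: the rule is applied as a partial match anywhere in the URL.
    """
    return QUERY_CHARS.search(url) is not None


# A URL with a query string contains "?" and "=", so the rule matches it.
print(rejected_by_default_rule("http://some.example/dir/pre?sth=aa&sth2=bb"))
# A plain static URL contains none of the listed characters.
print(rejected_by_default_rule("http://some.example/dir/page.html"))
```

Under that assumption, any URL carrying a query string is rejected until the rule is commented out or replaced by a narrower one.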

Re: Nutch and GET

Posted by Damian Florczyk <th...@gentoo.org>.
Ravi Chintakunta wrote:
> Hi,
>
> That shouldn't be an issue.
>
> Are you sure that this line
>
> -[?*!@=]
>
> is commented out in the crawl-urlfilter.txt file?
>
> - Ravi Chintakunta
>
> On 3/23/07, Damian Florczyk <th...@gentoo.org> wrote:
>> Ravi Chintakunta wrote:
>>> Try this:
>>>
>>> db.max.anchor.length
>>>
>>> - Ravi Chintakunta
>>>
>>> On 3/23/07, Damian Florczyk <th...@gentoo.org> wrote:
>>>> Ravi Chintakunta wrote:
>>>>> Yes, Nutch can crawl and index any URLs that it can access.
>>>>>
>>>>> You may have to tweak the max. URL length in the configuration.
>>>>>
>>>>> - Ravi Chintakunta
>>>>>
>>>>> On 3/23/07, Damian Florczyk <th...@gentoo.org> wrote:
>>>>>> Hi there,
>>>>>>
>>>>>> Can Nutch index dynamic pages with multiple GET parameters in the
>>>>>> request?
>>>>
>>>> Which param is it? I cannot find it in nutch-default.xml.
>>
>> Well, it doesn't resolve my problem, but my problem may be connected with
>> the URL I'm trying to index. It looks like
>> http://some.example/dir/pre?sth=aa&sth2=bb
>> Maybe this "pre" (which doesn't have any extension) couldn't be associated
>> with any parser?

Yes, I'm sure. The URLs are more than 100 characters long.


--
Damian Florczyk aka thunder
Gentoo Developer, Gentoo/NetBSD Development Lead

Re: Nutch and GET

Posted by Ravi Chintakunta <ra...@gmail.com>.
Hi,

That shouldn't be an issue.

Are you sure that this line

-[?*!@=]

is commented out in the crawl-urlfilter.txt file?

- Ravi Chintakunta

On 3/23/07, Damian Florczyk <th...@gentoo.org> wrote:
> Ravi Chintakunta wrote:
> > Try this:
> >
> > db.max.anchor.length
> >
> > - Ravi Chintakunta
> >
> > On 3/23/07, Damian Florczyk <th...@gentoo.org> wrote:
> > > Ravi Chintakunta wrote:
> > > > Yes, Nutch can crawl and index any URLs that it can access.
> > > >
> > > > You may have to tweak the max. URL length in the configuration.
> > > >
> > > > - Ravi Chintakunta
> > > >
> > > > On 3/23/07, Damian Florczyk <th...@gentoo.org> wrote:
> > > > > Hi there,
> > > > >
> > > > > Can Nutch index dynamic pages with multiple GET parameters in the
> > > > > request?
> > >
> > > Which param is it? I cannot find it in nutch-default.xml.
>
> Well, it doesn't resolve my problem, but my problem may be connected with
> the URL I'm trying to index. It looks like
> http://some.example/dir/pre?sth=aa&sth2=bb
> Maybe this "pre" (which doesn't have any extension) couldn't be associated
> with any parser?
> --
> Damian Florczyk aka thunder
> Gentoo Developer, Gentoo/NetBSD Development Lead
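
One way to double-check whether the query-character rule is really commented out is to list the active (non-comment) lines of the filter file. The following is a minimal Python sketch. It assumes that lines starting with "#" are comments and that blank lines are ignored; the function name and the sample rules are hypothetical, and in practice the text would be read from your own crawl-urlfilter.txt.

```python
def active_rules(filter_text: str) -> list[str]:
    """Return the non-blank, non-comment lines of a URL filter file.

    Assumption: a line whose first non-space character is "#" is a comment.
    """
    rules = []
    for line in filter_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        rules.append(stripped)
    return rules


# Hypothetical file contents with the query-character rule commented out.
sample_text = "\n".join([
    "# skip URLs containing certain characters as probable queries, etc.",
    "# -[?*!@=]",
    r"+^http://([a-z0-9]*\.)*some.example/",
])

rules = active_rules(sample_text)
# True would mean the rule is still active and query URLs are filtered away.
print("-[?*!@=]" in rules)
```

If the check prints True for the real file, the rule is still in effect and URLs with GET parameters will be skipped.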

Re: Nutch and GET

Posted by Damian Florczyk <th...@gentoo.org>.
Ravi Chintakunta wrote:
> Try this:
>
> db.max.anchor.length
>
> - Ravi Chintakunta
>
> On 3/23/07, Damian Florczyk <th...@gentoo.org> wrote:
>> Ravi Chintakunta wrote:
>>> Yes, Nutch can crawl and index any URLs that it can access.
>>>
>>> You may have to tweak the max. URL length in the configuration.
>>>
>>> - Ravi Chintakunta
>>>
>>> On 3/23/07, Damian Florczyk <th...@gentoo.org> wrote:
>>>> Hi there,
>>>>
>>>> Can Nutch index dynamic pages with multiple GET parameters in the
>>>> request?
>>
>> Which param is it? I cannot find it in nutch-default.xml.

Well, it doesn't resolve my problem, but my problem may be connected with
the URL I'm trying to index. It looks like
http://some.example/dir/pre?sth=aa&sth2=bb
Maybe this "pre" (which doesn't have any extension) couldn't be associated
with any parser?

--
Damian Florczyk aka thunder
Gentoo Developer, Gentoo/NetBSD Development Lead
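
To make the shape of that example URL concrete, here is a small Python sketch that splits it into its path and query parameters using the standard library. It only shows that the last path segment has no file extension and that the query carries two parameters. Whether that has any bearing on parser selection in Nutch is not settled in this thread.

```python
import posixpath
from urllib.parse import parse_qs, urlsplit

# The example URL from the message above.
url = "http://some.example/dir/pre?sth=aa&sth2=bb"

parts = urlsplit(url)
# Extension of the last path segment; empty string when there is none.
extension = posixpath.splitext(parts.path)[1]
# Query string decoded into a dict of parameter name -> list of values.
params = parse_qs(parts.query)

print(parts.path)       # /dir/pre
print(repr(extension))  # ''
print(params)           # {'sth': ['aa'], 'sth2': ['bb']}
```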

Re: Nutch and GET

Posted by Ravi Chintakunta <ra...@gmail.com>.
Try this:

db.max.anchor.length

- Ravi Chintakunta

On 3/23/07, Damian Florczyk <th...@gentoo.org> wrote:
> Ravi Chintakunta wrote:
> > Yes, Nutch can crawl and index any URLs that it can access.
> >
> > You may have to tweak the max. URL length in the configuration.
> >
> > - Ravi Chintakunta
> >
> > On 3/23/07, Damian Florczyk <th...@gentoo.org> wrote:
> > > Hi there,
> > >
> > > Can Nutch index dynamic pages with multiple GET parameters in the
> > > request?
>
> Which param is it? I cannot find it in nutch-default.xml.
>
> --
> Damian Florczyk aka thunder
> Gentoo Developer, Gentoo/NetBSD Development Lead

Re: Nutch and GET

Posted by Damian Florczyk <th...@gentoo.org>.
Ravi Chintakunta wrote:
> Yes, Nutch can crawl and index any URLs that it can access.
>
> You may have to tweak the max. URL length in the configuration.
>
> - Ravi Chintakunta
>
> On 3/23/07, Damian Florczyk <th...@gentoo.org> wrote:
>> Hi there,
>>
>> Can Nutch index dynamic pages with multiple GET parameters in the
>> request?

Which param is it? I cannot find it in nutch-default.xml.


--
Damian Florczyk aka thunder
Gentoo Developer, Gentoo/NetBSD Development Lead

Re: Nutch and GET

Posted by Ravi Chintakunta <ra...@gmail.com>.
Yes, Nutch can crawl and index any URLs that it can access.

You may have to tweak the max. URL length in the configuration.

- Ravi Chintakunta

On 3/23/07, Damian Florczyk <th...@gentoo.org> wrote:
> Hi there,
>
> Can Nutch index dynamic pages with multiple GET parameters in the request?
>
> --
> Damian Florczyk aka thunder
> Gentoo Developer, Gentoo/NetBSD Development Lead