
Posted to user@nutch.apache.org by Julien Nioche <li...@gmail.com> on 2009/01/20 16:19:24 UTC

Redirections and linkDB

Hi guys,

I had a look at the class URLUtil and found out about the rules for choosing the
representation of a redirection. I tried it on a very simple example and can
see that the target URL gets a *_repr_* as expected. However, this does not
seem to be used when generating inverted links, i.e. the linkDB contains the
link between the source page and the redirected page but not to the target
page it redirects to. I can see that the *_repr_* attribute is used during
indexing but I was expecting the linkDB to use it too.
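
To make the observation concrete, here is a minimal sketch of link inversion. It is not Nutch code (Nutch is Java and builds the linkDB with MapReduce jobs); the function name, the `redirects` map and the example URLs are all hypothetical. It only illustrates the difference between crediting an inlink to the redirecting URL alone and also crediting it to the page it redirects to.

```python
from collections import defaultdict


def invert_links(outlinks, redirects=None):
    """Turn source -> [(target, anchor)] into target -> [(source, anchor)].

    `redirects` is a hypothetical map from a redirecting URL to its final
    target. When it is given, each inlink is credited to both URLs.
    """
    redirects = redirects or {}
    inlinks = defaultdict(list)
    for source, links in outlinks.items():
        for target, anchor in links:
            inlinks[target].append((source, anchor))
            final = redirects.get(target)
            if final is not None and final != target:
                # Also attach the anchor to the page the link ends up at.
                inlinks[final].append((source, anchor))
    return dict(inlinks)


# Hypothetical example data: one page linking to a URL that redirects.
outlinks = {"http://a.example/": [("http://b.example/old", "B's page")]}
redirects = {"http://b.example/old": "http://b.example/new"}

plain = invert_links(outlinks)                 # behaviour described above
resolved = invert_links(outlinks, redirects)   # behaviour that was expected
```

In `plain`, only `http://b.example/old` has an inlink; in `resolved`, `http://b.example/new` receives the same source and anchor text as well.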

Am I missing something here?

Thanks

Julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com

Re: Redirections and linkDB

Posted by Doğacan Güney <do...@gmail.com>.

Hi,

2009/1/20 Julien Nioche <li...@gmail.com>:
> Hello Doğacan,
>
>
> Thank you for your reply. It was more out of curiosity than anything else,
> although I suppose that this would allow us to get anchors for the target as
> well and hence potentially influence the scoring.
>

Using redirects for scoring sounds like a good idea. However, scoring-opic
can't really use redirects (as it is more of an on-the-fly scoring). I don't know
whether Dennis Kubes' new scoring system uses redirects, or could use them. But
that also doesn't use linkdb; it builds a webgraph from segments (I think :).

> Julien
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>



-- 
Doğacan Güney

Re: AW: fetching https documents

Posted by Alex Basa <al...@yahoo.com>.

Thanks Vimal.

I switched the plugin to protocol-httpclient,
set http.useHttp11 to true,
and updated

commons-httpclient-3.0.1.jar

to

commons-httpclient-3.1.jar

and it seems fine now.
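
For anyone following along, the two configuration changes might look roughly like the fragment below. This is a sketch, assuming the overrides go into conf/nutch-site.xml; the plugin list shown is illustrative only, so keep whatever your existing plugin.includes value contains and just swap protocol-http for protocol-httpclient.

```xml
<!-- conf/nutch-site.xml (sketch, not a complete file from a real install) -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Assumption: illustrative plugin list. Only the protocol-httpclient
         entry matters here; keep the rest of your own list unchanged. -->
    <value>protocol-httpclient|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
  </property>
  <property>
    <name>http.useHttp11</name>
    <value>true</value>
  </property>
</configuration>
```

The jar swap is a separate, manual step and is not expressed in this file.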

--- On Wed, 1/21/09, Vimal Varghese <vi...@tcs.com> wrote:

> From: Vimal Varghese <vi...@tcs.com>
> Subject: Re: AW: fetching https documents
> To: nutch-user@lucene.apache.org
> Cc: "alex_basa@yahoo.com" <al...@yahoo.com>, "nutch-user@lucene.apache.org" <nu...@lucene.apache.org>
> Date: Wednesday, January 21, 2009, 10:45 PM
> Hi Alex,
> 
> If it's not fetching https, you can try adding this https
> line to your
> crawl-urlfilter.txt file:
> 
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*(DOMAIN1|DOMAIN2)/
> +^https://([a-z0-9]*\.)*(DOMAIN1|DOMAIN2)/
> 
> After adding this line it will fetch all the https URLs.
> 
> But I am still getting these exceptions for the https URLs:
> 
> javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection?
> 
> org.apache.nutch.protocol.http.api.HttpException: java.net.UnknownHostException: secure.americanexpress.com
> 
> Vimal Varghese
> 
> 
> 
> 
> From: Koch Martina <Ko...@huberverlag.de>
> Date: 21-01-09 04:05 PM
> Reply-To: nutch-user@lucene.apache.org
> To: "nutch-user@lucene.apache.org" <nu...@lucene.apache.org>, "alex_basa@yahoo.com" <al...@yahoo.com>
> Subject: AW: fetching https documents
> 
> Hi Alex,
> 
> https pages can be fetched with the protocol-httpclient
> plugin.
> 
> Kind regards,
> Martina
> 
> 
> -----Original Message-----
> From: Alex Basa [mailto:alex_basa@yahoo.com]
> Sent: Wednesday, 21 January 2009 00:41
> To: nutch-user@lucene.apache.org
> Subject: fetching https documents
> 
> I searched for patches and couldn't find one.  Does
> anyone know if nutch 
> 0.9 supports crawling https websites?  If so, can someone
> point me to the 
> patch?
> 
> Thanks in advance,
> 
> Alex
> 
> 
>  
> 

Re: AW: fetching https documents

Posted by Vimal Varghese <vi...@tcs.com>.

Hi Alex,

If it's not fetching https, you can try adding this https line to your
crawl-urlfilter.txt file:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*(DOMAIN1|DOMAIN2)/
+^https://([a-z0-9]*\.)*(DOMAIN1|DOMAIN2)/

After adding this line it will fetch all the https URLs.
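
As a rough illustration of why the extra line matters, here is a small sketch of how ordered +/- rules of this kind can be evaluated. It is written in Python purely for illustration (Nutch's own filter is Java), the helper names are made up, and example.com / example.org are hypothetical stand-ins for DOMAIN1 / DOMAIN2. It assumes the usual behaviour that the first matching rule decides and that a URL matching no rule is ignored.

```python
import re

# Rules in the style of crawl-urlfilter.txt: '+' accepts, '-' rejects.
# example.com / example.org are hypothetical stand-ins for DOMAIN1 / DOMAIN2.
RULES = r"""
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*(example\.com|example\.org)/
+^https://([a-z0-9]*\.)*(example\.com|example\.org)/
# skip everything else
-.
"""


def parse_rules(text):
    """Return a list of (accept, compiled_pattern), skipping blanks and comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # A leading '+' means accept; anything else is treated as reject here.
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(url, rules):
    """First matching rule decides; a URL that matches no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


FILTER_RULES = parse_rules(RULES)
for url in ("http://www.example.org/",
            "https://secure.example.com/login",
            "https://www.other.test/"):
    print(url, accepts(url, FILTER_RULES))
```

Without the `+^https://` rule, the second URL would fall through to the catch-all `-.` rule, because `^http://` does not match a URL that starts with `https://`.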

But I am still getting these exceptions for the https URLs:

javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection?

org.apache.nutch.protocol.http.api.HttpException: java.net.UnknownHostException: secure.americanexpress.com

Vimal Varghese




From: Koch Martina <Ko...@huberverlag.de>
Date: 21-01-09 04:05 PM
Reply-To: nutch-user@lucene.apache.org
To: "nutch-user@lucene.apache.org" <nu...@lucene.apache.org>, "alex_basa@yahoo.com" <al...@yahoo.com>
Subject: AW: fetching https documents

Hi Alex,

https pages can be fetched with the protocol-httpclient plugin.

Kind regards,
Martina


-----Original Message-----
From: Alex Basa [mailto:alex_basa@yahoo.com]
Sent: Wednesday, 21 January 2009 00:41
To: nutch-user@lucene.apache.org
Subject: fetching https documents

I searched for patches and couldn't find one.  Does anyone know if nutch 
0.9 supports crawling https websites?  If so, can someone point me to the 
patch?

Thanks in advance,

Alex


 


AW: fetching https documents

Posted by Koch Martina <Ko...@huberverlag.de>.

Hi Alex,

https pages can be fetched with the protocol-httpclient plugin.

Kind regards,
Martina


-----Original Message-----
From: Alex Basa [mailto:alex_basa@yahoo.com]
Sent: Wednesday, 21 January 2009 00:41
To: nutch-user@lucene.apache.org
Subject: fetching https documents

I searched for patches and couldn't find one.  Does anyone know if nutch 0.9 supports crawling https websites?  If so, can someone point me to the patch?

Thanks in advance,

Alex

fetching https documents

Posted by Alex Basa <al...@yahoo.com>.

I searched for patches and couldn't find one.  Does anyone know if nutch 0.9 supports crawling https websites?  If so, can someone point me to the patch?

Thanks in advance,

Alex

Re: Redirections and linkDB

Posted by Julien Nioche <li...@gmail.com>.

Hello Doğacan,


> On Tue, Jan 20, 2009 at 5:19 PM, Julien Nioche
> <li...@gmail.com> wrote:
> > Hi guys,
> >
> > I had a look at the class URLUtil and found out about the rules for choosing
> > the representation of a redirection. I tried it on a very simple example and
> > can see that the target URL gets a *_repr_* as expected. However, this does
> > not seem to be used when generating inverted links, i.e. the linkDB contains
> > the link between the source page and the redirected page but not to the
> > target page it redirects to. I can see that the *_repr_* attribute is used
> > during indexing but I was expecting the linkDB to use it too.
> >
> > Am I missing something here?
> >
>
> No, not really. We added URLUtil to fix the stupid problem of www.cnn.com
> not showing up when searching for "cnn" (instead of
> www.cnn.com/some_redirect). What use do you see in using redirects for
> linkdb?
>
>
Thank you for your reply. It was more out of curiosity than anything else,
although I suppose that this would allow us to get anchors for the target as
well and hence potentially influence the scoring.

Julien

-- 
DigitalPebble Ltd
http://www.digitalpebble.com

Re: Redirections and linkDB

Posted by Doğacan Güney <do...@gmail.com>.

Hi,

On Tue, Jan 20, 2009 at 5:19 PM, Julien Nioche
<li...@gmail.com> wrote:
> Hi guys,
>
> I had a look at the class URLUtil and found out about the rules for choosing the
> representation of a redirection. I tried it on a very simple example and can
> see that the target URL gets a *_repr_* as expected. However, this does not
> seem to be used when generating inverted links, i.e. the linkDB contains the
> link between the source page and the redirected page but not to the target
> page it redirects to. I can see that the *_repr_* attribute is used during
> indexing but I was expecting the linkDB to use it too.
>
> Am I missing something here?
>

No, not really. We added URLUtil to fix the stupid problem of www.cnn.com not
showing up when searching for "cnn" (instead of www.cnn.com/some_redirect).
What use do you see in using redirects for linkdb?

> Thanks
>
> Julien
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>



-- 
Doğacan Güney