Posted to user@nutch.apache.org by kingping <ik...@gmail.com> on 2012/03/13 13:47:44 UTC

Unable to crawl a specific site - Lithium Based

All, I have been working with Nutch 1.1 for quite some time now and everything
has been working fine, until I came across a site that I am having a ton of
trouble crawling (only one segment folder is created before everything comes
to a halt). I have checked all the logs and nothing I can see gives me
an idea of what the problem might be. The site in question is based on a
product called "Lithium Forums" and the link is
http://h30499.www3.hp.com/t5/Products/ct-p/sws-ProductFamilies. As I said, I
am able to crawl and index pretty much any other site except for this one.
Any suggestions or guidance are greatly appreciated.

Thank you

--
View this message in context: http://lucene.472066.n3.nabble.com/unable-to-crwal-a-specefic-site-Lithium-Based-tp3822114p3822114.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Unable to crawl a specific site - Lithium Based

Posted by Jean-François Gingras <je...@gmail.com>.
Maybe because of this in the HTML header:
<meta name="robots" content="NOFOLLOW" />
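
To check whether a given page carries such a tag, something like the following Python sketch could help. It only uses the standard library and works on HTML you have already saved or fetched. The class and function names (RobotsMetaParser, robots_directives, links_may_be_followed) are made up for illustration, and treating "nofollow" or "none" as "outlinks will not be followed" is an assumption about how a well-behaved crawler reacts, not a statement about the exact Nutch 1.1 code path.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() != "robots":
            return
        # The content attribute is a comma-separated, case-insensitive list.
        for directive in attr_map.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html):
    """Return the set of lower-cased robots meta directives found in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def links_may_be_followed(html):
    """Assumption: 'nofollow' or 'none' means outlinks should not be followed."""
    return not ({"nofollow", "none"} & robots_directives(html))


if __name__ == "__main__":
    # Hypothetical page head, modelled on the tag quoted above.
    sample = '<html><head><meta name="robots" content="NOFOLLOW" /></head></html>'
    print(robots_directives(sample))
    print(links_may_be_followed(sample))
```

If the seed page reports "nofollow" while the site root does not, that would be consistent with only one segment being created for the seed.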

On Tue, Mar 13, 2012 at 1:02 PM, kingping <ik...@gmail.com> wrote:

> I don't know how to explain it, but if I crawl the same site with the path
> stripped from the URL (http://h30499.www3.hp.com), everything seems to work
> fine and I am able to crawl and index the entire site. As soon as I include
> the full path, it does not work. I can't explain this behaviour, but I guess
> I have to live with it for now.
>
> Works
> http://h30499.www3.hp.com/
>
> Does not work
> http://h30499.www3.hp.com/t5/Products/ct-p/sws-ProductFamilies/



-- 
Jean-François Gingras

Re: Unable to crawl a specific site - Lithium Based

Posted by kingping <ik...@gmail.com>.
I don't know how to explain it, but if I crawl the same site with the path
stripped from the URL (http://h30499.www3.hp.com), everything seems to work
fine and I am able to crawl and index the entire site. As soon as I include
the full path, it does not work. I can't explain this behaviour, but I guess
I have to live with it for now.

Works
http://h30499.www3.hp.com/

Does not work
http://h30499.www3.hp.com/t5/Products/ct-p/sws-ProductFamilies/


Re: Unable to crawl a specific site - Lithium Based

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Are you able to crawl similar URLs? The one you've provided is a
strange-looking one. This is an ideal case for the parser checker. I have
never used 1.1, so I can't comment much on what the code is like. I would
strongly advise upgrading to 1.4. Even 1.5 is used extensively in production.

Lewis

On Tue, Mar 13, 2012 at 12:47 PM, kingping <ik...@gmail.com> wrote:

> All, I have been working with Nutch 1.1 for quite some time now and
> everything has been working fine, until I came across a site that I am
> having a ton of trouble crawling (only one segment folder is created before
> everything comes to a halt). I have checked all the logs and nothing I can
> see gives me an idea of what the problem might be. The site in question is
> based on a product called "Lithium Forums" and the link is
> http://h30499.www3.hp.com/t5/Products/ct-p/sws-ProductFamilies. As I said,
> I am able to crawl and index pretty much any other site except for this one.
> Any suggestions or guidance are greatly appreciated.
>
> Thank you



-- 
*Lewis*