Posted to user@nutch.apache.org by blunderboy <sa...@gmail.com> on 2012/03/26 10:32:59 UTC

Nutch not crawling jabong

Hi,
I am using apache-nutch 1.4 and in general it crawls fine, but I have run
into problems crawling some sites.
To test my crawl, I used http://www.jabong.com
I found that Nutch is able to crawl the category pages but not the product pages.

For example, look at this category page:
http://www.jabong.com/men/shoes/mens-sports-shoes/    -----> (Page1)

Nutch does not crawl the pages linked from this page.
The URL of one of the products is:
http://www.jabong.com/Sports-White-Tennis-Shoes-2773.html    -----> (Prod1)


After some research, I found that the site is structured like this:
1. The home directory contains all the product pages.
If you look at the source of Page1, it contains a link to Prod1, which is
actually in the home directory.
So maybe this is the reason the product pages are not being crawled.

Can somebody please tell me how to solve this and make Nutch crawl such
pages too?

--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-not-crawling-jabong-tp3857630p3857630.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Nutch not crawling jabong

Posted by blunderboy <sa...@gmail.com>.
Hi,
After some research I found out why the jabong pages were not being
crawled.

Nutch has to be configured before use. The defaults are set in
nutch-default.xml in the conf directory. You have to set the file content
limit to -1. By default a length limit is specified, so Nutch was not
parsing the whole page; only content up to that length was parsed. That is
why some of the links to the next pages were missed.
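
For reference, the override described above might look like the fragment
below. This is only a sketch under assumptions: the message says "file content
limit" without naming the property, and it is assumed here to be
http.content.limit, which caps how many bytes of a page fetched over HTTP are
kept (file.content.limit is the analogous setting for file:// URLs). Local
overrides conventionally go in conf/nutch-site.xml rather than being edited
into nutch-default.xml.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: local overrides of the defaults in nutch-default.xml -->
<configuration>
  <property>
    <!-- Assumed to be the property meant by "file content limit" -->
    <name>http.content.limit</name>
    <!-- A negative value disables truncation of fetched content -->
    <value>-1</value>
  </property>
</configuration>
```

With no limit, very large documents are downloaded in full, so on a broad
crawl a large positive value may be a safer choice than -1.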




Re: Nutch not crawling jabong

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,

There are plenty of reasons why a document may be missing.
See http://wiki.apache.org/nutch/DebugTool for a list
of possible reasons (sorry, explanations are missing).

About the example from jabong: I got 680 outlinks for
  http://www.jabong.com/men/shoes/mens-sports-shoes/
by calling
 % nutch parsechecker http://www.jabong.com/men/shoes/mens-sports-shoes/
but
  http://www.jabong.com/Sports-White-Tennis-Shoes-2773.html
isn't among them. Many other products are. For example,
 % nutch parsechecker -dumpText http://www.jabong.com/Grey-Running-Shoes-13010.html
succeeds and I got the content. So maybe
the product has just sold out? Even in Firefox I can't
see this pair of shoes. Also, there are many reasons why the
content delivered to the crawler may differ from that seen
in the browser: cookies, dynamic Ajax content, browser switches, ...

Sebastian



Re: Nutch not crawling Matwali

Posted by scodebraker <sc...@gmail.com>.
I checked the Google crawler and I can't find my website in the crawl.
Please help me improve crawling of my website:
http://www.matwali.com




Re: Nutch not crawling jabong

Posted by Mansur <li...@gmail.com>.
The same thing is happening to me with the sites below:

http://www.linenclub.com
http://www.linenore.com
http://www.zovi.com




Re: Nutch not crawling jabong

Posted by blunderboy <sa...@gmail.com>.
Can somebody please help?
Why are some sites not being crawled?
For example, Nutch failed to crawl:
http://www.myntra.com
http://www.jabong.com
http://www.youtube.com

It crawls some other sites successfully.


Re: Nutch not crawling jabong

Posted by blunderboy <sa...@gmail.com>.
Look at the URL of the product page.
It sits in the same directory as the index.html of jabong.com.

I hope I am clear :)
