Posted to user@nutch.apache.org by Jorge Luis Betancourt González <jl...@uci.cu> on 2015/07/09 14:41:18 UTC
Re: [MASSMAIL]Nutch not fetching HTML content for .com URL

Is the page being fetched? Have you configured anything in any of the filters provided by Nutch? This is the output of a build of the Nutch trunk *without* any configuration changes:

➜  local  bin/nutch parsechecker http://www.tripadvisor.com/Hotels-g187147-Paris_Ile_de_France-Hotels.html
fetching: http://www.tripadvisor.com/Hotels-g187147-Paris_Ile_de_France-Hotels.html
http://www.tripadvisor.com/Hotels-g187147-Paris_Ile_de_France-Hotels.html skipped. Content of size 95272 was truncated to 65536
Content is truncated, parse may fail!
parsing: http://www.tripadvisor.com/Hotels-g187147-Paris_Ile_de_France-Hotels.html
contentType: text/html
signature: 8c9c8eb6cef2414a9c6243dfeb24cac0
---------
Url
---------------

http://www.tripadvisor.com/Hotels-g187147-Paris_Ile_de_France-Hotels.html
---------
ParseData
---------

Version: 5
Status: success(1,0)
Title: 30 Best Paris Hotels on TripAdvisor - Prices & Reviews for the Top Rated Accommodation in Paris, France
Outlinks: 179
  outlink: toUrl: http://static.tacdn.com/favicon.ico anchor:
  outlink: toUrl: http://static.tacdn.com/img2/icons/ta_square.svg anchor:
  outlink: toUrl: http://www.tripadvisor.com/Hotels-g187147-Paris_Ile_de_France-Hotels.html anchor:
  outlink: toUrl: http://www.tripadvisor.co.uk/Hotels-g187147-Paris_Ile_de_France-Hotels.html anchor:
  ...

It seems to work just fine with the .com domain, except for the warning about the truncated content (which, by the way, I also get for the .in domain).
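
If the truncation turns out to matter for your use case, one thing to try is raising the fetch size limit. This is only a sketch: I am assuming the 65536 in the warning above comes from the http.content.limit property, and the 1048576 value below is an arbitrary example, not a recommendation. It would go in conf/nutch-site.xml:

```xml
<!-- conf/nutch-site.xml: overrides the defaults from nutch-default.xml -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- Illustrative value (1 MiB); assumed to be large enough for the
         95272-byte page above. A negative value should disable
         truncation altogether. -->
    <value>1048576</value>
  </property>
</configuration>
```

After changing it, running the same parsechecker command again should show whether the "truncated" warning goes away.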

Hope it helps,

----- Original Message -----
From: "Shilpa Reddy G" <sh...@gmail.com>
To: user@nutch.apache.org
Sent: Thursday, July 9, 2015 7:46:10 AM
Subject: [MASSMAIL]Nutch not fetching HTML content for .com URL

Hi,

I'm new to Nutch. I have an application which takes a set of URLs as input
and gives the HTML source as output, but I'm not able to get the complete
content of the HTML page if my URL contains .com. I'm just getting the
metadata for that URL:

http://www.tripadvisor.com/Hotels-g187147-Paris_Ile_de_France-Hotels.html 

If I give
http://www.tripadvisor.in/Hotels-g187147-Paris_Ile_de_France-Hotels.html as
my input, I get the complete content.

Can anyone tell me what the issue is and how I can solve it? Is it related
to the domain or the locale?

Thanks in advance 


