Posted to user@nutch.apache.org by Arjun Kumar Reddy <ch...@iiitb.net> on 2011/01/25 17:16:12 UTC

Regarding crawling of short URLs

Hi,

My application needs to crawl a set of URLs, which I place in the urls
directory, and fetch only the contents of those URLs.
I am not interested in the contents of any internal or external links,
so I have run the crawl command with a depth of 1.

bin/nutch crawl urls -dir crawl -depth 1

Nutch crawls the URLs and gives me their contents.

I am reading the content using the readseg utility:

bin/nutch readseg -dump crawl/segments/* arjun -noparsetext -nofetch
-nogenerate -noparse -noparsedata

This gives me the content of each fetched page.

The problem I am facing is the following. If I give direct URLs like

http://isoc.org/wp/worldipv6day/
http://openhackindia.eventbrite.com
http://www.urlesque.com/2010/06/11/last-shot-ye-olde-twitter/
http://www.readwriteweb.com/archives/place_your_tweets_with_twitter_locations.php
http://bangalore.yahoo.com/labs/summerschool.html
http://riadevcamp.eventbrite.com
http://www.sleepingtime.org/

then I am able to get the contents of the pages.
But when I give a set of short URLs like

http://is.gd/jOoAa9
http://is.gd/ubHRAF
http://is.gd/GiFqj9
http://is.gd/H5rUhg
http://is.gd/wvKINL
http://is.gd/K6jTNl
http://is.gd/mpa6fr
http://is.gd/fmobvj
http://is.gd/s7uZfr

I am not able to fetch the contents.

When I read the segments, no content is shown. Please find below the
contents of the dump file read from the segments.

Recno:: 0
URL:: http://is.gd/0yKjO6

CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Jan 25 20:56:07 IST 2011
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1295969171407

Content::
Version: -1
url: http://is.gd/0yKjO6
base: http://is.gd/0yKjO6
contentType: text/html
metadata: Date=Tue, 25 Jan 2011 15:26:28 GMT nutch.crawl.score=1.0 Location=
http://holykaw.alltop.com/the-twitter-cool-of-a-to-z?tu4=1 _fst_=36
nutch.segment.name=20110125205614 Content-Type=text/html; charset=UTF-8
Connection=close Server=nginx X-Powered-By=PHP/5.2.14
Content:


Recno:: 1
URL:: http://is.gd/1tpKaN

Content::
Version: -1
url: http://is.gd/1tpKaN
base: http://is.gd/1tpKaN
contentType: text/html
metadata: Date=Tue, 25 Jan 2011 15:26:28 GMT nutch.crawl.score=1.0 Location=
http://holykaw.alltop.com/fighting-for-women-who-dont-want-a-voice?tu3=1 _fst_=36
nutch.segment.name=20110125205614 Content-Type=text/html; charset=UTF-8
Connection=close Server=nginx X-Powered-By=PHP/5.2.14
Content:

CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Jan 25 20:56:07 IST 2011
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0


I have also tried setting the http.redirect.max property in
nutch-default.xml to 4, but did not see any progress.
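For reference, here is how I raised that redirect limit (http.redirect.max is the property name in nutch-default.xml; with its default of 0, my understanding is that the fetcher only records redirect targets for a later crawl round instead of following them immediately, which with depth 1 leaves the short URLs unresolved):

```xml
<!-- Override in conf/nutch-site.xml rather than editing
     nutch-default.xml directly. A value above 0 lets the fetcher
     follow up to that many redirects within the same fetch. -->
<property>
  <name>http.redirect.max</name>
  <value>4</value>
</property>
```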
Kindly provide me a solution for this problem.
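Meanwhile, one workaround I am experimenting with is expanding the short URLs myself before putting them in the urls directory, so that the depth-1 crawl fetches the real pages directly. This is only a sketch; the expand_url helper below is my own code, not part of Nutch:

```python
# Sketch: resolve is.gd-style short URLs to their final targets before
# seeding them to Nutch, so a depth-1 crawl fetches the real pages.
import urllib.error
import urllib.parse
import urllib.request


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    # Returning None stops urllib from following 3xx responses, so they
    # surface as HTTPError and we can read the Location header ourselves
    # (the same header visible in the segment dump above).
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def expand_url(url, max_hops=4):
    """Follow up to max_hops HTTP redirects and return the final URL."""
    opener = urllib.request.build_opener(_NoRedirect)
    for _ in range(max_hops):
        try:
            with opener.open(url) as resp:
                return resp.geturl()  # 2xx response: no more redirects
        except urllib.error.HTTPError as err:
            location = err.headers.get("Location")
            if err.code in (301, 302, 303, 307, 308) and location:
                # Location may be relative; resolve it against the
                # current URL before the next hop.
                url = urllib.parse.urljoin(url, location)
            else:
                raise
    return url
```

The expanded URLs can then be written to the seed file in place of the short ones.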

Thanks and regards,
Ch. Arjun Kumar Reddy