Posted to user@nutch.apache.org by abhayd <aj...@hotmail.com> on 2011/08/22 07:31:02 UTC

readdblink not showing alllinks

hi
my seed list has 4 links, but the crawl results in a total of 7200 crawled
links. My readlinkdb output just shows the 4 links that are in the seed list,
and the Inlinks: for those 4 seed urls.

Is there any way to get the entire link graph?


thanks

--
View this message in context: http://lucene.472066.n3.nabble.com/readdblink-not-showing-alllinks-tp3274127p3274127.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: readdblink not showing alllinks

Posted by Markus Jelsma <ma...@openindex.io>.
> hi
> after running invertlinks I see the complete link graph... THANKS
> 
> I'm a bit confused, please help me understand.
> 
> I crawl using the crawl command. I see around 7000+ urls when I dump the
> crawldb. Then I run invertlinks and I see the complete link graph.
> After this I run solrindex.
> 
> After solr indexing has completed I see only 2421 docs. I was expecting
> 7000+ docs (i.e. the exact number of unique urls that I got from dumping
> the crawldb as text)

Did you consider URLs that responded with a non-200 HTTP response code? They 
are not sent to the index.
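
One way to check this is to compare the per-status counts in the crawldb with the number of documents in Solr. A minimal sketch, assuming the crawl directory is `crawl` as in the commands quoted in this thread; the `NUTCH` variable and the function name `crawldb_stats` are made up for illustration:

```shell
# Path to the Nutch launcher; override NUTCH if yours lives elsewhere (assumption).
NUTCH=${NUTCH:-bin/nutch}

# Print the per-status URL counts of the crawldb under the given crawl directory.
# URLs that did not come back with HTTP 200 are not sent to the index.
crawldb_stats() {
  "$NUTCH" readdb "$1/crawldb" -stats
}
```

Calling `crawldb_stats crawl` should print one total per status, which can then be compared with the document count reported by Solr.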

> 
> Why do I see just 2421 urls/docs in solr?
> Do I need to run the crawl again after invertlinks?

No :)

> 
> Here are some settings
> --------------------------------------------------------------
>   <name>db.update.max.inlinks</name>
>   <value>10000</value>
> 
>   <name>db.ignore.internal.links</name>
>   <value>false</value>
> 
>   <name>db.max.inlinks</name>
>   <value>10000</value>
> 
>   <name>db.max.outlinks.per.page</name>
>   <value>-1</value>

Re: readdblink not showing alllinks

Posted by abhayd <aj...@hotmail.com>.
hi 
I have started the crawl again. I will post the crawldb output as soon as
the crawl finishes.



Re: readdblink not showing alllinks

Posted by lewis john mcgibbney <le...@gmail.com>.
If you post your crawldb dump, we can see the structure of your crawldb and
may be able to start pinpointing the issue.

You should not need to run another crawl after inverting links for these URLs
to be indexed by the solrindex command... there must be more to it.
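
Producing such a dump and summarising it could look roughly like this. It is only a sketch: the function names and the `NUTCH` variable are hypothetical, and `count_statuses` assumes that the text dump consists of `part-*` files with one `Status: ...` line per URL, which should be checked against an actual dump.

```shell
# Path to the Nutch launcher; override NUTCH if yours lives elsewhere (assumption).
NUTCH=${NUTCH:-bin/nutch}

# Write the crawldb of crawl directory $1 as plain text into output directory $2.
dump_crawldb() {
  "$NUTCH" readdb "$1/crawldb" -dump "$2"
}

# Count records per status in dump directory $1.
# Assumes part-* text files containing one "Status: ..." line per URL.
count_statuses() {
  cat "$1"/part-* |
    awk '/^Status:/ { n[$0]++ } END { for (s in n) print n[s], s }' |
    sort
}
```

For example, `dump_crawldb crawl crawldb_dump` followed by `count_statuses crawldb_dump` gives a short table that is easier to post to the list than the full dump.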

On Tue, Aug 23, 2011 at 6:54 PM, abhayd <aj...@hotmail.com> wrote:

> hi
> after running invertlinks I see the complete link graph... THANKS
>
> I'm a bit confused, please help me understand.
>
> I crawl using the crawl command. I see around 7000+ urls when I dump the
> crawldb.
> Then I run invertlinks and I see the complete link graph.
> After this I run solrindex.
>
> After solr indexing has completed I see only 2421 docs. I was expecting
> 7000+ docs (i.e. the exact number of unique urls that I got from dumping
> the crawldb as text)
>
> Why do I see just 2421 urls/docs in solr?
> Do I need to run the crawl again after invertlinks?
>
> Here are some settings
> --------------------------------------------------------------
>  <name>db.update.max.inlinks</name>
>  <value>10000</value>
>
>  <name>db.ignore.internal.links</name>
>  <value>false</value>
>
>  <name>db.max.inlinks</name>
>  <value>10000</value>
>
>  <name>db.max.outlinks.per.page</name>
>  <value>-1</value>



-- 
*Lewis*

Re: readdblink not showing alllinks

Posted by abhayd <aj...@hotmail.com>.
hi 
after running invertlinks I see the complete link graph... THANKS

I'm a bit confused, please help me understand.

I crawl using the crawl command. I see around 7000+ urls when I dump the crawldb. 
Then I run invertlinks and I see the complete link graph.
After this I run solrindex. 

After solr indexing has completed I see only 2421 docs. I was expecting 7000+
docs (i.e. the exact number of unique urls that I got from dumping the crawldb
as text)

Why do I see just 2421 urls/docs in solr? 
Do I need to run the crawl again after invertlinks?

Here are some settings 
--------------------------------------------------------------
  <name>db.update.max.inlinks</name>
  <value>10000</value>

  <name>db.ignore.internal.links</name>
  <value>false</value>

  <name>db.max.inlinks</name>
  <value>10000</value>

  <name>db.max.outlinks.per.page</name>
  <value>-1</value>



RE: readdblink not showing alllinks

Posted by abhayd <aj...@hotmail.com>.
hi 

Thanks for helping me with this.

After crawling I checked the crawldb. There were 2400 links with status(1) code, and those got into the solr index fine.


thanks



Date: Tue, 23 Aug 2011 01:15:36 -0700
Subject: Re: readdblink not showing alllinks

> There are a number of parameters which limit the number of outlinks for a
> page (IIRC 100 by default) but also the number of inlinks to consider when
> inverting. Have a look at nutch-default.xml and try modifying the values in
> nutch-site.xml

Re: readdblink not showing alllinks

Posted by Julien Nioche <li...@gmail.com>.
There are a number of parameters which limit the number of outlinks for a
page (IIRC 100 by default) but also the number of inlinks to consider when
inverting. Have a look at nutch-default.xml and try modifying the values in
nutch-site.xml
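
To see which of these limits exist in a given install, the defaults file can be searched for property names containing "links" (for example `db.max.inlinks` or `db.max.outlinks.per.page`, both mentioned in this thread). A rough sketch: the path `conf/nutch-default.xml`, the `NUTCH_DEFAULTS` variable and the function name are assumptions, and it relies on each `<value>` sitting on the line directly after its `<name>`.

```shell
# Location of the stock configuration file (assumption; adjust to your install).
NUTCH_DEFAULTS=${NUTCH_DEFAULTS:-conf/nutch-default.xml}

# Show every property whose name contains "links", plus the line after it,
# which normally holds the <value> element.
show_link_limits() {
  grep -E -A1 '<name>[^<]*links[^<]*</name>' "${1:-$NUTCH_DEFAULTS}"
}
```

Any property found this way can then be copied into nutch-site.xml with a different value, as suggested above.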

On 22 August 2011 20:31, abhayd <aj...@hotmail.com> wrote:

> hi
> thx for the response.
>
> I just ran
> bin/crawl urls -dir crawl -depth 20 -threads 10
>
> and then used readlinkdb.
>
> My understanding from the Nutch 1.3 tutorial is that if I use bin/crawl
> (and not the step-by-step approach) I don't have to do any other steps
> for indexing or reading the crawldb.
>
> I am doing this
> -----------------------------------------------------
> 1. bin/crawl urls -dir crawl -depth 20 -threads 10
> 2. bin/nutch solrindex http://localhost:8080/solr/core3 crawl/crawldb crawl/linkdb crawl/segments/*
>
> Is that the correct approach?



-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: readdblink not showing alllinks

Posted by abhayd <aj...@hotmail.com>.
hi 
thx for the response.

I just ran 
bin/crawl urls -dir crawl -depth 20 -threads 10

and then used readlinkdb.

My understanding from the Nutch 1.3 tutorial is that if I use bin/crawl
(and not the step-by-step approach) I don't have to do any other steps
for indexing or reading the crawldb.

I am doing this
-----------------------------------------------------
1. bin/crawl urls -dir crawl -depth 20 -threads 10
2. bin/nutch solrindex http://localhost:8080/solr/core3 crawl/crawldb crawl/linkdb crawl/segments/*

Is that the correct approach?




Re: readdblink not showing alllinks

Posted by Markus Jelsma <ma...@openindex.io>.
Did you run invertlinks?
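
For reference, inverting the links and then dumping the resulting linkdb could be sketched as follows. The directory layout (`crawl/linkdb`, `crawl/segments`) follows the commands quoted in this thread; the `NUTCH` variable, the function name and the output directory are made up for illustration.

```shell
# Path to the Nutch launcher; override NUTCH if yours lives elsewhere (assumption).
NUTCH=${NUTCH:-bin/nutch}

# Build the linkdb from all segments of crawl directory $1, then dump it
# as plain text into output directory $2.
dump_link_graph() {
  "$NUTCH" invertlinks "$1/linkdb" -dir "$1/segments" &&
    "$NUTCH" readlinkdb "$1/linkdb" -dump "$2"
}
```

With the layout above this would be called as `dump_link_graph crawl linkdump`.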

On Monday 22 August 2011 07:31:02 abhayd wrote:
> hi
> my seed list has 4 links, but the crawl results in a total of 7200 crawled
> links. My readlinkdb output just shows the 4 links that are in the seed
> list, and the Inlinks: for those 4 seed urls.
> 
> Is there any way to get the entire link graph?
> 
> 
> thanks

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350