Posted to user@nutch.apache.org by abhayd <aj...@hotmail.com> on 2011/08/22 07:31:02 UTC
readdblink not showing alllinks
hi
my seed list has 4 links, but the crawl results in a total of 7200 crawled
links. My readdblink shows just the 4 links which are in the seed list, with
Inlinks: for the 4 seed urls.
Is there any way to get the entire link graph?
thanks
--
View this message in context: http://lucene.472066.n3.nabble.com/readdblink-not-showing-alllinks-tp3274127p3274127.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: readdblink not showing alllinks
Posted by Markus Jelsma <ma...@openindex.io>.
> hi
> After doing invertlinks I see the complete link graph... THANKS
>
> I am a bit confused, please help me understand.
>
> I crawl using the crawl command. I see around 7000+ urls when I dump the
> crawldb. Then I do invertlinks and I see the complete link graph.
> After this I do solrindex.
>
> After solr indexing is completed I see only 2421 docs. I was expecting
> 7000+ docs (i.e. the exact number of unique urls which I got from dumping
> the crawldb as text).
Did you consider URLs that responded with a non-200 HTTP response code? They
are not sent to the index.
>
> Why do I see just 2421 urls/docs in solr?
> Do I need to execute the crawl again after invertlinks?
No :)
>
> Here are some settings
> --------------------------------------------------------------
> <name>db.update.max.inlinks</name>
> <value>10000</value>
>
> <name>db.ignore.internal.links</name>
> <value>false</value>
>
> <name>db.max.inlinks</name>
> <value>10000</value>
>
> <name>db.max.outlinks.per.page</name>
> <value>-1</value>
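One way to check the point about non-200 responses is to tally the status of every entry in a text dump of the crawldb and compare the fetched count with the number of docs in Solr. The following is a minimal sketch, assuming the dump contains lines of the form "Status: 2 (db_fetched)"; that layout, the helper name count_statuses and the three-record sample are assumptions for illustration, not output from this thread.

```python
import re
from collections import Counter

# Matches lines such as "Status: 2 (db_fetched)".
# The exact dump layout is an assumption; adjust the pattern to your dump.
STATUS_LINE = re.compile(r"^Status:\s*(\d+)\s*\((\w+)\)")


def count_statuses(dump_text):
    """Return a Counter mapping status name -> number of crawldb entries."""
    counts = Counter()
    for line in dump_text.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match:
            counts[match.group(2)] += 1
    return counts


# Hypothetical three-record sample, not taken from the thread.
SAMPLE_DUMP = """\
http://example.com/\tVersion: 7
Status: 2 (db_fetched)
Score: 1.0

http://example.com/a\tVersion: 7
Status: 1 (db_unfetched)
Score: 0.5

http://example.com/b\tVersion: 7
Status: 3 (db_gone)
Score: 0.5
"""

if __name__ == "__main__":
    for name, number in sorted(count_statuses(SAMPLE_DUMP).items()):
        print(name, number)
```

If the fetched count is close to the number of docs in Solr, the remaining crawldb entries were most likely never fetched successfully rather than lost during indexing.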
Re: readdblink not showing alllinks
Posted by abhayd <aj...@hotmail.com>.
hi
I have started the crawl again. I will post the crawldb output as soon as the
crawl finishes.
Re: readdblink not showing alllinks
Posted by lewis john mcgibbney <le...@gmail.com>.
If you post your crawldb dump, we can see the structure of your crawldb and
may be able to begin pinpointing the issue.
You should not need to run another crawl after inverting links for these URLs
to be indexed when calling the solrindex command... there must be more to it.
On Tue, Aug 23, 2011 at 6:54 PM, abhayd <aj...@hotmail.com> wrote:
> hi
> After doing invertlinks I see the complete link graph... THANKS
>
> I am a bit confused, please help me understand.
>
> I crawl using the crawl command. I see around 7000+ urls when I dump the
> crawldb.
> Then I do invertlinks and I see the complete link graph.
> After this I do solrindex.
>
> After solr indexing is completed I see only 2421 docs. I was expecting
> 7000+
> docs (i.e. the exact number of unique urls which I got from dumping the
> crawldb as text).
>
> Why do I see just 2421 urls/docs in solr?
> Do I need to execute the crawl again after invertlinks?
>
> Here are some settings
> --------------------------------------------------------------
> <name>db.update.max.inlinks</name>
> <value>10000</value>
>
> <name>db.ignore.internal.links</name>
> <value>false</value>
>
> <name>db.max.inlinks</name>
> <value>10000</value>
>
> <name>db.max.outlinks.per.page</name>
> <value>-1</value>
--
*Lewis*
Re: readdblink not showing alllinks
Posted by abhayd <aj...@hotmail.com>.
hi
After doing invertlinks I see the complete link graph... THANKS
I am a bit confused, please help me understand.
I crawl using the crawl command. I see around 7000+ urls when I dump the crawldb.
Then I do invertlinks and I see the complete link graph.
After this I do solrindex.
After solr indexing is completed I see only 2421 docs. I was expecting 7000+
docs (i.e. the exact number of unique urls which I got from dumping the
crawldb as text).
Why do I see just 2421 urls/docs in solr?
Do I need to execute the crawl again after invertlinks?
Here are some settings
--------------------------------------------------------------
<name>db.update.max.inlinks</name>
<value>10000</value>
<name>db.ignore.internal.links</name>
<value>false</value>
<name>db.max.inlinks</name>
<value>10000</value>
<name>db.max.outlinks.per.page</name>
<value>-1</value>
RE: readdblink not showing alllinks
Posted by abhayd <aj...@hotmail.com>.
hi
Thanks for helping me with this.
After crawling I checked the crawldb. There were 2400 links with status(1) code, and those got into the solr index fine.
thanks
Date: Tue, 23 Aug 2011 01:15:36 -0700
Subject: Re: readdblink not showing alllinks
There are a number of parameters which limit the number of outlinks for a
page (IIRC 100 by default) but also the number of inlinks to consider when
inverting. Have a look at nutch-default.xml and try modifying the values in
nutch-site.xml
Re: readdblink not showing alllinks
Posted by Julien Nioche <li...@gmail.com>.
There are a number of parameters which limit the number of outlinks for a
page (IIRC 100 by default) but also the number of inlinks to consider when
inverting. Have a look at nutch-default.xml and try modifying the values in
nutch-site.xml
On 22 August 2011 20:31, abhayd <aj...@hotmail.com> wrote:
> hi
> thx for the response.
>
> I just ran
> bin/crawl urls -dir crawl -depth 20 -threads 10
>
> and used readdblink.
>
> My understanding from the Nutch 1.3 tutorial is that if I use bin/crawl (and
> not the step-by-step approach) I don't have to do any other steps for
> indexing or reading the crawldb.
>
> I am doing this
> -----------------------------------------------------
> 1. bin/crawl urls -dir crawl -depth 20 -threads 10
> 2. bin/nutch solrindex http://localhost:8080/solr/core3 crawl/crawldb crawl/linkdb crawl/segments/*
>
> Is that the correct approach?
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
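The limits mentioned above are overridden in nutch-site.xml as Hadoop-style property entries (name/value pairs inside a configuration element). As a rough sketch, the snippet below reads such a file and lists the link-related settings. The sample XML reuses the property names and values posted in this thread; the helper name read_properties is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample override file using the property names and values quoted in this
# thread. A real nutch-site.xml may contain many other properties.
SAMPLE_SITE_XML = """\
<configuration>
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>
  <property>
    <name>db.max.inlinks</name>
    <value>10000</value>
  </property>
  <property>
    <name>db.ignore.internal.links</name>
    <value>false</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return a dict of property name -> value from a configuration file."""
    root = ET.fromstring(xml_text)
    properties = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            properties[name.strip()] = (value or "").strip()
    return properties


if __name__ == "__main__":
    for name, value in sorted(read_properties(SAMPLE_SITE_XML).items()):
        print(name, "=", value)
```

Running the same helper over both nutch-default.xml and nutch-site.xml makes it easy to see which defaults are actually being overridden.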
Re: readdblink not showing alllinks
Posted by abhayd <aj...@hotmail.com>.
hi
thx for the response.
I just ran
bin/crawl urls -dir crawl -depth 20 -threads 10
and used readdblink.
My understanding from the Nutch 1.3 tutorial is that if I use bin/crawl (and
not the step-by-step approach) I don't have to do any other steps for
indexing or reading the crawldb.
I am doing this
-----------------------------------------------------
1. bin/crawl urls -dir crawl -depth 20 -threads 10
2. bin/nutch solrindex http://localhost:8080/solr/core3 crawl/crawldb crawl/linkdb crawl/segments/*
Is that the correct approach?
Re: readdblink not showing alllinks
Posted by Markus Jelsma <ma...@openindex.io>.
Did you invertlinks?
On Monday 22 August 2011 07:31:02 abhayd wrote:
> hi
> my seed list has 4 links, but the crawl results in a total of 7200 crawled
> links. My readdblink shows just the 4 links which are in the seed list,
> with Inlinks: for the 4 seed urls.
>
> Is there any way to get the entire link graph?
>
>
> thanks
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
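Once the links have been inverted, a text dump of the linkdb can be turned into a simple in-memory link graph. The sketch below assumes each record starts with a line "<url><TAB>Inlinks:" followed by indented lines of the form "fromUrl: <url> anchor: <text>". That layout, the helper name parse_linkdb_dump and the sample data are assumptions for illustration, so check them against your own dump.

```python
from collections import defaultdict


def parse_linkdb_dump(dump_text):
    """Map each target URL to the list of URLs that link to it."""
    inlinks = defaultdict(list)
    current = None
    for raw in dump_text.splitlines():
        line = raw.strip()
        if line.endswith("Inlinks:"):
            # Header line: "<target url>\tInlinks:"
            current = line[: -len("Inlinks:")].strip()
            inlinks[current]  # register the target even if no inlinks follow
        elif line.startswith("fromUrl:") and current is not None:
            # Entry line: "fromUrl: <source url> anchor: <anchor text>"
            source = line[len("fromUrl:"):].split(" anchor:", 1)[0].strip()
            inlinks[current].append(source)
    return dict(inlinks)


# Hypothetical sample in the assumed dump layout, not taken from the thread.
SAMPLE = """\
http://example.com/a\tInlinks:
 fromUrl: http://example.com/ anchor: A
 fromUrl: http://example.com/b anchor: back to A

http://example.com/b\tInlinks:
 fromUrl: http://example.com/ anchor: B
"""

if __name__ == "__main__":
    for target, sources in parse_linkdb_dump(SAMPLE).items():
        print(target, "<-", ", ".join(sources))
```

If only the seed URLs show up as targets, the linkdb was probably built before the later segments existed, which matches the question about inverting links above.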