Posted to user@nutch.apache.org by ".: Abhishek :." <ab...@gmail.com> on 2011/01/25 03:04:47 UTC

Few questions from a newbie

Hi all,

 I am very new to Nutch and Lucene as well. I have a few questions about
Nutch; I know they are very basic, but I could not get clear-cut answers
out of googling for this. The questions are,

   - If I have to crawl just 5-6 web sites or URLs, should I use an intranet
   crawl or a whole-web crawl?
   - How do I set up recrawls for these same web sites after the first crawl?
   - If I have to search the results via my own Java code, which jar
   files, APIs, or samples should I be looking into?
   - Is there a book on Nutch?

Thanks a bunch for your patience. I appreciate your time.

./Abishek

RE: Few questions from a newbie

Posted by Chris Woolum <cw...@moonvalley.com>.
To use Solr:

bin/nutch solrindex http://127.0.0.1:8080/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

assuming the crawl dir is crawl.
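For context, solrindex is the last step of a whole fetch cycle. A minimal sketch of one full cycle with the step-by-step commands, assuming Nutch 1.x, a seed list in a urls/ directory and the same crawl/ layout as above (exact arguments can differ between versions):

bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
s=`ls -d crawl/segments/* | tail -1`    # newest segment
bin/nutch fetch $s
bin/nutch parse $s                      # only needed if fetcher.parse is false
bin/nutch updatedb crawl/crawldb $s
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch solrindex http://127.0.0.1:8080/solr/ crawl/crawldb crawl/linkdb crawl/segments/*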

________________________________

From: alxsss@aim.com [mailto:alxsss@aim.com]
Sent: Mon 1/24/2011 9:23 PM
To: user@nutch.apache.org
Subject: Re: Few questions from a newbie

How to use solr to index nutch segments?
What is the meaning of db.fetcher.interval? Does this mean that if I run the same crawl command before 30 days it will do nothing?

Thanks.
Alex.

-----Original Message-----
From: Charan K <ch...@gmail.com>
To: user <us...@nutch.apache.org>
Cc: user <us...@nutch.apache.org>
Sent: Mon, Jan 24, 2011 8:24 pm
Subject: Re: Few questions from a newbie

Refer to NutchBean.java for the third question. You can run it from the command line
to test the index.

If you use SOLR indexing, it is going to be much simpler; they have a Solr Java
client..

Sent from my iPhone

On Jan 24, 2011, at 8:07 PM, Amna Waqar <am...@gmail.com> wrote:

> 1,to crawl just 5 to 6 websites,u can use both cases but intranet crawl
> gives u more control and speed
> 2.After the first crawl,the recrawling the same sites time is 30 days by
> default in db.fetcher.interval,you can change it according to ur own
> convenience.
> 3.I ve no idea about the third question
> cz  i m also a newbie
> Best of luck with nutch learning
>
> On Mon, Jan 24, 2011 at 9:04 PM, .: Abhishek :. <ab...@gmail.com> wrote:
>
>> Hi all,
>>
>> I am very new to Nutch and Lucene as well. I am having few questions about
>> Nutch, I know they are very much basic but I could not get clear cut
>> answers
>> out of googling for this. The questions are,
>>
>>  - If I have to crawl just 5-6 web sites or URL's should I use intranet
>>  crawl or whole web crawl.
>>  - How do I set recrawl's for these same web sites after the first crawl.
>>  - If I have to start search the results via my own java code which jar
>>  files or api's or samples should I be looking into.
>>  - Is there a book on Nutch?
>>
>> Thanks a bunch for your patience. I appreciate your time.
>>
>> ./Abishek
>>


Re: Few questions from a newbie

Posted by ".: Abhishek :." <ab...@gmail.com>.
Thanks Julien. I will get the book :)

On Wed, Jan 26, 2011 at 5:09 PM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:

> Tom White's book on Hadoop is a must have for anyone wanting to understand
> how Nutch and Hadoop work. There is a section in it specifically about
> Nutch
> written by Andrzej as well
>
>
> On 26 January 2011 03:02, .: Abhishek :. <ab...@gmail.com> wrote:
>
> > Thanks a bunch Markus.
> >
> > By the way, is there some book or material on Nutch which would help me
> > understanding it better? I  come from an application development
> background
> > and all the crawl n search stuff is *very* new to me :)
> >
> >
> > On Wed, Jan 26, 2011 at 9:48 AM, Markus Jelsma
> > <ma...@openindex.io>wrote:
> >
> > > These values come from the CrawlDB and have the following meaning.
> > >
> > > db_unfetched
> > > This is the number of URL's that are to be crawled when the next batch
> is
> > > started. This number is usually limited with the generate.max.per.host
> > > setting. So, if there are 5000 unfetched and generate.max.per.host is
> set
> > > to
> > > 1000, the next batch will fetch only 1000. Watch, the number of
> unfetched
> > > will
> > > usually not be 5000-1000 because new URL's have been discovered and
> added
> > > to
> > > the CrawlDB.
> > >
> > > db_fetched
> > > These URL's have been fetched. Their next fetch will be
> > > db.fetcher.interval.
> > > But, this is not always the case. There the adaprive schedule algorithm
> > can
> > > tune this number depending on several settings. With these you can tune
> > the
> > > interval when a page is modified or not modified.
> > >
> > > db_gone
> > > HTTP 404 Not Found
> > >
> > > db_redir-temp
> > > HTTP 307 Temporary Redirect
> > >
> > > db_redir_perm
> > > HTTP 301 Moved Permanently
> > >
> > > Code:
> > >
> > >
> >
> http://svn.apache.org/viewvc/nutch/branches/branch-1.2/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup
> > >
> > > Configuration:
> > > http://svn.apache.org/viewvc/nutch/branches/branch-1.2/conf/nutch-default.xml?view=markup
> > >
> > > > Thanks Chris, Charan and Alex.
> > > >
> > > > I am looking into the crawl statistics now. And I see fields like
> > > > db_unfetched, db_fetched, db_gone, db_redir-temp and db_redir_perm,
> > what
> > > do
> > > > they mean?
> > > >
> > > > And, I also see the db_unfetched is way too high than the db_fetched.
> > > Does
> > > > it mean most of the pages did not crawl at all due to some issues?
> > > >
> > > > Thanks again for your time!
> > > >
> > > > On Tue, Jan 25, 2011 at 2:33 PM, charan kumar <
> charan.kumar@gmail.com
> > > >wrote:
> > > > > db.fetcher.interval : It means that URLS which were fetched in the
> > last
> > > > > 30 days <default> will not be fetched. Or A URL is eligible for
> > refetch
> > > > > only after 30 days of last crawl.
> > > > >
> > > > > On Mon, Jan 24, 2011 at 9:23 PM, <al...@aim.com> wrote:
> > > > > > How to use solr to index nutch segments?
> > > > > > What is the meaning of db.fetcher.interval? Does this mean that
> if
> > I
> > > > > > run the same crawl command before 30 days it will do nothing?
> > > > > >
> > > > > > Thanks.
> > > > > > Alex.
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Charan K <ch...@gmail.com>
> > > > > > To: user <us...@nutch.apache.org>
> > > > > > Cc: user <us...@nutch.apache.org>
> > > > > > Sent: Mon, Jan 24, 2011 8:24 pm
> > > > > > Subject: Re: Few questions from a newbie
> > > > > >
> > > > > >
> > > > > > Refer NutchBean.java for the their question. You can run than
> from
> > > > >
> > > > > command
> > > > >
> > > > > > line
> > > > > >
> > > > > > to test the index.
> > > > > >
> > > > > >  If you use SOLR indexing, it is going to be much simpler, they
> > have
> > > a
> > > > >
> > > > > solr
> > > > >
> > > > > > java
> > > > > >
> > > > > > client..
> > > > > >
> > > > > >
> > > > > >
> > > > > > Sent from my iPhone
> > > > > >
> > > > > > On Jan 24, 2011, at 8:07 PM, Amna Waqar <amna.waqar.ee@gmail.com
> >
> > > wrote:
> > > > > > > 1,to crawl just 5 to 6 websites,u can use both cases but
> intranet
> > > > > > > crawl
> > > > > > >
> > > > > > > gives u more control and speed
> > > > > > >
> > > > > > > 2.After the first crawl,the recrawling the same sites time is
> 30
> > > days
> > > > >
> > > > > by
> > > > >
> > > > > > > default in db.fetcher.interval,you can change it according to
> ur
> > > own
> > > > > > >
> > > > > > > convenience.
> > > > > > >
> > > > > > > 3.I ve no idea about the third question
> > > > > > >
> > > > > > > cz  i m also a newbie
> > > > > > >
> > > > > > > Best of luck with nutch learning
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Jan 24, 2011 at 9:04 PM, .: Abhishek :. <
> > ab1sh3k@gmail.com
> > > >
> > > > > >
> > > > > > wrote:
> > > > > > >> Hi all,
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> I am very new to Nutch and Lucene as well. I am having few
> > > questions
> > > > > >
> > > > > > about
> > > > > >
> > > > > > >> Nutch, I know they are very much basic but I could not get
> clear
> > > cut
> > > > > > >>
> > > > > > >> answers
> > > > > > >>
> > > > > > >> out of googling for this. The questions are,
> > > > > > >>
> > > > > > >>  - If I have to crawl just 5-6 web sites or URL's should I use
> > > > >
> > > > > intranet
> > > > >
> > > > > > >>  crawl or whole web crawl.
> > > > > > >>
> > > > > > >>  - How do I set recrawl's for these same web sites after the
> > first
> > > > > >
> > > > > > crawl.
> > > > > >
> > > > > > >>  - If I have to start search the results via my own java code
> > > which
> > > > >
> > > > > jar
> > > > >
> > > > > > >>  files or api's or samples should I be looking into.
> > > > > > >>
> > > > > > >>  - Is there a book on Nutch?
> > > > > > >>
> > > > > > >> Thanks a bunch for your patience. I appreciate your time.
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> ./Abishek
> > >
> >
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>

Re: Few questions from a newbie

Posted by Julien Nioche <li...@gmail.com>.
Tom White's book on Hadoop is a must-have for anyone wanting to understand
how Nutch and Hadoop work. There is also a section in it specifically about Nutch,
written by Andrzej.


On 26 January 2011 03:02, .: Abhishek :. <ab...@gmail.com> wrote:

> Thanks a bunch Markus.
>
> By the way, is there some book or material on Nutch which would help me
> understanding it better? I  come from an application development background
> and all the crawl n search stuff is *very* new to me :)
>
>
> On Wed, Jan 26, 2011 at 9:48 AM, Markus Jelsma
> <ma...@openindex.io>wrote:
>
> > These values come from the CrawlDB and have the following meaning.
> >
> > db_unfetched
> > This is the number of URL's that are to be crawled when the next batch is
> > started. This number is usually limited with the generate.max.per.host
> > setting. So, if there are 5000 unfetched and generate.max.per.host is set
> > to
> > 1000, the next batch will fetch only 1000. Watch, the number of unfetched
> > will
> > usually not be 5000-1000 because new URL's have been discovered and added
> > to
> > the CrawlDB.
> >
> > db_fetched
> > These URL's have been fetched. Their next fetch will be
> > db.fetcher.interval.
> > But, this is not always the case. There the adaprive schedule algorithm
> can
> > tune this number depending on several settings. With these you can tune
> the
> > interval when a page is modified or not modified.
> >
> > db_gone
> > HTTP 404 Not Found
> >
> > db_redir-temp
> > HTTP 307 Temporary Redirect
> >
> > db_redir_perm
> > HTTP 301 Moved Permanently
> >
> > Code:
> >
> >
> http://svn.apache.org/viewvc/nutch/branches/branch-1.2/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup
> >
> > Configuration:
> > http://svn.apache.org/viewvc/nutch/branches/branch-1.2/conf/nutch-default.xml?view=markup
> >
> > > Thanks Chris, Charan and Alex.
> > >
> > > I am looking into the crawl statistics now. And I see fields like
> > > db_unfetched, db_fetched, db_gone, db_redir-temp and db_redir_perm,
> what
> > do
> > > they mean?
> > >
> > > And, I also see the db_unfetched is way too high than the db_fetched.
> > Does
> > > it mean most of the pages did not crawl at all due to some issues?
> > >
> > > Thanks again for your time!
> > >
> > > On Tue, Jan 25, 2011 at 2:33 PM, charan kumar <charan.kumar@gmail.com
> > >wrote:
> > > > db.fetcher.interval : It means that URLS which were fetched in the
> last
> > > > 30 days <default> will not be fetched. Or A URL is eligible for
> refetch
> > > > only after 30 days of last crawl.
> > > >
> > > > On Mon, Jan 24, 2011 at 9:23 PM, <al...@aim.com> wrote:
> > > > > How to use solr to index nutch segments?
> > > > > What is the meaning of db.fetcher.interval? Does this mean that if
> I
> > > > > run the same crawl command before 30 days it will do nothing?
> > > > >
> > > > > Thanks.
> > > > > Alex.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > -----Original Message-----
> > > > > From: Charan K <ch...@gmail.com>
> > > > > To: user <us...@nutch.apache.org>
> > > > > Cc: user <us...@nutch.apache.org>
> > > > > Sent: Mon, Jan 24, 2011 8:24 pm
> > > > > Subject: Re: Few questions from a newbie
> > > > >
> > > > >
> > > > > Refer NutchBean.java for the their question. You can run than from
> > > >
> > > > command
> > > >
> > > > > line
> > > > >
> > > > > to test the index.
> > > > >
> > > > >  If you use SOLR indexing, it is going to be much simpler, they
> have
> > a
> > > >
> > > > solr
> > > >
> > > > > java
> > > > >
> > > > > client..
> > > > >
> > > > >
> > > > >
> > > > > Sent from my iPhone
> > > > >
> > > > > On Jan 24, 2011, at 8:07 PM, Amna Waqar <am...@gmail.com>
> > wrote:
> > > > > > 1,to crawl just 5 to 6 websites,u can use both cases but intranet
> > > > > > crawl
> > > > > >
> > > > > > gives u more control and speed
> > > > > >
> > > > > > 2.After the first crawl,the recrawling the same sites time is 30
> > days
> > > >
> > > > by
> > > >
> > > > > > default in db.fetcher.interval,you can change it according to ur
> > own
> > > > > >
> > > > > > convenience.
> > > > > >
> > > > > > 3.I ve no idea about the third question
> > > > > >
> > > > > > cz  i m also a newbie
> > > > > >
> > > > > > Best of luck with nutch learning
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Mon, Jan 24, 2011 at 9:04 PM, .: Abhishek :. <
> ab1sh3k@gmail.com
> > >
> > > > >
> > > > > wrote:
> > > > > >> Hi all,
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> I am very new to Nutch and Lucene as well. I am having few
> > questions
> > > > >
> > > > > about
> > > > >
> > > > > >> Nutch, I know they are very much basic but I could not get clear
> > cut
> > > > > >>
> > > > > >> answers
> > > > > >>
> > > > > >> out of googling for this. The questions are,
> > > > > >>
> > > > > >>  - If I have to crawl just 5-6 web sites or URL's should I use
> > > >
> > > > intranet
> > > >
> > > > > >>  crawl or whole web crawl.
> > > > > >>
> > > > > >>  - How do I set recrawl's for these same web sites after the
> first
> > > > >
> > > > > crawl.
> > > > >
> > > > > >>  - If I have to start search the results via my own java code
> > which
> > > >
> > > > jar
> > > >
> > > > > >>  files or api's or samples should I be looking into.
> > > > > >>
> > > > > >>  - Is there a book on Nutch?
> > > > > >>
> > > > > >> Thanks a bunch for your patience. I appreciate your time.
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> ./Abishek
> >
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Few questions from a newbie

Posted by Churchill Nanje Mambe <ma...@afrovisiongroup.com>.
Even if the URL being crawled is shortened, it will still lead Nutch to the
actual link, and Nutch will fetch it.

Churchill Nanje Mambe
237 77545907,
AfroVisioN Founder, President,CEO
www.camerborn.com/mambenanje
http://www.afrovisiongroup.com | http://mambenanje.blogspot.com
skypeID: mambenanje
www.twitter.com/mambenanje



On Wed, Jan 26, 2011 at 4:56 PM, Arjun Kumar Reddy <
charjunkumar.reddy@iiitb.net> wrote:

> Yea Hi Mambe,
>
> Thanks for the feedback. I have mentioned the details of my application in
> the above post.
> I have tried doing this crawling job using php-multi curl and I am getting
> results which are good enough but the problem I am facing is that it is
> taking hell lot of time to get the contents of the urls. I have done this
> without using any API or conversions.
>
> So, in order to crawl in lesser time limits and also helps me to scale my
> application, I have chosen Nutch crawler.
>
> Thanks and regards,*
> *Ch. Arjun Kumar Reddy
>
> On Wed, Jan 26, 2011 at 9:19 PM, Churchill Nanje Mambe <
> mambenanje@afrovisiongroup.com> wrote:
>
> > hello
> >  you have to use the short url APIs and get the long URLs... its abit
> > complex as you have to determine the url if its short, then determine the
> > url shortening service used eg: tinyurl.com bit.ly or goo.gl and then
> you
> > use their respective api and send in the url and they will return the
> long
> > url... I used this before but it was a simple php based aggregator and
> not
> > nutch
> >
>

Re: Few questions from a newbie

Posted by al...@aim.com.
You can set the fetching of external (and internal) links to false and increase the depth.
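That switch lives in Nutch's configuration. A sketch of the override in conf/nutch-site.xml, assuming the Nutch 1.x property name (the value shown is illustrative):

<property>
  <name>db.ignore.external.links</name>
  <value>true</value> <!-- ignore outlinks that point to other hosts -->
</property>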

-----Original Message-----
From: Churchill Nanje Mambe <ma...@afrovisiongroup.com>
To: user <us...@nutch.apache.org>
Sent: Wed, Jan 26, 2011 8:03 am
Subject: Re: Few questions from a newbie


even if the url being crawled is shortened, it will still lead nutch to the
actual link and nutch will fetch it

Re: Few questions from a newbie

Posted by Arjun Kumar Reddy <ch...@iiitb.net>.
Yea, hi Mambe,

Thanks for the feedback; I have mentioned the details of my application in
the post above.
I have tried doing this crawling job using php multi-curl and I am getting
results which are good enough, but the problem I am facing is that it is
taking a lot of time to get the contents of the URLs. I have done this
without using any API or conversions.

So, in order to crawl within smaller time limits, and to help scale my
application, I have chosen the Nutch crawler.

Thanks and regards,
Ch. Arjun Kumar Reddy

On Wed, Jan 26, 2011 at 9:19 PM, Churchill Nanje Mambe <
mambenanje@afrovisiongroup.com> wrote:

> hello
>  you have to use the short url APIs and get the long URLs... its abit
> complex as you have to determine the url if its short, then determine the
> url shortening service used eg: tinyurl.com bit.ly or goo.gl and then you
> use their respective api and send in the url and they will return the long
> url... I used this before but it was a simple php based aggregator and not
> nutch
>

Re: Few questions from a newbie

Posted by Churchill Nanje Mambe <ma...@afrovisiongroup.com>.
hello
 you have to use the short-URL APIs to get the long URLs... it is a bit
complex, as you have to determine whether the URL is short, then determine the
URL-shortening service used, e.g. tinyurl.com, bit.ly or goo.gl, and then you
use their respective API and send in the URL, and it will return the long
URL... I used this before, but it was a simple PHP-based aggregator and not
Nutch
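Since most shorteners answer with a plain HTTP redirect, an alternative to the per-service APIs is to simply follow the redirects and keep the final URL. A sketch with curl, reusing one of the is.gd links from this thread:

curl -sIL -o /dev/null -w '%{url_effective}\n' http://is.gd/Jt32Cf

This issues HEAD requests (-I), follows every redirect (-L) and prints the URL that was ultimately reached.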

Re: Re: Few questions from a newbie

Posted by Arjun Kumar Reddy <ch...@iiitb.net>.
Hi Mike,

Actually, in my application I am working on Twitter feeds, where I
am filtering the tweets that come with links, and I am storing the contents of
the links. I am maintaining all such links in the urls file, giving it as an
input to the Nutch crawler. Here, I am not bothered about the inlinks or
outlinks of that particular link.

So, at first I have given the depth as 1 and later on increased it to 3. If I
increase the depth, can I prevent the unwanted crawls? Or is there any other
solution for this?

I have also changed the number-of-redirects configuration parameter to 4 in the
nutch-default.xml file.

Thanks and regards,
Ch. Arjun Kumar Reddy
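The redirect parameter referred to above is presumably http.redirect.max; a sketch of that override, better placed in conf/nutch-site.xml so that nutch-default.xml stays untouched (the value is illustrative):

<property>
  <name>http.redirect.max</name>
  <value>4</value> <!-- follow up to 4 redirects within a single fetch -->
</property>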


On Wed, Jan 26, 2011 at 8:28 PM, Mike Zuehlke <Mi...@zanox.com>wrote:

> Hi Arjun,
>
> nutch handles redirect by itself - like the return codes 301 and 302.
>
> Did you check how much redirects you have to follow until you get
> HTTP_ACCESS (200).
> I think there are four redirects needed to get the given url content. So
> you have to increase the depth for your crawling.
>
> Regards
> Mike
>
>
>
>
> Von:    Arjun Kumar Reddy <ch...@iiitb.net>
> An:     user@nutch.apache.org
> Datum:  26.01.2011 15:43
> Betreff:        Re: Few questions from a newbie
>
>
>
> I am developing an application based on twitter feeds...so 90% of the
> url's
> will be short urls.
> So, it is difficult for me to manually convert all these urls to actual
> urls. Do we have any other solution for this?
>
>
> Thanks and regards,
> Arjun Kumar Reddy
>
>
> On Wed, Jan 26, 2011 at 7:09 PM, Estrada Groups <
> estrada.adam.groups@gmail.com> wrote:
>
> > You probably have to literally click on each URL to get the URL it's
> > referencing. Those are URL shorteners  and probably won't play nicely
> with a
> > crawler because of the redirection.
> >
> > Adam
> >
> > Sent from my iPhone
> >
> > On Jan 26, 2011, at 8:02 AM, Arjun Kumar Reddy <
> > charjunkumar.reddy@iiitb.net> wrote:
> >
> > > Hi list,
> > >
> > > I have given the set of urls as
> > >
> > > http://is.gd/Jt32Cf
> > > http://is.gd/hS3lEJ
> > > http://is.gd/Jy1Im3
> > > http://is.gd/QoJ8xy
> > > http://is.gd/e4ct89
> > > http://is.gd/WAOVmd
> > > http://is.gd/lhkA69
> > > http://is.gd/3OilLD
> > > ..... 43 such urls
> > >
> > > And I have run the crawl command bin/nutch crawl urls/ -dir crawl
> -depth
> > 3
> > >
> > > *arjun@arjun-ninjas:~/nutch$* bin/nutch readdb crawl/crawldb -stats
> > > *CrawlDb statistics start: crawl/crawldb*
> > > *Statistics for CrawlDb: crawl/crawldb*
> > > *TOTAL urls: 43*
> > > *retry 0: 43*
> > > *min score: 1.0*
> > > *avg score: 1.0*
> > > *max score: 1.0*
> > > *status 3 (db_gone): 1*
> > > *status 4 (db_redir_temp): 1*
> > > *status 5 (db_redir_perm): 41*
> > > *CrawlDb statistics: done*
> > >
> > > When I am trying to read the content from the segments, the content
> block
> > is
> > > empty for every record.
> > >
> > > Can you please tell me where I can get the content of these urls.
> > >
> > > Thanks and regards,*
> > > *Arjun Kumar Reddy
> >
>
>
>
>
>
>

Antwort: Re: Few questions from a newbie

Posted by Mike Zuehlke <Mi...@zanox.com>.
Hi Arjun,

nutch handles redirects by itself - like the return codes 301 and 302.

Did you check how many redirects have to be followed until you get
HTTP_ACCESS (200)?
I think there are four redirects needed to get the given url content. So
you have to increase the depth for your crawling.

Regards
Mike




Von:    Arjun Kumar Reddy <ch...@iiitb.net>
An:     user@nutch.apache.org
Datum:  26.01.2011 15:43
Betreff:        Re: Few questions from a newbie



I am developing an application based on twitter feeds...so 90% of the 
url's
will be short urls.
So, it is difficult for me to manually convert all these urls to actual
urls. Do we have any other solution for this?


Thanks and regards,
Arjun Kumar Reddy


On Wed, Jan 26, 2011 at 7:09 PM, Estrada Groups <
estrada.adam.groups@gmail.com> wrote:

> You probably have to literally click on each URL to get the URL it's
> referencing. Those are URL shorteners  and probably won't play nicely 
with a
> crawler because of the redirection.
>
> Adam
>
> Sent from my iPhone
>
> On Jan 26, 2011, at 8:02 AM, Arjun Kumar Reddy <
> charjunkumar.reddy@iiitb.net> wrote:
>
> > Hi list,
> >
> > I have given the set of urls as
> >
> > http://is.gd/Jt32Cf
> > http://is.gd/hS3lEJ
> > http://is.gd/Jy1Im3
> > http://is.gd/QoJ8xy
> > http://is.gd/e4ct89
> > http://is.gd/WAOVmd
> > http://is.gd/lhkA69
> > http://is.gd/3OilLD
> > ..... 43 such urls
> >
> > And I have run the crawl command bin/nutch crawl urls/ -dir crawl 
-depth
> 3
> >
> > *arjun@arjun-ninjas:~/nutch$* bin/nutch readdb crawl/crawldb -stats
> > *CrawlDb statistics start: crawl/crawldb*
> > *Statistics for CrawlDb: crawl/crawldb*
> > *TOTAL urls: 43*
> > *retry 0: 43*
> > *min score: 1.0*
> > *avg score: 1.0*
> > *max score: 1.0*
> > *status 3 (db_gone): 1*
> > *status 4 (db_redir_temp): 1*
> > *status 5 (db_redir_perm): 41*
> > *CrawlDb statistics: done*
> >
> > When I am trying to read the content from the segments, the content 
block
> is
> > empty for every record.
> >
> > Can you please tell me where I can get the content of these urls.
> >
> > Thanks and regards,*
> > *Arjun Kumar Reddy
>







Re: Few questions from a newbie

Posted by Arjun Kumar Reddy <ch...@iiitb.net>.
I am developing an application based on Twitter feeds... so 90% of the URLs
will be short URLs.
So, it is difficult for me to manually convert all these URLs to actual
URLs. Do we have any other solution for this?


Thanks and regards,
Arjun Kumar Reddy


On Wed, Jan 26, 2011 at 7:09 PM, Estrada Groups <
estrada.adam.groups@gmail.com> wrote:

> You probably have to literally click on each URL to get the URL it's
> referencing. Those are URL shorteners  and probably won't play nicely with a
> crawler because of the redirection.
>
> Adam
>
> Sent from my iPhone
>
> On Jan 26, 2011, at 8:02 AM, Arjun Kumar Reddy <
> charjunkumar.reddy@iiitb.net> wrote:
>
> > Hi list,
> >
> > I have given the set of urls as
> >
> > http://is.gd/Jt32Cf
> > http://is.gd/hS3lEJ
> > http://is.gd/Jy1Im3
> > http://is.gd/QoJ8xy
> > http://is.gd/e4ct89
> > http://is.gd/WAOVmd
> > http://is.gd/lhkA69
> > http://is.gd/3OilLD
> > ..... 43 such urls
> >
> > And I have run the crawl command bin/nutch crawl urls/ -dir crawl -depth
> 3
> >
> > *arjun@arjun-ninjas:~/nutch$* bin/nutch readdb crawl/crawldb -stats
> > *CrawlDb statistics start: crawl/crawldb*
> > *Statistics for CrawlDb: crawl/crawldb*
> > *TOTAL urls: 43*
> > *retry 0: 43*
> > *min score: 1.0*
> > *avg score: 1.0*
> > *max score: 1.0*
> > *status 3 (db_gone): 1*
> > *status 4 (db_redir_temp): 1*
> > *status 5 (db_redir_perm): 41*
> > *CrawlDb statistics: done*
> >
> > When I am trying to read the content from the segments, the content block
> is
> > empty for every record.
> >
> > Can you please tell me where I can get the content of these urls.
> >
> > Thanks and regards,*
> > *Arjun Kumar Reddy
>

Re: Few questions from a newbie

Posted by Estrada Groups <es...@gmail.com>.
You probably have to literally click on each URL to get the URL it's referencing. Those are URL shorteners and probably won't play nicely with a crawler because of the redirection.

Adam

Sent from my iPhone

On Jan 26, 2011, at 8:02 AM, Arjun Kumar Reddy <ch...@iiitb.net> wrote:

> Hi list,
> 
> I have given the set of urls as
> 
> http://is.gd/Jt32Cf
> http://is.gd/hS3lEJ
> http://is.gd/Jy1Im3
> http://is.gd/QoJ8xy
> http://is.gd/e4ct89
> http://is.gd/WAOVmd
> http://is.gd/lhkA69
> http://is.gd/3OilLD
> ..... 43 such urls
> 
> And I have run the crawl command bin/nutch crawl urls/ -dir crawl -depth 3
> 
> *arjun@arjun-ninjas:~/nutch$* bin/nutch readdb crawl/crawldb -stats
> *CrawlDb statistics start: crawl/crawldb*
> *Statistics for CrawlDb: crawl/crawldb*
> *TOTAL urls: 43*
> *retry 0: 43*
> *min score: 1.0*
> *avg score: 1.0*
> *max score: 1.0*
> *status 3 (db_gone): 1*
> *status 4 (db_redir_temp): 1*
> *status 5 (db_redir_perm): 41*
> *CrawlDb statistics: done*
> 
> When I am trying to read the content from the segments, the content block is
> empty for every record.
> 
> Can you please tell me where I can get the content of these urls.
> 
> Thanks and regards,*
> *Arjun Kumar Reddy

Re: Few questions from a newbie

Posted by Arjun Kumar Reddy <ch...@iiitb.net>.
Hi list,

I have given the set of urls as

http://is.gd/Jt32Cf
http://is.gd/hS3lEJ
http://is.gd/Jy1Im3
http://is.gd/QoJ8xy
http://is.gd/e4ct89
http://is.gd/WAOVmd
http://is.gd/lhkA69
http://is.gd/3OilLD
..... 43 such urls

And I have run the crawl command bin/nutch crawl urls/ -dir crawl -depth 3

arjun@arjun-ninjas:~/nutch$ bin/nutch readdb crawl/crawldb -stats
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 43
retry 0: 43
min score: 1.0
avg score: 1.0
max score: 1.0
status 3 (db_gone): 1
status 4 (db_redir_temp): 1
status 5 (db_redir_perm): 41
CrawlDb statistics: done

When I am trying to read the content from the segments, the content block is
empty for every record.

Can you please tell me where I can get the content of these URLs?

Thanks and regards,
Arjun Kumar Reddy
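Since 41 of the 43 URLs ended up as db_redir_perm, the segments will mostly hold redirect records rather than page content, which would explain the empty content blocks. One way to see what a segment really contains is the readseg tool; a sketch (the segment timestamp is illustrative):

bin/nutch readseg -dump crawl/segments/20110126120000 segdump
less segdump/dump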

RE: Few questions from a newbie

Posted by "McGibbney, Lewis John" <Le...@gcu.ac.uk>.
I can only speak for myself, but I think that reading up on 'search', e.g. Lucene, is really the first stop prior to engaging with the crawling stuff. There are publications out there dealing with building search applications, but these contain only small sections on web crawlers, and the code examples are fairly dated now.

Hope this helps

________________________________________
From: .: Abhishek :. [ab1sh3k@gmail.com]
Sent: 26 January 2011 03:02
To: markus.jelsma@openindex.io
Cc: user@nutch.apache.org
Subject: Re: Few questions from a newbie

Thanks a bunch Markus.

By the way, is there some book or material on Nutch which would help me
understanding it better? I  come from an application development background
and all the crawl n search stuff is *very* new to me :)


On Wed, Jan 26, 2011 at 9:48 AM, Markus Jelsma
<ma...@openindex.io>wrote:

> These values come from the CrawlDB and have the following meaning.
>
> db_unfetched
> This is the number of URL's that are to be crawled when the next batch is
> started. This number is usually limited with the generate.max.per.host
> setting. So, if there are 5000 unfetched and generate.max.per.host is set
> to
> 1000, the next batch will fetch only 1000. Watch, the number of unfetched
> will
> usually not be 5000-1000 because new URL's have been discovered and added
> to
> the CrawlDB.
>
> db_fetched
> These URL's have been fetched. Their next fetch will be
> db.fetcher.interval.
> But, this is not always the case. There the adaprive schedule algorithm can
> tune this number depending on several settings. With these you can tune the
> interval when a page is modified or not modified.
>
> db_gone
> HTTP 404 Not Found
>
> db_redir-temp
> HTTP 307 Temporary Redirect
>
> db_redir_perm
> HTTP 301 Moved Permanently
>
> Code:
>
> http://svn.apache.org/viewvc/nutch/branches/branch-1.2/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup
>
> Configuration:
> http://svn.apache.org/viewvc/nutch/branches/branch-1.2/conf/nutch-default.xml?view=markup
>
> > Thanks Chris, Charan and Alex.
> >
> > I am looking into the crawl statistics now. And I see fields like
> > db_unfetched, db_fetched, db_gone, db_redir-temp and db_redir_perm, what
> do
> > they mean?
> >
> > And, I also see the db_unfetched is way too high than the db_fetched.
> Does
> > it mean most of the pages did not crawl at all due to some issues?
> >
> > Thanks again for your time!
> >
> > On Tue, Jan 25, 2011 at 2:33 PM, charan kumar <charan.kumar@gmail.com
> >wrote:
> > > db.fetcher.interval : It means that URLS which were fetched in the last
> > > 30 days <default> will not be fetched. Or A URL is eligible for refetch
> > > only after 30 days of last crawl.
> > >
> > > On Mon, Jan 24, 2011 at 9:23 PM, <al...@aim.com> wrote:
> > > > How to use solr to index nutch segments?
> > > > What is the meaning of db.fetcher.interval? Does this mean that if I
> > > > run the same crawl command before 30 days it will do nothing?
> > > >
> > > > Thanks.
> > > > Alex.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: Charan K <ch...@gmail.com>
> > > > To: user <us...@nutch.apache.org>
> > > > Cc: user <us...@nutch.apache.org>
> > > > Sent: Mon, Jan 24, 2011 8:24 pm
> > > > Subject: Re: Few questions from a newbie
> > > >
> > > >
> > > > Refer NutchBean.java for the their question. You can run than from
> > >
> > > command
> > >
> > > > line
> > > >
> > > > to test the index.
> > > >
> > > >  If you use SOLR indexing, it is going to be much simpler, they have
> a
> > >
> > > solr
> > >
> > > > java
> > > >
> > > > client..
> > > >
> > > >
> > > >
> > > > Sent from my iPhone
> > > >
> > > > On Jan 24, 2011, at 8:07 PM, Amna Waqar <am...@gmail.com>
> wrote:
> > > > > 1,to crawl just 5 to 6 websites,u can use both cases but intranet
> > > > > crawl
> > > > >
> > > > > gives u more control and speed
> > > > >
> > > > > 2.After the first crawl,the recrawling the same sites time is 30
> days
> > >
> > > by
> > >
> > > > > default in db.fetcher.interval,you can change it according to ur
> own
> > > > >
> > > > > convenience.
> > > > >
> > > > > 3.I ve no idea about the third question
> > > > >
> > > > > cz  i m also a newbie
> > > > >
> > > > > Best of luck with nutch learning
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Jan 24, 2011 at 9:04 PM, .: Abhishek :. <ab1sh3k@gmail.com
> >
> > > >
> > > > wrote:
> > > > >> Hi all,
> > > > >>
> > > > >>
> > > > >>
> > > > >> I am very new to Nutch and Lucene as well. I am having few
> questions
> > > >
> > > > about
> > > >
> > > > >> Nutch, I know they are very much basic but I could not get clear
> cut
> > > > >>
> > > > >> answers
> > > > >>
> > > > >> out of googling for this. The questions are,
> > > > >>
> > > > >>  - If I have to crawl just 5-6 web sites or URL's should I use
> > >
> > > intranet
> > >
> > > > >>  crawl or whole web crawl.
> > > > >>
> > > > >>  - How do I set recrawl's for these same web sites after the first
> > > >
> > > > crawl.
> > > >
> > > > >>  - If I have to start search the results via my own java code
> which
> > >
> > > jar
> > >
> > > > >>  files or api's or samples should I be looking into.
> > > > >>
> > > > >>  - Is there a book on Nutch?
> > > > >>
> > > > >> Thanks a bunch for your patience. I appreciate your time.
> > > > >>
> > > > >>
> > > > >>
> > > > >> ./Abishek
>


Re: Few questions from a newbie

Posted by ".: Abhishek :." <ab...@gmail.com>.
Thanks a bunch Markus.

By the way, is there some book or material on Nutch which would help me
understand it better? I come from an application development background
and all the crawl and search stuff is *very* new to me :)


On Wed, Jan 26, 2011 at 9:48 AM, Markus Jelsma
<ma...@openindex.io>wrote:

> These values come from the CrawlDB and have the following meaning.
>
> db_unfetched
> This is the number of URL's that are to be crawled when the next batch is
> started. This number is usually limited with the generate.max.per.host
> setting. So, if there are 5000 unfetched and generate.max.per.host is set
> to
> 1000, the next batch will fetch only 1000. Watch, the number of unfetched
> will
> usually not be 5000-1000 because new URL's have been discovered and added
> to
> the CrawlDB.
>
> db_fetched
> These URL's have been fetched. Their next fetch will be
> db.fetcher.interval.
> But, this is not always the case. There the adaprive schedule algorithm can
> tune this number depending on several settings. With these you can tune the
> interval when a page is modified or not modified.
>
> db_gone
> HTTP 404 Not Found
>
> db_redir-temp
> HTTP 307 Temporary Redirect
>
> db_redir_perm
> HTTP 301 Moved Permanently
>
> Code:
>
> http://svn.apache.org/viewvc/nutch/branches/branch-1.2/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup
>
> Configuration:
> http://svn.apache.org/viewvc/nutch/branches/branch-1.2/conf/nutch-default.xml?view=markup
>
> > Thanks Chris, Charan and Alex.
> >
> > I am looking into the crawl statistics now. And I see fields like
> > db_unfetched, db_fetched, db_gone, db_redir-temp and db_redir_perm, what
> do
> > they mean?
> >
> > And, I also see the db_unfetched is way too high than the db_fetched.
> Does
> > it mean most of the pages did not crawl at all due to some issues?
> >
> > Thanks again for your time!
> >
> > On Tue, Jan 25, 2011 at 2:33 PM, charan kumar <charan.kumar@gmail.com
> >wrote:
> > > db.fetcher.interval : It means that URLS which were fetched in the last
> > > 30 days <default> will not be fetched. Or A URL is eligible for refetch
> > > only after 30 days of last crawl.
> > >
> > > On Mon, Jan 24, 2011 at 9:23 PM, <al...@aim.com> wrote:
> > > > How to use solr to index nutch segments?
> > > > What is the meaning of db.fetcher.interval? Does this mean that if I
> > > > run the same crawl command before 30 days it will do nothing?
> > > >
> > > > Thanks.
> > > > Alex.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: Charan K <ch...@gmail.com>
> > > > To: user <us...@nutch.apache.org>
> > > > Cc: user <us...@nutch.apache.org>
> > > > Sent: Mon, Jan 24, 2011 8:24 pm
> > > > Subject: Re: Few questions from a newbie
> > > >
> > > >
> > > > Refer NutchBean.java for the their question. You can run than from
> > >
> > > command
> > >
> > > > line
> > > >
> > > > to test the index.
> > > >
> > > >  If you use SOLR indexing, it is going to be much simpler, they have
> a
> > >
> > > solr
> > >
> > > > java
> > > >
> > > > client..
> > > >
> > > >
> > > >
> > > > Sent from my iPhone
> > > >
> > > > On Jan 24, 2011, at 8:07 PM, Amna Waqar <am...@gmail.com>
> wrote:
> > > > > 1,to crawl just 5 to 6 websites,u can use both cases but intranet
> > > > > crawl
> > > > >
> > > > > gives u more control and speed
> > > > >
> > > > > 2.After the first crawl,the recrawling the same sites time is 30
> days
> > >
> > > by
> > >
> > > > > default in db.fetcher.interval,you can change it according to ur
> own
> > > > >
> > > > > convenience.
> > > > >
> > > > > 3.I ve no idea about the third question
> > > > >
> > > > > cz  i m also a newbie
> > > > >
> > > > > Best of luck with nutch learning
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Jan 24, 2011 at 9:04 PM, .: Abhishek :. <ab1sh3k@gmail.com
> >
> > > >
> > > > wrote:
> > > > >> Hi all,
> > > > >>
> > > > >>
> > > > >>
> > > > >> I am very new to Nutch and Lucene as well. I am having few
> questions
> > > >
> > > > about
> > > >
> > > > >> Nutch, I know they are very much basic but I could not get clear
> cut
> > > > >>
> > > > >> answers
> > > > >>
> > > > >> out of googling for this. The questions are,
> > > > >>
> > > > >>  - If I have to crawl just 5-6 web sites or URL's should I use
> > >
> > > intranet
> > >
> > > > >>  crawl or whole web crawl.
> > > > >>
> > > > >>  - How do I set recrawl's for these same web sites after the first
> > > >
> > > > crawl.
> > > >
> > > > >>  - If I have to start search the results via my own java code
> which
> > >
> > > jar
> > >
> > > > >>  files or api's or samples should I be looking into.
> > > > >>
> > > > >>  - Is there a book on Nutch?
> > > > >>
> > > > >> Thanks a bunch for your patience. I appreciate your time.
> > > > >>
> > > > >>
> > > > >>
> > > > >> ./Abishek
>

Re: Few questions from a newbie

Posted by Markus Jelsma <ma...@openindex.io>.
These values come from the CrawlDB and have the following meaning.

db_unfetched
This is the number of URL's that are still to be crawled when the next batch is
started. This number is usually limited by the generate.max.per.host
setting. So, if there are 5000 unfetched URL's and generate.max.per.host is set to
1000, the next batch will fetch only 1000. Note: the number of unfetched will
usually not end up at 5000-1000, because new URL's are discovered and added to
the CrawlDB in the meantime.

db_fetched
These URL's have been fetched. Their next fetch will be after db.fetcher.interval.
But this is not always the case: the adaptive schedule algorithm can
tune this interval depending on several settings. With these you can tune the
interval depending on whether a page is modified or not.

db_gone
HTTP 404 Not Found

db_redir-temp
HTTP 307 Temporary Redirect

db_redir_perm
HTTP 301 Moved Permanently

Code:
http://svn.apache.org/viewvc/nutch/branches/branch-1.2/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup

Configuration:
http://svn.apache.org/viewvc/nutch/branches/branch-1.2/conf/nutch-default.xml?view=markup
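For illustration, generate.max.per.host would be overridden in conf/nutch-site.xml roughly like this (the value is an example, not a recommendation):

<property>
  <name>generate.max.per.host</name>
  <value>1000</value> <!-- cap on URL's per host in a single generated batch -->
</property>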

> Thanks Chris, Charan and Alex.
> 
> I am looking into the crawl statistics now. And I see fields like
> db_unfetched, db_fetched, db_gone, db_redir-temp and db_redir_perm, what do
> they mean?
> 
> And, I also see the db_unfetched is way too high than the db_fetched. Does
> it mean most of the pages did not crawl at all due to some issues?
> 
> Thanks again for your time!
> 
> On Tue, Jan 25, 2011 at 2:33 PM, charan kumar <ch...@gmail.com>wrote:
> > db.fetcher.interval : It means that URLS which were fetched in the last
> > 30 days <default> will not be fetched. Or A URL is eligible for refetch
> > only after 30 days of last crawl.
> > 
> > On Mon, Jan 24, 2011 at 9:23 PM, <al...@aim.com> wrote:
> > > How to use solr to index nutch segments?
> > > What is the meaning of db.fetcher.interval? Does this mean that if I
> > > run the same crawl command before 30 days it will do nothing?
> > > 
> > > Thanks.
> > > Alex.
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > -----Original Message-----
> > > From: Charan K <ch...@gmail.com>
> > > To: user <us...@nutch.apache.org>
> > > Cc: user <us...@nutch.apache.org>
> > > Sent: Mon, Jan 24, 2011 8:24 pm
> > > Subject: Re: Few questions from a newbie
> > > 
> > > 
> > > Refer NutchBean.java for the their question. You can run than from
> > 
> > command
> > 
> > > line
> > > 
> > > to test the index.
> > > 
> > >  If you use SOLR indexing, it is going to be much simpler, they have a
> > 
> > solr
> > 
> > > java
> > > 
> > > client..
> > > 
> > > 
> > > 
> > > Sent from my iPhone
> > > 
> > > On Jan 24, 2011, at 8:07 PM, Amna Waqar <am...@gmail.com> wrote:
> > > > 1,to crawl just 5 to 6 websites,u can use both cases but intranet
> > > > crawl
> > > > 
> > > > gives u more control and speed
> > > > 
> > > > 2.After the first crawl,the recrawling the same sites time is 30 days
> > 
> > by
> > 
> > > > default in db.fetcher.interval,you can change it according to ur own
> > > > 
> > > > convenience.
> > > > 
> > > > 3.I ve no idea about the third question
> > > > 
> > > > cz  i m also a newbie
> > > > 
> > > > Best of luck with nutch learning
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > On Mon, Jan 24, 2011 at 9:04 PM, .: Abhishek :. <ab...@gmail.com>
> > > 
> > > wrote:
> > > >> Hi all,
> > > >> 
> > > >> 
> > > >> 
> > > >> I am very new to Nutch and Lucene as well. I am having few questions
> > > 
> > > about
> > > 
> > > >> Nutch, I know they are very much basic but I could not get clear cut
> > > >> 
> > > >> answers
> > > >> 
> > > >> out of googling for this. The questions are,
> > > >> 
> > > >>  - If I have to crawl just 5-6 web sites or URL's should I use
> > 
> > intranet
> > 
> > > >>  crawl or whole web crawl.
> > > >>  
> > > >>  - How do I set recrawl's for these same web sites after the first
> > > 
> > > crawl.
> > > 
> > > >>  - If I have to start search the results via my own java code which
> > 
> > jar
> > 
> > > >>  files or api's or samples should I be looking into.
> > > >>  
> > > >>  - Is there a book on Nutch?
> > > >> 
> > > >> Thanks a bunch for your patience. I appreciate your time.
> > > >> 
> > > >> 
> > > >> 
> > > >> ./Abishek

Re: Few questions from a newbie

Posted by ".: Abhishek :." <ab...@gmail.com>.
Thanks Chris, Charan and Alex.

I am looking into the crawl statistics now, and I see fields like
db_unfetched, db_fetched, db_gone, db_redir_temp and db_redir_perm; what do
they mean?

And I also see that db_unfetched is way higher than db_fetched. Does
this mean most of the pages were not crawled at all due to some issues?

Thanks again for your time!


On Tue, Jan 25, 2011 at 2:33 PM, charan kumar <ch...@gmail.com>wrote:

> db.fetcher.interval : It means that URLS which were fetched in the last 30
> days <default> will not be fetched. Or A URL is eligible for refetch only
> after 30 days of last crawl.
>
>
> On Mon, Jan 24, 2011 at 9:23 PM, <al...@aim.com> wrote:
>
> > How to use solr to index nutch segments?
> > What is the meaning of db.fetcher.interval? Does this mean that if I run
> > the same crawl command before 30 days it will do nothing?
> >
> > Thanks.
> > Alex.
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > -----Original Message-----
> > From: Charan K <ch...@gmail.com>
> > To: user <us...@nutch.apache.org>
> > Cc: user <us...@nutch.apache.org>
> > Sent: Mon, Jan 24, 2011 8:24 pm
> > Subject: Re: Few questions from a newbie
> >
> >
> > Refer NutchBean.java for the their question. You can run than from
> command
> > line
> >
> > to test the index.
> >
> >
> >
> >  If you use SOLR indexing, it is going to be much simpler, they have a
> solr
> > java
> >
> > client..
> >
> >
> >
> > Sent from my iPhone
> >
> >
> >
> > On Jan 24, 2011, at 8:07 PM, Amna Waqar <am...@gmail.com> wrote:
> >
> >
> >
> > > 1,to crawl just 5 to 6 websites,u can use both cases but intranet crawl
> >
> > > gives u more control and speed
> >
> > > 2.After the first crawl,the recrawling the same sites time is 30 days
> by
> >
> > > default in db.fetcher.interval,you can change it according to ur own
> >
> > > convenience.
> >
> > > 3.I ve no idea about the third question
> >
> > > cz  i m also a newbie
> >
> > > Best of luck with nutch learning
> >
> > >
> >
> > >
> >
> > > On Mon, Jan 24, 2011 at 9:04 PM, .: Abhishek :. <ab...@gmail.com>
> > wrote:
> >
> > >
> >
> > >> Hi all,
> >
> > >>
> >
> > >> I am very new to Nutch and Lucene as well. I am having few questions
> > about
> >
> > >> Nutch, I know they are very much basic but I could not get clear cut
> >
> > >> answers
> >
> > >> out of googling for this. The questions are,
> >
> > >>
> >
> > >>  - If I have to crawl just 5-6 web sites or URL's should I use
> intranet
> >
> > >>  crawl or whole web crawl.
> >
> > >>  - How do I set recrawl's for these same web sites after the first
> > crawl.
> >
> > >>  - If I have to start search the results via my own java code which
> jar
> >
> > >>  files or api's or samples should I be looking into.
> >
> > >>  - Is there a book on Nutch?
> >
> > >>
> >
> > >> Thanks a bunch for your patience. I appreciate your time.
> >
> > >>
> >
> > >> ./Abishek
> >
> > >>
> >
> >
> >
> >
> >
> >
>

Re: Few questions from a newbie

Posted by charan kumar <ch...@gmail.com>.
db.fetcher.interval: it means that URLs which were fetched in the last 30
days (the default) will not be fetched again. In other words, a URL is eligible for
refetch only 30 days after its last crawl.
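As a sketch of changing that interval, assuming Nutch 1.x where the setting appears in the configuration as db.fetcher.interval.default and is expressed in seconds:

<property>
  <name>db.fetcher.interval.default</name>
  <value>2592000</value> <!-- 30 days; lower it to recrawl more often -->
</property>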


On Mon, Jan 24, 2011 at 9:23 PM, <al...@aim.com> wrote:

> How to use solr to index nutch segments?
> What is the meaning of db.fetcher.interval? Does this mean that if I run
> the same crawl command before 30 days it will do nothing?
>
> Thanks.
> Alex.
>
>
>
>
>
>
>
>
>
>
> -----Original Message-----
> From: Charan K <ch...@gmail.com>
> To: user <us...@nutch.apache.org>
> Cc: user <us...@nutch.apache.org>
> Sent: Mon, Jan 24, 2011 8:24 pm
> Subject: Re: Few questions from a newbie
>
>
> Refer NutchBean.java for the their question. You can run than from command
> line
>
> to test the index.
>
>
>
>  If you use SOLR indexing, it is going to be much simpler, they have a solr
> java
>
> client..
>
>
>
> Sent from my iPhone
>
>
>
> On Jan 24, 2011, at 8:07 PM, Amna Waqar <am...@gmail.com> wrote:
>
>
>
> > 1,to crawl just 5 to 6 websites,u can use both cases but intranet crawl
>
> > gives u more control and speed
>
> > 2.After the first crawl,the recrawling the same sites time is 30 days by
>
> > default in db.fetcher.interval,you can change it according to ur own
>
> > convenience.
>
> > 3.I ve no idea about the third question
>
> > cz  i m also a newbie
>
> > Best of luck with nutch learning
>
> >
>
> >
>
> > On Mon, Jan 24, 2011 at 9:04 PM, .: Abhishek :. <ab...@gmail.com>
> wrote:
>
> >
>
> >> Hi all,
>
> >>
>
> >> I am very new to Nutch and Lucene as well. I am having few questions
> about
>
> >> Nutch, I know they are very much basic but I could not get clear cut
>
> >> answers
>
> >> out of googling for this. The questions are,
>
> >>
>
> >>  - If I have to crawl just 5-6 web sites or URL's should I use intranet
>
> >>  crawl or whole web crawl.
>
> >>  - How do I set recrawl's for these same web sites after the first
> crawl.
>
> >>  - If I have to start search the results via my own java code which jar
>
> >>  files or api's or samples should I be looking into.
>
> >>  - Is there a book on Nutch?
>
> >>
>
> >> Thanks a bunch for your patience. I appreciate your time.
>
> >>
>
> >> ./Abishek
>
> >>
>
>
>
>
>
>

Re: Few questions from a newbie

Posted by al...@aim.com.
How to use solr to index nutch segments?
What is the meaning of db.fetcher.interval? Does this mean that if I run the same crawl command before 30 days it will do nothing?

Thanks.
Alex.

-----Original Message-----
From: Charan K <ch...@gmail.com>
To: user <us...@nutch.apache.org>
Cc: user <us...@nutch.apache.org>
Sent: Mon, Jan 24, 2011 8:24 pm
Subject: Re: Few questions from a newbie


Refer to NutchBean.java for the third question. You can run it from the command line
to test the index.

If you use SOLR indexing, it is going to be much simpler; they have a Solr Java
client..

Sent from my iPhone

On Jan 24, 2011, at 8:07 PM, Amna Waqar <am...@gmail.com> wrote:

> 1,to crawl just 5 to 6 websites,u can use both cases but intranet crawl
> gives u more control and speed
> 2.After the first crawl,the recrawling the same sites time is 30 days by
> default in db.fetcher.interval,you can change it according to ur own
> convenience.
> 3.I ve no idea about the third question
> cz  i m also a newbie
> Best of luck with nutch learning
>
> On Mon, Jan 24, 2011 at 9:04 PM, .: Abhishek :. <ab...@gmail.com> wrote:
>
>> Hi all,
>>
>> I am very new to Nutch and Lucene as well. I am having few questions about
>> Nutch, I know they are very much basic but I could not get clear cut
>> answers
>> out of googling for this. The questions are,
>>
>>  - If I have to crawl just 5-6 web sites or URL's should I use intranet
>>  crawl or whole web crawl.
>>  - How do I set recrawl's for these same web sites after the first crawl.
>>  - If I have to start search the results via my own java code which jar
>>  files or api's or samples should I be looking into.
>>  - Is there a book on Nutch?
>>
>> Thanks a bunch for your patience. I appreciate your time.
>>
>> ./Abishek
>>

Re: Few questions from a newbie

Posted by Charan K <ch...@gmail.com>.
Refer to NutchBean.java for the third question. You can run it from the command line to test the index.

 If you use SOLR indexing, it is going to be much simpler; they have a Solr Java client.

Sent from my iPhone
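Both suggestions can be tried from the shell: NutchBean has a main class that bin/nutch can invoke with a query term, and a Solr index answers plain HTTP queries (the query values are illustrative; the Solr URL matches the one used earlier in the thread):

bin/nutch org.apache.nutch.searcher.NutchBean apache
curl 'http://127.0.0.1:8080/solr/select?q=content:nutch&fl=url,title&wt=json'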

On Jan 24, 2011, at 8:07 PM, Amna Waqar <am...@gmail.com> wrote:

> 1,to crawl just 5 to 6 websites,u can use both cases but intranet crawl
> gives u more control and speed
> 2.After the first crawl,the recrawling the same sites time is 30 days by
> default in db.fetcher.interval,you can change it according to ur own
> convenience.
> 3.I ve no idea about the third question
> cz  i m also a newbie
> Best of luck with nutch learning
> 
> 
> On Mon, Jan 24, 2011 at 9:04 PM, .: Abhishek :. <ab...@gmail.com> wrote:
> 
>> Hi all,
>> 
>> I am very new to Nutch and Lucene as well. I am having few questions about
>> Nutch, I know they are very much basic but I could not get clear cut
>> answers
>> out of googling for this. The questions are,
>> 
>>  - If I have to crawl just 5-6 web sites or URL's should I use intranet
>>  crawl or whole web crawl.
>>  - How do I set recrawl's for these same web sites after the first crawl.
>>  - If I have to start search the results via my own java code which jar
>>  files or api's or samples should I be looking into.
>>  - Is there a book on Nutch?
>> 
>> Thanks a bunch for your patience. I appreciate your time.
>> 
>> ./Abishek
>> 

Re: Few questions from a newbie

Posted by Amna Waqar <am...@gmail.com>.
1. To crawl just 5 to 6 websites, you can use both approaches, but an intranet crawl
gives you more control and speed.
2. After the first crawl, the recrawl interval for the same sites is 30 days by
default (db.fetcher.interval); you can change it to suit your own
convenience.
3. I have no idea about the third question
because I am also a newbie.
Best of luck with Nutch learning!


On Mon, Jan 24, 2011 at 9:04 PM, .: Abhishek :. <ab...@gmail.com> wrote:

> Hi all,
>
>  I am very new to Nutch and Lucene as well. I am having few questions about
> Nutch, I know they are very much basic but I could not get clear cut
> answers
> out of googling for this. The questions are,
>
>   - If I have to crawl just 5-6 web sites or URL's should I use intranet
>   crawl or whole web crawl.
>   - How do I set recrawl's for these same web sites after the first crawl.
>   - If I have to start search the results via my own java code which jar
>   files or api's or samples should I be looking into.
>   - Is there a book on Nutch?
>
> Thanks a bunch for your patience. I appreciate your time.
>
> ./Abishek
>