Posted to user@nutch.apache.org by al...@aim.com on 2008/01/09 23:26:25 UTC

some crawl problems

Hello all,

I am using Nutch 0.9, and when I fetch a couple of sites, Nutch does not include pages other than the main one.
For example, if I have mysite.com/cv.htm, Nutch fetches only mysite.com. It does not fetch cv.htm or the other files on the site.
I noticed that if I run

bin/nutch generate crawl/crawldb crawl/segments -topN 1000

after

bin/nutch generate crawl/crawldb crawl/segments

it includes some of those pages, but not all of them.

Is there any way to tell Nutch to crawl all the objects in mysite.com?

Also, I wondered how to put Nutch in a website, let's say at mysite.com/search?

Thanks in advance.
Alex.



-----Original Message-----
From: payo <pa...@yahoo.com>
To: nutch-user@lucene.apache.org
Sent: Wed, 9 Jan 2008 10:18 am
Subject: Re: subcollections




Hi to all,

I was able to configure this part.

1.- Add the subcollection plugin to nutch-site.xml in Tomcat:

Tomcat\webapps\ROOT\WEB-INF\classes\nutch-site.xml

2.- Add a select element to search.jsp listing the subcollections:

line 147 <form name="search" action="../search.jsp" method="get">
 <SELECT NAME="subcollection">
   <option selected value="<%=subcoleccion%>"><%=subcoleccion%></option>
   <OPTION VALUE="apache">Apache</OPTION>
   <OPTION VALUE="nutch">Nutch</OPTION>
   <OPTION VALUE="xml">XML</OPTION>
</SELECT>


Thanks

-- 
View this message in context: http://www.nabble.com/subcollections-tp14373976p14716644.html
Sent from the Nutch - User mailing list archive at Nabble.com.




Re: some crawl problems

Posted by al...@aim.com.
Hi,

In my urls file I have mysite.com, and this site has links to all of the files, such as cv.htm, mypaper.pdf, etc.

Thanks.
Alex.


 


 

-----Original Message-----
From: Susam Pal <su...@gmail.com>
To: nutch-user@lucene.apache.org
Sent: Wed, 9 Jan 2008 8:34 pm
Subject: Re: some crawl problems

What is present in your seed url list? Nutch fetches new URLs during a
fetch in the next level of depth by discovering new URLs from the
current fetch.

So, if you have http://mysite.com/ in your seed URL list and the home
page does not have a link to http://mysite.com/cv.htm, the crawler
wouldn't be able to reach that page.

Regards,
Susam Pal


Re: some crawl problems

Posted by Susam Pal <su...@gmail.com>.
What is present in your seed url list? Nutch fetches new URLs during a
fetch in the next level of depth by discovering new URLs from the
current fetch.

So, if you have http://mysite.com/ in your seed URL list and the home
page does not have a link to http://mysite.com/cv.htm, the crawler
wouldn't be able to reach that page.
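
To make the depth idea concrete, here is a toy Python model of repeated crawl rounds. It is only a sketch and not Nutch code: the function name crawl_rounds, the link_graph dictionary, and the page photo.htm are made up for illustration. Each round roughly stands in for one generate / fetch / updatedb pass, and top_n mimics the cap that -topN puts on a single round.

```python
def crawl_rounds(link_graph, seeds, rounds, top_n=None):
    """Toy model of a crawl that runs in rounds (not the Nutch implementation).

    link_graph maps a URL to the list of URLs that its page links to.
    Returns a list holding the URLs fetched in each round.
    """
    # URLs the crawl database knows about, kept in discovery order.
    known = dict.fromkeys(seeds)
    fetched = set()
    history = []

    for _ in range(rounds):
        # "generate": select known URLs that have not been fetched yet,
        # optionally capped at top_n for this round.
        due = [url for url in known if url not in fetched]
        if top_n is not None:
            due = due[:top_n]
        if not due:
            break  # nothing left to fetch

        # "fetch" and "update": fetch each selected page and record its
        # outlinks so that they become candidates for the next round.
        for url in due:
            fetched.add(url)
            for link in link_graph.get(url, []):
                known.setdefault(link)

        history.append(due)

    return history


# Hypothetical site layout: the home page links to cv.htm and mypaper.pdf,
# and cv.htm links to one further page.
site = {
    "http://mysite.com/": [
        "http://mysite.com/cv.htm",
        "http://mysite.com/mypaper.pdf",
    ],
    "http://mysite.com/cv.htm": ["http://mysite.com/photo.htm"],
}

for depth, urls in enumerate(
    crawl_rounds(site, ["http://mysite.com/"], rounds=3), start=1
):
    print(depth, urls)
```

In this model a single round fetches only the seed URL. The linked pages show up in the second round, and a small top_n leaves some of them for a later round, which may be why a second generate run picked up some pages but not all of them.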

Regards,
Susam Pal
