Posted to user@nutch.apache.org by "Del Rio, Ann" <ad...@ebay.com> on 2008/06/16 18:22:50 UTC
how does nutch connect to urls internally?
Good morning,
Can you please point me to Nutch documentation that explains how Nutch
connects to web pages when it crawls? I think it is through HTTP, but I
would like to confirm and get more details so I can write a very small
test Java program to connect to one of the web pages I am having trouble
crawling. I bought Lucene in Action and am halfway through the book, and
so far there is very little about Nutch.
Thanks,
Ann Del Rio
Ph: 408.376.6504
E-mail: adelrio@ebay.com
Skype: delrio_alan
RE: how does nutch connect to urls internally?
Posted by "Del Rio, Ann" <ad...@ebay.com>.
Hello,
I tried this simple JUnit program first, before trying the Nutch classes
for http:
import java.io.BufferedInputStream;
import java.io.InputStream;
import java.io.StringWriter;
import java.net.URL;
import junit.framework.TestCase;
public class BinDoxTest extends TestCase {
    // Fetches the start page over HTTP and prints the raw response body.
    // Any exception now fails the test instead of being silently swallowed.
    public void testHttp() throws Exception {
        URL url = new URL("http://v4:10000/lib");
        StringWriter writer = new StringWriter();
        // try-with-resources closes the stream even if a read fails
        try (InputStream in = new BufferedInputStream(url.openStream())) {
            // Copies byte by byte, which is enough for a quick look at the page
            for (int c = in.read(); c != -1; c = in.read()) {
                writer.write(c);
            }
        }
        System.out.println(writer);
    }
}
It printed the following output, which is the same as what I get from
wget in a Linux shell:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<HTML>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Bindox Library</title>
<link rel="icon"
href="/classpath/com/ebay/content/sharedcontent/images/favicon.ico"
type="image/vnd.microsoft.icon">
<script language="JavaScript">
function topicLoaded(href, title) {
ContentFrame.ContentToolbarFrame.setTitle(title);
}
var maximizeListeners=new Object();
function registerMaximizeListener(name, listener){
maximizeListeners[name]=listener;
}
function notifyMaximizeListeners(name, maximizedNotRestored){
maximizeListeners[name](maximizedNotRestored);
}
var leftCols = "29.5%";
var rightCols = "70.5%";
// called from *Toolbar pages
function toggleFrame(title)
{
var frameset = document.getElementById("BindoxFrameset");
var navFrameSize = frameset.getAttribute("cols");
var comma = navFrameSize.indexOf(',');
var left = navFrameSize.substring(0,comma);
var right = navFrameSize.substring(comma+1);
if (left == "*" || right == "*") {
// restore frames
frameset.frameSpacing="3";
frameset.setAttribute("border", "6");
frameset.setAttribute("cols", leftCols+","+rightCols);
notifyMaximizeListeners(title, false);
} else {
// the "cols" attribute is not always accurate, especially after resizing.
// offsetWidth is also not accurate, so we do a combination of both and
// should get a reasonable behavior
var leftSize = NavFrame.document.body.offsetWidth;
var rightSize = ContentFrame.document.body.offsetWidth;
leftCols = leftSize * 100 / (leftSize + rightSize);
rightCols = 100 - leftCols;
// maximize the frame.
//leftCols = left;
//rightCols = right;
if (title == "Contents") // this is the content toolbar
frameset.setAttribute("cols", "*,100%");
else // this is the left side for left-to-right rendering
frameset.setAttribute("cols", "100%,*");
frameset.frameSpacing="0";
frameset.setAttribute("border", "1");
notifyMaximizeListeners(title, true);
}
}
</script>
</head>
<frameset id="BindoxFrameset" cols="29.5%,70.5%" framespacing="4"
border="4" frameborder="1" scrolling="no">
<frame class="nav" name="NavFrame" title="Layout frame: NavFrame" src='/com/ebay/content/sharedcontent/toc/NavFrame.jsp?null' marginwidth="0" marginheight="0" scrolling="no" frameborder="1" resize=yes>
<frame class="content" name="ContentFrame" title="Layout frame: ContentFrame" src='/com/ebay/content/sharedcontent/topic/ContentFrame.jsp' marginwidth="0" marginheight="0" scrolling="no" frameborder="0" resize=yes>
</frameset>
</HTML>
Can you please help me figure out whether there is something unusual about
this starting page? When I run Nutch to start indexing from the page above,
it fails with a "SocketException: Connection Reset Error". Can Nutch index
frames?
I will try the Nutch http classes next. Our network admin said the problem
might be VMware freezing or timing out on HTTP 1.0 requests but not on
HTTP 1.1.
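One way to check the HTTP 1.0 versus HTTP 1.1 theory outside Nutch is to send the same GET request over a plain socket with each version and compare the results. The following is only a minimal sketch: the class name HttpVersionProbe and its helper methods are made up for illustration, the host, port and path are taken from the URL in the test above, and the 10 second timeouts are an arbitrary assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch.
class HttpVersionProbe {

    /** Builds a minimal GET request for the given HTTP version ("1.0" or "1.1"). */
    static String buildRequest(String host, int port, String path, String version) {
        return "GET " + path + " HTTP/" + version + "\r\n"
                + "Host: " + host + ":" + port + "\r\n"
                + "Connection: close\r\n"
                + "\r\n";
    }

    /** Sends the request and returns the first line of the response (the status line). */
    static String fetchStatusLine(String host, int port, String path, String version)
            throws IOException {
        try (Socket socket = new Socket()) {
            // Timeouts are an assumption; adjust them to match the crawl settings.
            socket.connect(new InetSocketAddress(host, port), 10_000);
            socket.setSoTimeout(10_000);
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(host, port, path, version)
                    .getBytes(StandardCharsets.US_ASCII));
            out.flush();
            BufferedReader in = new BufferedReader(new InputStreamReader(
                    socket.getInputStream(), StandardCharsets.ISO_8859_1));
            return in.readLine();
        }
    }

    public static void main(String[] args) {
        // Try both versions against the same page and report each outcome.
        for (String version : new String[] {"1.0", "1.1"}) {
            try {
                System.out.println("HTTP/" + version + " -> "
                        + fetchStatusLine("v4", 10000, "/lib", version));
            } catch (IOException e) {
                System.out.println("HTTP/" + version + " failed: " + e);
            }
        }
    }
}
```

If only the HTTP/1.0 request is reset or times out, that would support the network admin's suggestion; if both succeed, the cause is more likely elsewhere.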
Thanks,
Ann Del Rio
-----Original Message-----
From: Susam Pal [mailto:susam.pal@gmail.com]
Sent: Monday, June 16, 2008 9:48 AM
To: nutch-user@lucene.apache.org
Subject: Re: how does nutch connect to urls internally?
Hi,
It depends on which protocol plugin is enabled in your
'conf/nutch-site.xml'. The property to look for is 'plugins.include'
in the XML file. If this is not present in 'conf/nutch-site.xml', it
means you are using the default 'plugins.include' of
'conf/nutch-default.xml'.
If protocol-http is enabled, then you have to go through the code in:-
src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/Http.java
src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
If protocol-httpclient is enabled, then you have to go through:-
src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java
src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
Enabling DEBUG logs in 'conf/log4j.properties' will also give you clues
about the problems. The logs are written to 'logs/hadoop.log'.
To enable the DEBUG logs for a particular package, say, the httpclient
package, you can open 'conf/log4j.properties' and add the following
line:
log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
Regards,
Susam Pal
On Mon, Jun 16, 2008 at 9:52 PM, Del Rio, Ann <ad...@ebay.com> wrote:
> Good morning,
>
> Can you please point me to a Nutch documentation where I can find how
> nutch connects to the webpages when it crawls? I think it is through
> HTTP but i would like to confirm and get more details so i can write a
> very small test java program to connect to one of the webpages i am
> having trouble connecting / crawling. I bought Lucene in Action and am
> half way thru the book and so far there is very little about Nutch.
>
> Thanks,
> Ann Del Rio
RE: how does nutch connect to urls internally?
Posted by "Del Rio, Ann" <ad...@ebay.com>.
Thank you for the great and detailed information, Susam!
I will post my test program once it works.
Thanks,
Ann Del Rio
Re: how does nutch connect to urls internally?
Posted by Susam Pal <su...@gmail.com>.
Hi,
It depends on which protocol plugin is enabled in your
'conf/nutch-site.xml'. The property to look for is 'plugins.include'
in the XML file. If this is not present in 'conf/nutch-site.xml', it
means you are using the default 'plugins.include' of
'conf/nutch-default.xml'.
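The check described above can also be done with a few lines of Java. This is only a sketch using the JDK's XML parser, not Nutch's own configuration classes: the class name PluginsIncludeCheck and its methods are invented for illustration, and it assumes that the 'plugins.include' value is a regular expression that must match a whole plugin id.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch.
class PluginsIncludeCheck {

    /** Returns the value of the named property, or null if the file does not define it. */
    static String readProperty(InputStream xml, String propertyName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
        NodeList props = doc.getElementsByTagName("property");
        for (int i = 0; i < props.getLength(); i++) {
            Element prop = (Element) props.item(i);
            if (propertyName.equals(childText(prop, "name"))) {
                return childText(prop, "value");
            }
        }
        return null;
    }

    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    /** Assumption: a plugin is enabled when the whole id matches the include regex. */
    static boolean isIncluded(String pluginsInclude, String pluginId) {
        return Pattern.compile(pluginsInclude).matcher(pluginId).matches();
    }

    public static void main(String[] args) throws Exception {
        String file = args.length > 0 ? args[0] : "conf/nutch-site.xml";
        try (InputStream in = Files.newInputStream(Paths.get(file))) {
            String include = readProperty(in, "plugins.include");
            if (include == null) {
                System.out.println("plugins.include is not set in " + file
                        + "; the default from conf/nutch-default.xml applies");
                return;
            }
            for (String id : new String[] {"protocol-http", "protocol-httpclient"}) {
                System.out.println(id + " enabled: " + isIncluded(include, id));
            }
        }
    }
}
```

Run it from the Nutch directory, or pass the path to the XML file as the first argument.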
If protocol-http is enabled, then you have to go through the code in:-
src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/Http.java
src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
If protocol-httpclient is enabled, then you have to go through:-
src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java
src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
Enabling DEBUG logs in 'conf/log4j.properties' will also give you
clues about the problems. The logs are written to 'logs/hadoop.log'.
To enable the DEBUG logs for a particular package, say, the httpclient
package, you can open 'conf/log4j.properties' and add the following
line:
log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
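Once DEBUG logging is on, 'logs/hadoop.log' can get long. A small filter such as the sketch below can pull out the relevant lines. The class name LogGrep and the default keyword "SocketException" are only illustrative choices; it makes no assumption about the log line format beyond plain text.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch.
class LogGrep {

    /** Returns the lines containing the keyword, each prefixed with its 1-based line number. */
    static List<String> matching(List<String> lines, String keyword) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (lines.get(i).contains(keyword)) {
                hits.add((i + 1) + ": " + lines.get(i));
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // The first argument overrides the keyword, e.g. "httpclient".
        String keyword = args.length > 0 ? args[0] : "SocketException";
        List<String> lines = Files.readAllLines(Paths.get("logs/hadoop.log"));
        for (String hit : matching(lines, keyword)) {
            System.out.println(hit);
        }
    }
}
```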
Regards,
Susam Pal
On Mon, Jun 16, 2008 at 9:52 PM, Del Rio, Ann <ad...@ebay.com> wrote:
> Good morning,
>
> Can you please point me to a Nutch documentation where I can find how nutch
> connects to the webpages when it crawls? I think it is through HTTP but i
> would like to confirm and get more details so i can write a very small test
> java program to connect to one of the webpages i am having trouble
> connecting / crawling. I bought Lucene in Action and am half way thru the
> book and so far there is very little about Nutch.
>
> Thanks,
> Ann Del Rio
> Ph: 408.376.6504
> E-mail: adelrio@ebay.com
> Skype: delrio_alan
>