
Posted to user@nutch.apache.org by "Del Rio, Ann" <ad...@ebay.com> on 2008/06/16 18:22:50 UTC

how does nutch connect to urls internally?

Good morning,
 
Can you please point me to Nutch documentation that explains how Nutch
connects to web pages when it crawls? I think it is through HTTP, but I
would like to confirm and get more details so I can write a very small
test Java program to connect to one of the web pages I am having trouble
connecting to / crawling. I bought Lucene in Action and am halfway
through the book, and so far there is very little about Nutch.
 
Thanks,
 
Ann Del Rio
Ph: 408.376.6504
E-mail: adelrio@ebay.com
Skype: delrio_alan

RE: how does nutch connect to urls internally?

Posted by "Del Rio, Ann" <ad...@ebay.com>.

Hello,

I tried this simple JUnit program before trying the Nutch classes for
HTTP:

	import java.io.BufferedInputStream;
	import java.io.StringWriter;
	import java.net.URL;
	import junit.framework.TestCase;

	public class BinDoxTest extends TestCase {
		// Fetches the start page with the JDK's built-in HTTP client and prints it.
		public void testHttp() throws Exception {
			URL url = new URL("http://v4:10000/lib");
			StringWriter writer = new StringWriter();
			// try-with-resources closes the stream even if a read fails
			try (BufferedInputStream in = new BufferedInputStream(url.openStream())) {
				// each byte is written as one char, which is fine for ASCII content
				for (int c = in.read(); c != -1; c = in.read()) {
					writer.write(c);
				}
			}
			// any exception now fails the test instead of being silently swallowed
			System.out.println(writer);
		}
	}

I got the following output, which is the same as what wget returns from
a Linux shell:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<HTML>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Bindox Library</title>
<link rel="icon"
href="/classpath/com/ebay/content/sharedcontent/images/favicon.ico"
type="image/vnd.microsoft.icon">
<script language="JavaScript">

function topicLoaded(href, title) {
	ContentFrame.ContentToolbarFrame.setTitle(title);
}

var maximizeListeners=new Object();
function registerMaximizeListener(name, listener){
	maximizeListeners[name]=listener;
}
function notifyMaximizeListeners(name, maximizedNotRestored){
	maximizeListeners[name](maximizedNotRestored);
}

var leftCols = "29.5%";
var rightCols = "70.5%";

// called from *Toolbar pages
function toggleFrame(title)
{
	var frameset = document.getElementById("BindoxFrameset"); 
	var navFrameSize = frameset.getAttribute("cols");
	var comma = navFrameSize.indexOf(',');
	var left = navFrameSize.substring(0,comma);
	var right = navFrameSize.substring(comma+1);

	if (left == "*" || right == "*") {
		// restore frames
		frameset.frameSpacing="3";
		frameset.setAttribute("border", "6");
		frameset.setAttribute("cols", leftCols+","+rightCols);
		notifyMaximizeListeners(title, false);
	} else {
		// the "cols" attribute is not always accurate, especially after resizing.
		// offsetWidth is also not accurate, so we do a combination of both and
		// should get a reasonable behavior

		var leftSize = NavFrame.document.body.offsetWidth;
		var rightSize = ContentFrame.document.body.offsetWidth;

		
		leftCols = leftSize * 100 / (leftSize + rightSize);
		rightCols = 100 - leftCols;

		// maximize the frame.
		//leftCols = left;
		//rightCols = right;
		if (title == "Contents") // this is the content toolbar
			frameset.setAttribute("cols", "*,100%");
		else // this is the left side for left-to-right rendering
			frameset.setAttribute("cols", "100%,*");
	
		frameset.frameSpacing="0";
		frameset.setAttribute("border", "1");
		notifyMaximizeListeners(title, true);
	}
}

</script>

</head>

<frameset id="BindoxFrameset" cols="29.5%,70.5%" framespacing="4"
border="4"  frameborder="1"   scrolling="no">

   	<frame class="nav" name="NavFrame" title="Layout frame: NavFrame"
src='/com/ebay/content/sharedcontent/toc/NavFrame.jsp?null'
marginwidth="0" marginheight="0" scrolling="no" frameborder="1"
resize=yes>

   	<frame class="content" name="ContentFrame" title="Layout frame: ContentFrame"
src='/com/ebay/content/sharedcontent/topic/ContentFrame.jsp'
marginwidth="0" marginheight="0" scrolling="no" frameborder="0"
resize=yes>

</frameset>
</HTML>


Can you please help me understand whether there is something unusual
about this start page of the website? Nutch gives me a
"SocketException: Connection Reset Error" when I run it to start
indexing from the page above. Can Nutch index frames?

I will try the Nutch HTTP classes next, as our network admin said it
might be an issue with VMware freezing or timing out for HTTP 1.0 and
not HTTP 1.1.
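
One way to check the HTTP 1.0 versus HTTP 1.1 theory outside of Nutch is to send the same GET request over a raw socket with each version string and compare what comes back. The following is only a minimal sketch: the class and method names (HttpVersionProbe, buildRequest, fetchStatusLine) are made up for illustration, the host, port and path are taken from the test URL above, and the 10 second timeout is an arbitrary assumption. It does not reproduce what either Nutch protocol plugin actually sends.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch: sends a bare GET request with a
// chosen HTTP version and reports the first line of the response.
public class HttpVersionProbe {

    // Builds a minimal GET request; version is "1.0" or "1.1".
    public static String buildRequest(String host, int port, String path, String version) {
        StringBuilder sb = new StringBuilder();
        sb.append("GET ").append(path).append(" HTTP/").append(version).append("\r\n");
        // Host is mandatory in HTTP/1.1 and harmless in HTTP/1.0.
        sb.append("Host: ").append(host).append(':').append(port).append("\r\n");
        // Ask the server to close the connection after the response.
        sb.append("Connection: close\r\n");
        sb.append("\r\n");
        return sb.toString();
    }

    // Opens a socket, sends the request, and returns the status line
    // (or null if the server closed the connection without answering).
    public static String fetchStatusLine(String host, int port, String path,
                                         String version, int timeoutMs) throws IOException {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            socket.setSoTimeout(timeoutMs); // fail instead of hanging on a stalled read
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(host, port, path, version).getBytes(StandardCharsets.US_ASCII));
            out.flush();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.ISO_8859_1));
            return in.readLine();
        }
    }

    public static void main(String[] args) {
        // Host, port and path come from the test URL in this message.
        String host = "v4";
        int port = 10000;
        String path = "/lib";
        for (String version : new String[] {"1.0", "1.1"}) {
            try {
                String status = fetchStatusLine(host, port, path, version, 10000);
                System.out.println("HTTP/" + version + " -> " + status);
            } catch (IOException e) {
                // A connection reset or timeout for only one version would support the theory.
                System.out.println("HTTP/" + version + " failed: " + e);
            }
        }
    }
}
```

If both versions return a status line, the version difference alone is probably not the cause, and the DEBUG logs of the enabled protocol plugin would be the next place to look.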

Thanks,
Ann Del Rio

-----Original Message-----
From: Susam Pal [mailto:susam.pal@gmail.com] 
Sent: Monday, June 16, 2008 9:48 AM
To: nutch-user@lucene.apache.org
Subject: Re: how does nutch connect to urls internally?

Hi,

It depends on which protocol plugin is enabled in your
'conf/nutch-site.xml'. The property to look for is 'plugins.include'
in the XML file. If this is not present in 'conf/nutch-site.xml', it
means you are using the default 'plugins.include' of
'conf/nutch-default.xml'.

If protocol-http is enabled, then you have to go through the code in:-

src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/Http.java
src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java

If protocol-httpclient is enabled, then you have to go through:-

src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java
src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java

Enabling DEBUG logs in 'conf/log4j.properties' will also give you clues
about the problems. The logs are written to 'logs/hadoop.log'.
To enable the DEBUG logs for a particular package, say, the httpclient
package, you can open 'conf/log4j.properties' and add the following
line:

log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout

Regards,
Susam Pal

On Mon, Jun 16, 2008 at 9:52 PM, Del Rio, Ann <ad...@ebay.com> wrote:
> Good morning,
>
> Can you please point me to a Nutch documentation where I can find how
> nutch connects to the webpages when it crawls? I think it is through
> HTTP but i would like to confirm and get more details so i can write a
> very small test java program to connect to one of the webpages i am
> having trouble connecting / crawling. I bought Lucene in Action and am
> half way thru the book and so far there is very little about Nutch.
>
> Thanks,
> Ann Del Rio

RE: how does nutch connect to urls internally?

Posted by "Del Rio, Ann" <ad...@ebay.com>.

Thank you for the great and detailed information Susam! 
Will post back my test program when successful.

Thanks, 
Ann Del Rio


Re: how does nutch connect to urls internally?

Posted by Susam Pal <su...@gmail.com>.

Hi,

It depends on which protocol plugin is enabled in your
'conf/nutch-site.xml'. The property to look for is 'plugins.include'
in the XML file. If this is not present in 'conf/nutch-site.xml', it
means you are using the default 'plugins.include' of
'conf/nutch-default.xml'.

If protocol-http is enabled, then you have to go through the code in:-

src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/Http.java
src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java

If protocol-httpclient is enabled, then you have to go through:-

src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java
src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
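
To see which of the two plugins applies, the lookup described above can be sketched in plain Java: read 'plugins.include' from the site file, fall back to the default file if it is absent, then test a plugin id against the value. This is only an illustration under some assumptions: it uses the JDK XML parser rather than Nutch's own configuration classes, the class and method names (PluginsIncludeLookup, findProperty, resolve, isEnabled) are hypothetical, the sample XML value is shortened and made up, and it assumes the value is a regular expression that must match the whole plugin id.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: resolves 'plugins.include' the way
// the message describes (site file first, default file as fallback).
public class PluginsIncludeLookup {

    // Returns the <value> of the named <property>, or null if it is absent.
    public static String findProperty(String xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element p = (Element) props.item(i);
                NodeList names = p.getElementsByTagName("name");
                NodeList values = p.getElementsByTagName("value");
                if (names.getLength() > 0 && values.getLength() > 0
                        && name.equals(names.item(0).getTextContent().trim())) {
                    return values.item(0).getTextContent().trim();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse configuration XML", e);
        }
    }

    // The site file wins; the default file is used only when the site file
    // does not define the property.
    public static String resolve(String siteXml, String defaultXml, String name) {
        String value = findProperty(siteXml, name);
        return value != null ? value : findProperty(defaultXml, name);
    }

    // Assumption: the value is a regex that must match the entire plugin id.
    public static boolean isEnabled(String pluginsInclude, String pluginId) {
        return Pattern.compile(pluginsInclude).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Illustrative, shortened stand-ins for conf/nutch-default.xml and
        // conf/nutch-site.xml; real files contain many more properties.
        String defaults = "<configuration><property><name>plugins.include</name>"
                + "<value>protocol-http|parse-(text|html)</value></property></configuration>";
        String site = "<configuration></configuration>";

        String value = resolve(site, defaults, "plugins.include");
        System.out.println("plugins.include = " + value);
        System.out.println("protocol-http enabled: " + isEnabled(value, "protocol-http"));
        System.out.println("protocol-httpclient enabled: "
                + isEnabled(value, "protocol-httpclient"));
    }
}
```

Matching against the whole id matters here because "protocol-httpclient" contains "protocol-http" as a prefix, so a simple substring check would report both plugins as enabled.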

Enabling DEBUG logs in 'conf/log4j.properties' will also give you
clues about the problems. The logs are written to 'logs/hadoop.log'.
To enable the DEBUG logs for a particular package, say, the httpclient
package, you can open 'conf/log4j.properties' and add the following
line:

log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout

Regards,
Susam Pal

On Mon, Jun 16, 2008 at 9:52 PM, Del Rio, Ann <ad...@ebay.com> wrote:
> Good morning,
>
> Can you please point me to a Nutch documentation where I can find how nutch
> connects to the webpages when it crawls? I think it is through HTTP but i
> would like to confirm and get more details so i can write a very small test
> java program to connect to one of the webpages i am having trouble
> connecting / crawling. I bought Lucene in Action and am half way thru the
> book and so far there is very little about Nutch.
>
> Thanks,
> Ann Del Rio
> Ph: 408.376.6504
> E-mail: adelrio@ebay.com
> Skype: delrio_alan
>