Posted to user@nutch.apache.org by "Del Rio, Ann" <ad...@ebay.com> on 2008/06/02 18:54:16 UTC

RE: Indexing XML-based document format per DITA standard

Here's what I got from our network people:

"Interestingly enough, both the server where you are running Nutch and
v4 are in the same area; they are even on the same network segment, so
firewall or proxy settings won't be an issue here. It's possible the
Java servlet or Tomcat container for the v4 BinDox app has
crawler-specific responses if the crawler is too aggressive or trips
some threshold. If not, I wonder if Nutch has a special way of
referencing a non-standard HTTP port (i.e., in your case: 10000)?"
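
A note on the port question: in standard URL syntax the port is part of the URL itself (http://v4:10000/lib), so a Java HTTP client should not need any special notation for it. The sketch below only illustrates how a Java program reads the port out of such a URL; the class and method names are made up for illustration and are not Nutch code.

```java
import java.net.URI;

// Illustrative only: PortCheck and effectivePort are hypothetical names,
// not part of Nutch or commons-httpclient.
final class PortCheck {

    /** Returns the port a client would connect to for the given URL. */
    static int effectivePort(String url) {
        URI uri = URI.create(url);
        int port = uri.getPort();              // -1 when the URL names no port
        if (port != -1) {
            return port;
        }
        // Fall back to the scheme's default port.
        return "https".equalsIgnoreCase(uri.getScheme()) ? 443 : 80;
    }

    public static void main(String[] args) {
        System.out.println(effectivePort("http://v4:10000/lib"));
        System.out.println(effectivePort("http://v4/lib"));
    }
}
```

If the explicit port is parsed this way, the URL with :10000 and the port-less URLs differ only in the number handed to the socket, which would point away from the port as the cause.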

Thanks, 
Ann Del Rio


-----Original Message-----
From: ogjunk-nutch@yahoo.com [mailto:ogjunk-nutch@yahoo.com] 
Sent: Friday, May 30, 2008 3:47 PM
To: nutch-user@lucene.apache.org
Subject: Re: Indexing XML-based document format per DITA standard

The fact that you got "java.net.SocketException: Connection reset" in
that error tells you and your network people this is a networking
problem.
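
One way to narrow this down is to check whether the reset depends on who is asking rather than on the network path, since wget from the same machine got a 200 OK for the same URL. The following is a minimal sketch, assuming plain java.net.HttpURLConnection instead of Nutch's own HTTP plugin; the class name and both User-Agent strings are hypothetical placeholders to be replaced with the real wget and Nutch agent values.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Illustrative only: AgentProbe is a hypothetical helper, not Nutch code.
final class AgentProbe {

    /** Prepares (but does not send) a GET request carrying the given User-Agent. */
    static HttpURLConnection prepare(String url, String userAgent) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("GET");
            conn.setRequestProperty("User-Agent", userAgent);
            conn.setConnectTimeout(5000);   // milliseconds
            conn.setReadTimeout(5000);      // milliseconds
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String url = "http://v4:10000/lib";           // URL from the thread
        String[] agents = {                            // placeholder agent strings
            "probe-wget-like/1.0",
            "probe-crawler-like/1.0"
        };
        for (String agent : agents) {
            HttpURLConnection conn = prepare(url, agent);
            try {
                // getResponseCode() sends the request and reads the status line.
                System.out.println(agent + " -> HTTP " + conn.getResponseCode());
            } catch (IOException e) {
                // A "Connection reset" here would surface as a SocketException.
                System.out.println(agent + " -> " + e);
            } finally {
                conn.disconnect();
            }
        }
    }
}
```

If one agent string gets a response and the other is reset, the server side is probably treating the clients differently; if both behave the same, the User-Agent is unlikely to be the trigger.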

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: "Del Rio, Ann" <ad...@ebay.com>
> To: nutch-user@lucene.apache.org
> Sent: Friday, May 30, 2008 5:37:48 PM
> Subject: RE: Indexing XML-based document format per DITA standard
> 
> 
> Yes, I can reproduce it and it happens every time.
> 
> Apparently, it only happens with this website, which is why I was 
> wondering whether it has something to do with the way the pages are 
> formatted or fetched.
> All the other internal websites that I am crawling are fine; the 
> difference is that the other URLs do not have port numbers and are 
> mostly static pages rather than a DITA framework that fetches and 
> redirects the pages from a servlet.
> 
> At the same time, I am also checking with network security if it is a 
> firewall issue or a port that they need to open for crawler-type 
> traffic.
> 
> Thanks,
> Ann Del Rio
> 
> 
> -----Original Message-----
> From: ogjunk-nutch@yahoo.com [mailto:ogjunk-nutch@yahoo.com]
> Sent: Friday, May 30, 2008 2:16 PM
> To: nutch-user@lucene.apache.org
> Subject: Re: Indexing XML-based document format per DITA standard
> 
> It looks like you can indeed connect to that v4 machine from the 
> machine running Nutch.  I can't tell from here why you got the error 
> you originally reported.  Does it happen every time you try running
> Nutch?
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> ----- Original Message ----
> > From: "Del Rio, Ann" 
> > To: nutch-user@lucene.apache.org
> > Sent: Friday, May 30, 2008 3:23:00 PM
> > Subject: RE: Indexing XML-based document format per DITA standard
> > 
> > Thank you for your response and help Otis!
> > I greatly appreciate it and am sure others will.
> > 
> > 
> > I did a wget from the machine where I was running Nutch and got the 
> > following...
> > 
> > -bash-2.05b$ wget http://v4:10000/lib
> > --10:37:52--  http://v4:10000/lib
> >            => `lib.1'
> > Resolving v4... done.
> > Connecting to v4:10000... connected.
> > HTTP request sent, awaiting response... 200 OK
> > Length: 2,717 [text/html]
> > 100%[====================================>] 2,717          2.59M/s    ETA 00:00
> > 10:37:52 (2.59 MB/s) - `lib.1' saved [2717/2717]
> > 
> > Then I tried to telnet too and got a connection closed.
> > 
> > -bash-2.05b$ telnet
> > telnet> open
> > (to) v4 10000
> > Trying xxx.xxx.231.40...
> > Connected to xxxx.ebay.com (xxx.xxx.231.40).
> > Escape character is '^]'.
> > Connection closed by foreign host.
> > 
> > Doesn't telnet service/ports need to be enabled on the other end's 
> > server first before we can telnet to it? Does the nutch crawler use 
> > telnet to fetch the URL?
> > 
> > Apparently, we do not use proxy hosts and ports here at eBay in any 
> > of our APIs, so I am not sure how to get those. But I will still ask 
> > around if they know what proxy hosts and ports we are using.
> > 
> > Also, when I browse the URL it is fine, so I checked my IE browser 
> > options and checked on the LAN Settings to look for the proxy 
> > address and port and we are not using any as well.
> > 
> > 
> > Thanks,
> > Ann Del Rio
> > 
> > -----Original Message-----
> > From: ogjunk-nutch@yahoo.com [mailto:ogjunk-nutch@yahoo.com]
> > Sent: Friday, May 30, 2008 10:17 AM
> > To: nutch-user@lucene.apache.org
> > Subject: Re: Indexing XML-based document format per DITA standard
> > 
> > Can you connect to it (telnet to it, for example) directly from the
> > machine(s) where you are running Nutch?
> > (this is a network issue, nothing to do with XML/parsing)
> > 
> > 
> > Maybe you need to go through some eBay proxy?
> > 
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > 
> > 
> > ----- Original Message ----
> > > From: "Del Rio, Ann" 
> > > To: nutch-user@lucene.apache.org
> > > Sent: Friday, May 30, 2008 6:24:01 PM
> > > Subject: Indexing XML-based document format per DITA standard
> > > 
> > > I added a new URL to index which is in an XML-based document format
> > > per the DITA standard, and I get the following error.
> > > 
> > > java.net.SocketException: Connection reset
> > > 2008-05-27 17:56:58 ERROR Http                 at java.net.SocketInputStream.read(SocketInputStream.java:168)
> > > 2008-05-27 17:56:58 ERROR Http                 at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> > > 2008-05-27 17:56:58 ERROR Http                 at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
> > > 2008-05-27 17:56:58 ERROR Http                 at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:77)
> > > 2008-05-27 17:56:58 ERROR Http                 at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:105)
> > > 2008-05-27 17:56:58 ERROR Http                 at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1115)
> > > 2008-05-27 17:56:58 ERROR Http                 at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1373)
> > > 2008-05-27 17:56:58 ERROR Http                 at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1832)
> > > 2008-05-27 17:56:58 ERROR Http                 at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1590)
> > > 2008-05-27 17:56:58 ERROR Http                 at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:995)
> > > 2008-05-27 17:56:58 ERROR Http                 at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:397)
> > > 2008-05-27 17:56:58 ERROR Http                 at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:170)
> > > 2008-05-27 17:56:58 ERROR Http                 at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:396)
> > > 2008-05-27 17:56:58 ERROR Http                 at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
> > > 2008-05-27 17:56:58 ERROR Http                 at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:96)
> > > 2008-05-27 17:56:58 ERROR Http                 at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:99)
> > > 2008-05-27 17:56:58 ERROR Http                 at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:219)
> > > 2008-05-27 17:56:58 ERROR Http                 at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:145)
> > > 2008-05-27 17:56:58 INFO  Fetcher              fetch of http://v4:10000/lib   failed with: java.net.SocketException: Connection reset
> > > 
> > > I googled and have found no solution so far...
> > > 
> > > Do I need to set up some config / host file to specify the ports?
> > > The URL is an internal website.
> > > 
> > > Any response will be appreciated.
> > > 
> > > Thanks,
> > > Ann Del Rio
> > > Senior Developer
> > > eBay, Inc