Posted to user@nutch.apache.org by Cam Bazz <ca...@gmail.com> on 2011/08/10 14:32:28 UTC

questions about solrwriter

Hello,

From SolrWriter.java:

  public void write(NutchDocument doc) throws IOException {

    final SolrInputDocument inputDoc = new SolrInputDocument();

    for(final Entry<String, NutchField> e : doc) {
      for (final Object val : e.getValue().getValues()) {
    	
        // normalise the string representation for a Date
        Object val2 = val;

        if (val instanceof Date){
          val2 = DateUtil.getThreadLocalDateFormat().format(val);
        }

        if (e.getKey().equals("content") || e.getKey().equals("e_features")) {
          if (val != null) {
            val2 = stripNonCharCodepoints((String) val);
          }
        }

        inputDoc.addField(solrMapping.mapKey(e.getKey()), val2,
            e.getValue().getWeight());
        String sCopy = solrMapping.mapCopyKey(e.getKey());
        if (sCopy != e.getKey()) {
          inputDoc.addField(sCopy, val);
        }

      }
    }
    inputDoc.setDocumentBoost(doc.getWeight());
    inputDocs.add(inputDoc);
    if (inputDocs.size() >= commitSize) {
      try {
        LOG.info("Adding " + Integer.toString(inputDocs.size()) + " documents");
        solr.add(inputDocs);
      } catch (final SolrServerException e) {
        throw makeIOException(e);
      }
      inputDocs.clear();
    }
  }


What is happening after inputDoc.addField...? I am getting an exception
while indexing e_features because of a UTF-8 encoding error. Previously
we patched this problem for content, and now I have another
field called e_features. I wanted to apply stripNonCharCodepoints to
that as well, but I don't understand why we are doing the
if (sCopy != e.getKey()) { inputDoc.addField(sCopy, val); }


Best Regards,
C.B.

Re: fetcher runs without error with no internet connection

Posted by Markus Jelsma <ma...@openindex.io>.
DNS? DSL? 

A common way to avoid overloading a DNS server is to host your own. BIND
is a good choice. You can also try a local DNS-caching server.
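
Short of running a caching server, the resolver cache inside the JVM can also be lengthened. This is a different and weaker measure than the DNS-caching server suggested above. The following is a minimal sketch, assuming the crawl runs in a JVM where security properties can be changed; the class name JvmDnsCache and the TTL values are hypothetical, while networkaddress.cache.ttl and networkaddress.cache.negative.ttl are standard Java security properties. They are generally only honoured when set before the first lookup.

```java
import java.security.Security;

/** Hypothetical helper; call it before the crawler performs its first lookup. */
final class JvmDnsCache {

  /** Sets how long the JVM caches successful and failed lookups, in seconds. */
  static void configure(int positiveTtlSeconds, int negativeTtlSeconds) {
    Security.setProperty("networkaddress.cache.ttl",
        Integer.toString(positiveTtlSeconds));
    Security.setProperty("networkaddress.cache.negative.ttl",
        Integer.toString(negativeTtlSeconds));
  }

  public static void main(String[] args) {
    // Illustrative values only: keep good answers for an hour, failures for 10 s.
    configure(3600, 10);
    System.out.println(Security.getProperty("networkaddress.cache.ttl"));
  }
}
```

A short negative TTL keeps a temporary outage from being remembered for long, which may matter on a connection that drops periodically.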

> It is the DNS problem, because it was giving a lot of UnknownHost
> exception. I decreased thread number to 5, but still DSL fails
> periodically. I wondered what is the common internet connection for
> fetching about 3500 domains. I currently have DSL with 3 Mps.
> 
> Thanks.
> Alex.

Re: fetcher runs without error with no internet connection

Posted by al...@aim.com.
It is a DNS problem, because it was giving a lot of UnknownHost exceptions. I decreased the thread count to 5, but the DSL connection still fails periodically.
I wonder what kind of internet connection is common for fetching about 3500 domains. I currently have DSL at 3 Mbps.

Thanks.
Alex.

 

-----Original Message-----
From: Markus Jelsma <ma...@openindex.io>
To: user <us...@nutch.apache.org>
Sent: Mon, Aug 29, 2011 5:19 pm
Subject: Re: fetcher runs without error with no internet connection


I didn't say you have a DNS-problem only that these exception may occur if the 
DNS can't keep up with the requests you make. Make sure you have a DNS problem 
before trying to solve a problem that doesn't exist. It's normal to have these 
exceptions once in a while.

Solving DNS issues are beyond the scope of this list. You may, however, opt 
for some DNS caching in your network.


 

Re: fetcher runs without error with no internet connection

Posted by Markus Jelsma <ma...@openindex.io>.
I didn't say you have a DNS problem, only that these exceptions may occur if the
DNS can't keep up with the requests you make. Make sure you have a DNS problem
before trying to solve a problem that doesn't exist. It's normal to have these
exceptions once in a while.

Solving DNS issues is beyond the scope of this list. You may, however, opt
for some DNS caching in your network.
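
One way to check whether there really is a DNS problem is to resolve a sample of the crawl's hosts and count how many lookups fail. The following is a minimal sketch, not part of Nutch; the class DnsCheck and its Resolver interface are hypothetical names, and the only library call used is the standard InetAddress.getByName.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.List;

/** Hypothetical diagnostic: counts hosts whose names cannot be resolved. */
final class DnsCheck {

  /** Lookup abstraction so the check can be exercised without a network. */
  interface Resolver {
    void resolve(String host) throws UnknownHostException;
  }

  /** Resolver backed by the JVM's normal name lookup. */
  static final Resolver SYSTEM = host -> InetAddress.getByName(host);

  /** Returns how many of the given hosts raised UnknownHostException. */
  static int countFailures(List<String> hosts, Resolver resolver) {
    int failures = 0;
    for (String host : hosts) {
      try {
        resolver.resolve(host);
      } catch (UnknownHostException e) {
        failures++;
      }
    }
    return failures;
  }

  public static void main(String[] args) {
    // Host names are taken from the command line.
    List<String> hosts = List.of(args);
    int failed = countFailures(hosts, SYSTEM);
    System.out.println(failed + " of " + hosts.size() + " hosts failed to resolve");
  }
}
```

A handful of failures in a large sample would match the "once in a while" case; a large share failing while the fetcher is running points at the resolver or the connection.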

> What is the solution to the issue with DNS server?

Re: fetcher runs without error with no internet connection

Posted by al...@aim.com.
What is the solution to the issue with the DNS server?

 

 

-----Original Message-----
From: Markus Jelsma <ma...@openindex.io>
To: user <us...@nutch.apache.org>
Sent: Tue, Aug 23, 2011 12:32 pm
Subject: Re: fetcher runs without error with no internet connection


If you fetch too hard, your DNS-server may not be able to keep up.


 

Re: fetcher runs without error with no internet connection

Posted by Markus Jelsma <ma...@openindex.io>.
If you fetch too hard, your DNS-server may not be able to keep up.

> Hi Lewis,
> 
> I stopped fetcher and started it on the same segment again.
> But before doing that I turned off modem and fetcher started giving
> Unknown.Host exception. It was not giving any error, with dsl failure,
> i.e. I was not able to connect to any sites. Again this is nutch-1.2.
> 
> Thanks.
> Alex.

Re: fetcher runs without error with no internet connection

Posted by al...@aim.com.
Hi Lewis,

I stopped the fetcher and started it on the same segment again.
But before doing that I turned off the modem, and the fetcher started giving UnknownHost exceptions.
It was not giving any error during the DSL failure, i.e. when I was not able to connect to any sites. Again, this is nutch-1.2.

Thanks.
Alex.

 

 

-----Original Message-----
From: lewis john mcgibbney <le...@gmail.com>
To: user <us...@nutch.apache.org>
Sent: Tue, Aug 23, 2011 6:37 am
Subject: Re: fetcher runs without error with no internet connection


Hi Alex,

Did you get anywhere with this?

What condition led to you seeing unknown host exception?

Unless segment gets corrupted, I would assume you could fetch again.
Hopefully you can confirm this.




-- 
*Lewis*

 

Re: fetcher runs without error with no internet connection

Posted by lewis john mcgibbney <le...@gmail.com>.
Hi Alex,

Did you get anywhere with this?

What condition led to you seeing the unknown host exception?

Unless the segment got corrupted, I would assume you could fetch again.
Hopefully you can confirm this.

On Tue, Aug 16, 2011 at 9:23 PM, <al...@aim.com> wrote:

> Hello,
>
> After running bin/nutch fetch $segment for 2 days, internet connection was
> lost, but nutch did not give any errors. Usually I was seeing Unknown host
> exception before.
> Any ideas what happened and is it OK to stop the fetch and run it again on
> the same (old) segment? This is nutch -1.2
>
> Thanks.
> Alex.
>



-- 
*Lewis*

fetcher runs without error with no internet connection

Posted by al...@aim.com.
Hello,

After running bin/nutch fetch $segment for 2 days, the internet connection was lost, but nutch did not give any errors. Before, I was usually seeing an unknown host exception.
Any ideas what happened, and is it OK to stop the fetch and run it again on the same (old) segment? This is nutch-1.2.

Thanks.
Alex.

Re: questions about solrwriter

Posted by Cam Bazz <ca...@gmail.com>.
Thank you, that is what I have done.

On Wed, Aug 10, 2011 at 4:06 PM, Markus Jelsma
<ma...@openindex.io> wrote:
> Hmmm, maybe we should just strip the codepoints on all fields. We're already
> doing it on content which is by far the largest field, all other fields are
> tiny compared to this one. If we do it on all String fields then this would
> also fix unknown fields added by custom plugins.
>
> The part your refer to is for the solr field mapping code. Strip codepoints
> before the mapping code or you'll end up with one stripped and one not if you
> use copyFields in here.
>
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>

Re: questions about solrwriter

Posted by Markus Jelsma <ma...@openindex.io>.
Hmmm, maybe we should just strip the codepoints on all fields. We're already
doing it on content, which is by far the largest field; all other fields are
tiny compared to this one. If we do it on all String fields then this would
also fix unknown fields added by custom plugins.

The part you refer to is the Solr field mapping code. Strip codepoints
before the mapping code or you'll end up with one stripped and one not if you
use copyFields in here.
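
The idea of stripping every String value once, before any mapping, can be sketched as below. This is a minimal illustration under stated assumptions, not the actual SolrWriter code: FieldSanitizer is a hypothetical class, a plain Map stands in for NutchDocument, and this stripNonCharCodepoints is a re-implementation written for the example (it drops Unicode noncharacters and control characters other than tab, line feed and carriage return), not the Nutch method itself.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the field loop in write(); no Solr or Nutch classes. */
final class FieldSanitizer {

  /** Drops Unicode noncharacters and control characters except tab, LF and CR. */
  static String stripNonCharCodepoints(String input) {
    StringBuilder out = new StringBuilder(input.length());
    for (int i = 0; i < input.length(); ) {
      int cp = input.codePointAt(i);
      i += Character.charCount(cp);
      // Noncharacters: the last two code points of every plane, plus FDD0..FDEF.
      boolean nonChar = (cp & 0xFFFE) == 0xFFFE || (cp >= 0xFDD0 && cp <= 0xFDEF);
      boolean badControl = cp < 0x20 && cp != '\t' && cp != '\n' && cp != '\r';
      if (!nonChar && !badControl) {
        out.appendCodePoint(cp);
      }
    }
    return out.toString();
  }

  /** Cleans every String value once; non-String values pass through unchanged. */
  static Map<String, List<Object>> sanitize(Map<String, List<Object>> fields) {
    Map<String, List<Object>> clean = new LinkedHashMap<>();
    for (Map.Entry<String, List<Object>> e : fields.entrySet()) {
      List<Object> values = new ArrayList<>();
      for (Object val : e.getValue()) {
        values.add(val instanceof String ? stripNonCharCodepoints((String) val) : val);
      }
      clean.put(e.getKey(), values);
    }
    return clean;
  }

  public static void main(String[] args) {
    Map<String, List<Object>> doc = new LinkedHashMap<>();
    doc.put("e_features", List.<Object>of("ab\uFFFFc\u0000d", 42));
    System.out.println(sanitize(doc));
  }
}
```

Because the cleaned value is computed before anything is added to the document, the mapped field and any copy field would both be fed the same stripped text, which avoids the "one stripped and one not" case.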


On Wednesday 10 August 2011 14:32:28 Cam Bazz wrote:
> Hello,
> 
> From SolrWriter.java:
> 
>   public void write(NutchDocument doc) throws IOException {
> 
>     final SolrInputDocument inputDoc = new SolrInputDocument();
> 
>     for(final Entry<String, NutchField> e : doc) {
>       for (final Object val : e.getValue().getValues()) {
> 
>         // normalise the string representation for a Date
>         Object val2 = val;
> 
>         if (val instanceof Date){
>           val2 = DateUtil.getThreadLocalDateFormat().format(val);
>         }
> 
>         if (e.getKey().equals("content") || e.getKey().equals("e_features")) {
>           if (val != null) {
>             val2 = stripNonCharCodepoints((String) val);
>           }
>         }
> 
>         inputDoc.addField(solrMapping.mapKey(e.getKey()), val2,
>             e.getValue().getWeight());
>         String sCopy = solrMapping.mapCopyKey(e.getKey());
>         if (sCopy != e.getKey()) {
>           inputDoc.addField(sCopy, val);
>         }
> 
>       }
>     }
>     inputDoc.setDocumentBoost(doc.getWeight());
>     inputDocs.add(inputDoc);
>     if (inputDocs.size() >= commitSize) {
>       try {
>         LOG.info("Adding " + Integer.toString(inputDocs.size()) + " documents");
>         solr.add(inputDocs);
>       } catch (final SolrServerException e) {
>         throw makeIOException(e);
>       }
>       inputDocs.clear();
>     }
>   }
> 
> 
> what is happening after inputDoc.addField.... ? I am getting exception
> while indexing e_features, because of UTF8 encoding error. previously
> we patched this problem because of content, and now i have another
> field called e_features, and I wanted to stripNonCharCodepoints from
> that s well, but I dont understand why we are doing the   if (sCopy !=
> e.getKey()) { inputDoc.addField(sCopy, val);}
> 
> 
> Best Regards,
> C.B.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350