Posted to user@nutch.apache.org by Michael Coffey <mc...@yahoo.com.INVALID> on 2017/10/17 23:38:09 UTC

addBinaryContent and string length must be a multiple of four

I think I have an instance of the known bug https://issues.apache.org/jira/browse/NUTCH-2186
I need to keep raw html in my Solr index (or somewhere) so that an external tool can access it and parse it. So, I added addBinaryContent and base64 to my indexing command. On the very first segment, I get a bunch of failures with messages that say "String length must be a multiple of four." The same is true if I omit the base64 argument.
Is there a workaround or fix for this issue? I am using Nutch 1.12 and Solr 5.4.1.
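
For anyone debugging the same message: padded base64 text always has a length that is a multiple of four, so the error suggests that the value arriving in the binaryContent field is not well-formed base64. The following is a minimal Python sketch, not Nutch or Solr code, for checking a suspect value offline. The helper name check_base64 and the sample HTML are made up for illustration.

```python
import base64
import binascii


def check_base64(value: str) -> str:
    """Return 'ok', or a short reason why value is not strictly padded base64."""
    # Padded base64 encodes every 3 input bytes as 4 characters,
    # so a well-formed string always has a length divisible by four.
    if len(value) % 4 != 0:
        return "length is not a multiple of four"
    try:
        # validate=True rejects characters outside the base64 alphabet
        # instead of silently skipping them.
        base64.b64decode(value, validate=True)
    except binascii.Error as exc:
        return f"invalid base64: {exc}"
    return "ok"


# Hypothetical sample standing in for a fetched page.
raw_html = b"<html><body>hello</body></html>"
encoded = base64.b64encode(raw_html).decode("ascii")

print(check_base64(encoded))       # well-formed value
print(check_base64(encoded[:-1]))  # truncated value, wrong length
print(check_base64("ab!d"))        # right length, illegal character
```

The third call is there because a value can have a valid length and still contain characters outside the base64 alphabet, which may be what the "Illegal character" variant of the error points at.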

Re: addBinaryContent and string length must be a multiple of four

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Michael,

I tried to reproduce the problem with the current Nutch master and Solr 6.6.0
but could not: indexing the binary content succeeded.
- that's the case for two of the URLs you sent
- those from buzz.money.cnn.com are blocked somehow (fetching failed)

Building Nutch isn't difficult:
 git clone http://github.com/apache/nutch.git
 cd nutch
 ant
You'll find the Nutch runtime in runtime/local/ or runtime/deploy/ (for usage on Hadoop).

The tutorial
  https://wiki.apache.org/nutch/NutchTutorial
should already be up to date on how to use recent
Solr versions.


Best,
Sebastian



{
  "responseHeader":{
    "status":0,
    "QTime":2,
    "params":{
      "q":"id:http\\://cnnfn.cnn.com/2017/03/07/investing/carl-icahn-betting-against-trump-rally/index.html",
      "indent":"on",
      "wt":"json",
      "_":"1508829081797"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "date":"2017-10-24T07:01:05.593Z",
        "author":"Matt Egan",
        "title":"Trump adviser Carl Icahn bets against the Trump rally - Mar. 7, 2017",
        "type":["application/xhtml+xml",
          "application",
          "xhtml+xml"],
        "url":"http://cnnfn.cnn.com/2017/03/07/investing/carl-icahn-betting-against-trump-rally/index.html",
        "content":"Trump adviser Carl Icahn bets against the Trump rally - Mar. 7, ...",
        "tstamp":"2017-10-24T07:01:05.593Z",
        "segment":"20171024090054",
        "digest":"cff265f11bd74bd104f3c6e1c7185484",
        "boost":1.0,
        "id":"http://cnnfn.cnn.com/2017/03/07/investing/carl-icahn-betting-against-trump-rally/index.html",
        "_version_":1582121409782480896,
        "binaryContent":"+IDxzY3JpcHQgdHlwZT0idGV4dC9qYXZhc2NyaXB0Ij4gdmFyIHVybFByZT0iaHR0cDovL21hcmtld..."}]
  }}
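
Once a document is indexed like this, an external tool can recover the raw bytes by base64-decoding the binaryContent field of the JSON response. Below is a small Python sketch of that step. It builds a synthetic response with the same shape as the one above rather than querying a live Solr, and the helper name decode_binary_content, the example id, and the sample HTML are all hypothetical.

```python
import base64
import json


def decode_binary_content(doc: dict, field: str = "binaryContent") -> bytes:
    """Decode the base64 text stored in one Solr document back to raw bytes."""
    return base64.b64decode(doc[field])


# Hypothetical page content; a real response would come from a Solr query.
sample_html = b'<script type="text/javascript">var x = 1;</script>'

# Synthetic response text mimicking the structure of a Solr JSON reply.
response_text = json.dumps({
    "response": {
        "numFound": 1,
        "start": 0,
        "docs": [{
            "id": "http://example.com/",
            "binaryContent": base64.b64encode(sample_html).decode("ascii"),
        }],
    }
})

docs = json.loads(response_text)["response"]["docs"]
html = decode_binary_content(docs[0])
print(html.decode("utf-8"))
```

The decoded bytes are whatever was fetched, so the character encoding for the final decode step depends on the page and is assumed to be UTF-8 here.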


On 10/24/2017 01:07 AM, Michael Coffey wrote:
> http://cnnfn.cnn.com/2017/03/07/investing/carl-icahn-betting-against-trump-rally/index.html
> 
> 
> http://buzz.money.cnn.com/author/ctymkiw/
> 
> http://abcnews.go.com/GMA/video/rose-mcgowan-dropped-agent-calling-sexist-casting-note-32047448
> 
> http://buzz.money.cnn.com/tag/investing/
> 
> Meanwhile, the following URL also gets an "error adding field" message but with "msg=Illegal character" instead of "String length must be a multiple of four". Don't know if it's related.
> 
> http://buzz.money.cnn.com/author/byheatherlong/

Re: addBinaryContent and string length must be a multiple of four

Posted by Michael Coffey <mc...@yahoo.com.INVALID>.

Thanks for the reply!

I'm not sure of the best way to illustrate the issue, as I struggle with Solr log management within Docker. However, here are a few URLs that have exhibited the problem. In each case, Solr complains "Error adding field 'binaryContent'" ... "msg=String length must be a multiple of four"


http://cnnfn.cnn.com/2017/03/07/investing/carl-icahn-betting-against-trump-rally/index.html


http://buzz.money.cnn.com/author/ctymkiw/

http://abcnews.go.com/GMA/video/rose-mcgowan-dropped-agent-calling-sexist-casting-note-32047448

http://buzz.money.cnn.com/tag/investing/

Meanwhile, the following URL also gets an "error adding field" message but with "msg=Illegal character" instead of "String length must be a multiple of four". Don't know if it's related.

http://buzz.money.cnn.com/author/byheatherlong/


All tests done with Nutch 1.12, Solr 5.4.1.

BTW, I wouldn't mind updating Nutch and Solr. What is your recommended most-stable combination of versions? I am using Hadoop 2.7.3 (from Hortonworks).


At one point, Lewis John McG reported on such an issue in https://issues.apache.org/jira/browse/NUTCH-2186

Re: addBinaryContent and string length must be a multiple of four

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Michael,

can you share more information about your Nutch and Solr versions, and at least one document
that makes the problem reproducible? It doesn't look like a general problem: at least,
I'm not able to reproduce it. Indexing with -addBinaryContent -base64 succeeds (recent
Nutch snapshot / master, Solr 6.6.0).

Thanks,
Sebastian

On 10/20/2017 06:46 PM, Michael Coffey wrote:
> I guess there is no solution or workaround for the addBinaryContent bug, so I have to write code to read directly from segment data. If not writing Java, I guess I have to do readseg-dump and then parse the output text file.
> 
> 
> -- original message --
> I think I have an instance of the known bug https://issues.apache.org/jira/browse/NUTCH-2186
> 
> I need to keep raw html in my Solr index (or somewhere) so that an external tool can access it and parse it. So, I added addBinaryContent and base64 to my indexing command. On the very first segment, I get a bunch of failures with messages that say "String length must be a multiple of four." The same is true if I omit the base64 argument.
> 
> Is there a workaround or fix for this issue? I am using Nutch 1.12 and Solr 5.4.1.
>

Re: addBinaryContent and string length must be a multiple of four

Posted by Michael Coffey <mc...@yahoo.com.INVALID>.

I guess there is no solution or workaround for the addBinaryContent bug, so I have to write code to read directly from segment data. If not writing Java, I guess I have to do readseg-dump and then parse the output text file.


-- original message --
I think I have an instance of the known bug https://issues.apache.org/jira/browse/NUTCH-2186

I need to keep raw html in my Solr index (or somewhere) so that an external tool can access it and parse it. So, I added addBinaryContent and base64 to my indexing command. On the very first segment, I get a bunch of failures with messages that say "String length must be a multiple of four." The same is true if I omit the base64 argument.

Is there a workaround or fix for this issue? I am using Nutch 1.12 and Solr 5.4.1.