Posted to user@nutch.apache.org by Michael Coffey <mc...@yahoo.com.INVALID> on 2017/10/17 23:38:09 UTC
addBinaryContent and string length must be a multiple of four
I think I have an instance of the known bug https://issues.apache.org/jira/browse/NUTCH-2186
I need to keep raw html in my Solr index (or somewhere) so that an external tool can access it and parse it. So, I added addBinaryContent and base64 to my indexing command. On the very first segment, I get a bunch of failures with messages that say "String length must be a multiple of four." The same is true if I omit the base64 argument.
Is there a workaround or fix for this issue? I am using Nutch 1.12 and Solr 5.4.1.
Re: addBinaryContent and string length must be a multiple of four
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Michael,
I tried to reproduce the problem with the current Nutch master and Solr 6.6.0
but could not: indexing the binary content succeeded.
- that's the case for two of the URLs you sent
- those from buzz.money.cnn.com are blocked somehow (fetching failed)
Building Nutch isn't difficult:
git clone http://github.com/apache/nutch.git
cd nutch
ant
You'll find the Nutch runtime in runtime/local/ or runtime/deploy/ (for usage on Hadoop).
The tutorial
https://wiki.apache.org/nutch/NutchTutorial
should already be up to date on how to use recent
Solr versions.
Best,
Sebastian
{
"responseHeader":{
"status":0,
"QTime":2,
"params":{
"q":"id:http\\://cnnfn.cnn.com/2017/03/07/investing/carl-icahn-betting-against-trump-rally/index.html",
"indent":"on",
"wt":"json",
"_":"1508829081797"}},
"response":{"numFound":1,"start":0,"docs":[
{
"date":"2017-10-24T07:01:05.593Z",
"author":"Matt Egan",
"title":"Trump adviser Carl Icahn bets against the Trump rally - Mar. 7, 2017",
"type":["application/xhtml+xml",
"application",
"xhtml+xml"],
"url":"http://cnnfn.cnn.com/2017/03/07/investing/carl-icahn-betting-against-trump-rally/index.html",
"content":"Trump adviser Carl Icahn bets against the Trump rally - Mar. 7, ...",
"tstamp":"2017-10-24T07:01:05.593Z",
"segment":"20171024090054",
"digest":"cff265f11bd74bd104f3c6e1c7185484",
"boost":1.0,
"id":"http://cnnfn.cnn.com/2017/03/07/investing/carl-icahn-betting-against-trump-rally/index.html",
"_version_":1582121409782480896,
"binaryContent":"+IDxzY3JpcHQgdHlwZT0idGV4dC9qYXZhc2NyaXB0Ij4gdmFyIHVybFByZT0iaHR0cDovL21hcmtld..."}]
}}
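The "String length must be a multiple of four" and "Illegal character" messages read like base64 decoding failures: well-formed, padded base64 text always has a length divisible by four and uses a restricted alphabet. Below is a minimal Python sketch of that check, which an external tool could also use to decode a complete binaryContent value. The helper name decode_binary_content and the sample HTML are hypothetical and are not part of Nutch or Solr.

```python
import base64
import binascii


def decode_binary_content(value):
    """Return the decoded bytes, or None if value is not well-formed base64."""
    # Padded base64 always has a length that is a multiple of four.
    if len(value) % 4 != 0:
        return None
    try:
        # validate=True rejects characters outside the base64 alphabet.
        return base64.b64decode(value, validate=True)
    except binascii.Error:
        return None


# Hypothetical sample standing in for a page's raw HTML.
raw_html = b"<html><body>hello</body></html>"
encoded = base64.b64encode(raw_html).decode("ascii")
```

A value that returns None here would presumably be rejected by a base64-backed field as well, so running the check on the indexing side may help narrow down which documents trigger the error.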
On 10/24/2017 01:07 AM, Michael Coffey wrote:
> http://cnnfn.cnn.com/2017/03/07/investing/carl-icahn-betting-against-trump-rally/index.html
>
>
> http://buzz.money.cnn.com/author/ctymkiw/
>
> http://abcnews.go.com/GMA/video/rose-mcgowan-dropped-agent-calling-sexist-casting-note-32047448
>
> http://buzz.money.cnn.com/tag/investing/
>
> Meanwhile, the following URL also gets an "error adding field" message but with "msg=Illegal character" instead of "String length must be a multiple of four". Don't know if it's related.
>
> http://buzz.money.cnn.com/author/byheatherlong/
Re: addBinaryContent and string length must be a multiple of four
Posted by Michael Coffey <mc...@yahoo.com.INVALID>.
Thanks for the reply!
I'm not sure of the best way to illustrate the issue, as I struggle with Solr log management within Docker. However, here are a few URLs that have exhibited the problem. In each case, Solr complains "Error adding field 'binaryContent'" ... "msg=String length must be a multiple of four"
http://cnnfn.cnn.com/2017/03/07/investing/carl-icahn-betting-against-trump-rally/index.html
http://buzz.money.cnn.com/author/ctymkiw/
http://abcnews.go.com/GMA/video/rose-mcgowan-dropped-agent-calling-sexist-casting-note-32047448
http://buzz.money.cnn.com/tag/investing/
Meanwhile, the following URL also gets an "error adding field" message but with "msg=Illegal character" instead of "String length must be a multiple of four". Don't know if it's related.
http://buzz.money.cnn.com/author/byheatherlong/
All tests done with Nutch 1.12, Solr 5.4.1.
BTW, I wouldn't mind updating Nutch and Solr. What is your recommended most-stable combination of versions? I am using Hadoop 2.7.3 (from Hortonworks).
At one point, Lewis John McG reported on such an issue in https://issues.apache.org/jira/browse/NUTCH-2186
Re: addBinaryContent and string length must be a multiple of four
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Michael,
Can you share more information about your Nutch and Solr versions, and at least one document,
to make the problem reproducible? It doesn't look like a general problem: at least,
I'm not able to reproduce it. Indexing with -addBinaryContent -base64 succeeds (recent
Nutch snapshot / master, Solr 6.6.0).
Thanks,
Sebastian
On 10/20/2017 06:46 PM, Michael Coffey wrote:
> I guess there is no solution or workaround for the addBinaryContent bug, so I have to write code to read directly from segment data. If not writing Java, I guess I have to do readseg-dump and then parse the output text file.
>
>
> -- original message --
> I think I have an instance of the known bug https://issues.apache.org/jira/browse/NUTCH-2186
>
> I need to keep raw html in my Solr index (or somewhere) so that an external tool can access it and parse it. So, I added addBinaryContent and base64 to my indexing command. On the very first segment, I get a bunch of failures with messages that say "String length must be a multiple of four." The same is true if I omit the base64 argument.
>
> Is there a workaround or fix for this issue? I am using Nutch 1.12 and Solr 5.4.1.
>
Re: addBinaryContent and string length must be a multiple of four
Posted by Michael Coffey <mc...@yahoo.com.INVALID>.
I guess there is no solution or workaround for the addBinaryContent bug, so I have to write code to read directly from segment data. If not writing Java, I guess I have to run readseg -dump and then parse the output text file.
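A rough Python sketch of the "parse the output text file" route is below. It rests on an assumption about the dump layout that should be checked against a real dump: that each record begins with a line starting "Recno:: " followed by a line starting "URL:: ". The function name split_dump_records is hypothetical.

```python
def split_dump_records(text):
    """Split dump text into (url, body) pairs, one per record.

    Assumes (unverified) that each record starts with a "Recno:: " line
    followed by a "URL:: " line; adjust the markers to match a real dump.
    """
    records = []
    url = None
    body = []
    for line in text.splitlines():
        if line.startswith("Recno:: "):
            # A new record begins; store the previous one, if any.
            if url is not None:
                records.append((url, "\n".join(body)))
            url, body = None, []
        elif line.startswith("URL:: ") and url is None:
            url = line[len("URL:: "):]
        elif url is not None:
            # Everything after the URL line belongs to the current record.
            body.append(line)
    if url is not None:
        records.append((url, "\n".join(body)))
    return records
```

Page content that itself contains a line starting with "Recno:: " would confuse this splitter, so reading the segment data directly is likely the more robust option.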
-- original message --
I think I have an instance of the known bug https://issues.apache.org/jira/browse/NUTCH-2186
I need to keep raw html in my Solr index (or somewhere) so that an external tool can access it and parse it. So, I added addBinaryContent and base64 to my indexing command. On the very first segment, I get a bunch of failures with messages that say "String length must be a multiple of four." The same is true if I omit the base64 argument.
Is there a workaround or fix for this issue? I am using Nutch 1.12 and Solr 5.4.1.