Posted to user@nutch.apache.org by KRIS MUSSHORN <mu...@comcast.net> on 2016/09/30 13:25:17 UTC

control order of operations

I've got http.content.limit set to 32765 in nutch-site.xml (1 short of the Solr max).

I also have parser.html.whitelist set to ignore a bunch of irrelevant tags.

Can I configure Nutch so that the whitelist is applied before truncation?

Kris 
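
For reference, a minimal sketch of how the setting described above would look in nutch-site.xml. The value is the one from this message; the description text is paraphrased, not copied from nutch-default.xml. As far as I know, http.content.limit is enforced by the protocol plugin while the page is being downloaded, so the raw content is already cut before any parser setting gets a chance to run.

```xml
<!-- nutch-site.xml fragment (sketch; description wording is an assumption) -->
<property>
  <name>http.content.limit</name>
  <value>32765</value>
  <description>Maximum number of bytes downloaded per document over HTTP.
  Longer content is truncated at fetch time, before parsing.</description>
</property>
```

If that reading is right, a limit this small can cut a page in the middle of its markup, and no parser-side filter can recover what was never downloaded.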

RE: control order of operations

Posted by Kris Musshorn <mu...@comcast.net>.
BlackIce suggested using an existing plugin or building a custom one.
That’s not going to work for me.
Are there any other solutions?

Why does parser.html.whitelist not work?

Kris

-----Original Message-----
From: KRIS MUSSHORN [mailto:musshorns@comcast.net] 
Sent: Friday, September 30, 2016 1:36 PM
To: user@nutch.apache.org
Subject: Re: control order of operations

Ok basic knowledge deficit. 

Looks like parser.html.whitelist settings will not prevent sections of the html from being indexed into solr. 

How can i accomplish my goal of preventing header, footer and a few divs from being indexed into the content field of solr? 

Kris 


----- Original Message -----

From: "KRIS MUSSHORN" <mu...@comcast.net> 
To: user@nutch.apache.org 
Sent: Friday, September 30, 2016 11:54:40 AM 
Subject: Re: control order of operations 

would a better option be to use this property? 

indexer.max.content.length = 32765 

----- Original Message ----- 

From: "KRIS MUSSHORN" <mu...@comcast.net> 
To: user@nutch.apache.org 
Sent: Friday, September 30, 2016 9:25:17 AM 
Subject: control order of operations 

I've got nutch-site.xml set to http.content.limit = 32765 ( 1 short of solr max ). 

I also have parser.html.whitelist set to ignore a bunch of irrelevant tags. 

Can I set nutch so that whitelist applies before truncation? 

Kris 




RE: control order of operations

Posted by Kris Musshorn <mu...@comcast.net>.
Any other options for this issue?

-----Original Message-----
From: BlackIce [mailto:blackice2k4@gmail.com] 
Sent: Saturday, October 1, 2016 2:11 AM
To: user@nutch.apache.org
Subject: RE: control order of operations

Then make your own :)

On Sep 30, 2016 11:13 PM, "Kris Musshorn" <mu...@comcast.net> wrote:

> Thanks blackice but I cant use a plug in that’s not been maintained in 
> a year in my production environment
>
> -----Original Message-----
> From: BlackIce [mailto:blackice2k4@gmail.com]
> Sent: Friday, September 30, 2016 2:42 PM
> To: user@nutch.apache.org
> Subject: Re: control order of operations
>
> Try these, don't remember which I used and don't have access to my 
> setup right now (there used to be a whitelist/blacklist plugin, but I 
> don't seem to be able to find it on Google right now)
>
> https://github.com/BayanGroup/nutch-custom-search
>
> On Sep 30, 2016 7:35 PM, "KRIS MUSSHORN" <mu...@comcast.net> wrote:
>
> Ok basic knowledge deficit.
>
> Looks like parser.html.whitelist settings will not prevent sections of 
> the html from being indexed into solr.
>
> How can i accomplish my goal of preventing header, footer and a few 
> divs from being indexed into the content field of solr?
>
> Kris
>
>
> ----- Original Message -----
>
> From: "KRIS MUSSHORN" <mu...@comcast.net>
> To: user@nutch.apache.org
> Sent: Friday, September 30, 2016 11:54:40 AM
> Subject: Re: control order of operations
>
> would a better option be to use this property?
>
> indexer.max.content.length = 32765
>
> ----- Original Message -----
>
> From: "KRIS MUSSHORN" <mu...@comcast.net>
> To: user@nutch.apache.org
> Sent: Friday, September 30, 2016 9:25:17 AM
> Subject: control order of operations
>
> I've got nutch-site.xml set to http.content.limit = 32765 ( 1 short of 
> solr max ).
>
> I also have parser.html.whitelist set to ignore a bunch of irrelevant tags.
>
> Can I set nutch so that whitelist applies before truncation?
>
> Kris
>
>


Re: control order of operations

Posted by Comcast <mu...@comcast.net>.
Someday

Sent from my iPhone

> On Oct 1, 2016, at 2:11 AM, BlackIce <bl...@gmail.com> wrote:
> 
> Then make your own :)
> 
>> On Sep 30, 2016 11:13 PM, "Kris Musshorn" <mu...@comcast.net> wrote:
>> 
>> Thanks blackice but I cant use a plug in that’s not been maintained in a
>> year in my production environment
>> 
>> -----Original Message-----
>> From: BlackIce [mailto:blackice2k4@gmail.com]
>> Sent: Friday, September 30, 2016 2:42 PM
>> To: user@nutch.apache.org
>> Subject: Re: control order of operations
>> 
>> Try these, don't remember which I used and don't have access to my setup
>> right now (there used to be a whitelist/blacklist plugin, but I don't seem
>> to be able to find it on Google right now)
>> 
>> https://github.com/BayanGroup/nutch-custom-search
>> 
>> On Sep 30, 2016 7:35 PM, "KRIS MUSSHORN" <mu...@comcast.net> wrote:
>> 
>> Ok basic knowledge deficit.
>> 
>> Looks like parser.html.whitelist settings will not prevent sections of the
>> html from being indexed into solr.
>> 
>> How can i accomplish my goal of preventing header, footer and a few divs
>> from being indexed into the content field of solr?
>> 
>> Kris
>> 
>> 
>> ----- Original Message -----
>> 
>> From: "KRIS MUSSHORN" <mu...@comcast.net>
>> To: user@nutch.apache.org
>> Sent: Friday, September 30, 2016 11:54:40 AM
>> Subject: Re: control order of operations
>> 
>> would a better option be to use this property?
>> 
>> indexer.max.content.length = 32765
>> 
>> ----- Original Message -----
>> 
>> From: "KRIS MUSSHORN" <mu...@comcast.net>
>> To: user@nutch.apache.org
>> Sent: Friday, September 30, 2016 9:25:17 AM
>> Subject: control order of operations
>> 
>> I've got nutch-site.xml set to http.content.limit = 32765 ( 1 short of
>> solr max ).
>> 
>> I also have parser.html.whitelist set to ignore a bunch of irrelevant tags.
>> 
>> Can I set nutch so that whitelist applies before truncation?
>> 
>> Kris
>> 
>> 


RE: control order of operations

Posted by BlackIce <bl...@gmail.com>.
Then make your own :)

On Sep 30, 2016 11:13 PM, "Kris Musshorn" <mu...@comcast.net> wrote:

> Thanks blackice but I cant use a plug in that’s not been maintained in a
> year in my production environment
>
> -----Original Message-----
> From: BlackIce [mailto:blackice2k4@gmail.com]
> Sent: Friday, September 30, 2016 2:42 PM
> To: user@nutch.apache.org
> Subject: Re: control order of operations
>
> Try these, don't remember which I used and don't have access to my setup
> right now (there used to be a whitelist/blacklist plugin, but I don't seem
> to be able to find it on Google right now)
>
> https://github.com/BayanGroup/nutch-custom-search
>
> On Sep 30, 2016 7:35 PM, "KRIS MUSSHORN" <mu...@comcast.net> wrote:
>
> Ok basic knowledge deficit.
>
> Looks like parser.html.whitelist settings will not prevent sections of the
> html from being indexed into solr.
>
> How can i accomplish my goal of preventing header, footer and a few divs
> from being indexed into the content field of solr?
>
> Kris
>
>
> ----- Original Message -----
>
> From: "KRIS MUSSHORN" <mu...@comcast.net>
> To: user@nutch.apache.org
> Sent: Friday, September 30, 2016 11:54:40 AM
> Subject: Re: control order of operations
>
> would a better option be to use this property?
>
> indexer.max.content.length = 32765
>
> ----- Original Message -----
>
> From: "KRIS MUSSHORN" <mu...@comcast.net>
> To: user@nutch.apache.org
> Sent: Friday, September 30, 2016 9:25:17 AM
> Subject: control order of operations
>
> I've got nutch-site.xml set to http.content.limit = 32765 ( 1 short of
> solr max ).
>
> I also have parser.html.whitelist set to ignore a bunch of irrelevant tags.
>
> Can I set nutch so that whitelist applies before truncation?
>
> Kris
>
>

RE: control order of operations

Posted by Kris Musshorn <mu...@comcast.net>.
Thanks, BlackIce, but I can’t use a plugin that hasn’t been maintained in a year in my production environment.

-----Original Message-----
From: BlackIce [mailto:blackice2k4@gmail.com] 
Sent: Friday, September 30, 2016 2:42 PM
To: user@nutch.apache.org
Subject: Re: control order of operations

Try these, don't remember which I used and don't have access to my setup right now (there used to be a whitelist/blacklist plugin, but I don't seem to be able to find it on Google right now)

https://github.com/BayanGroup/nutch-custom-search

On Sep 30, 2016 7:35 PM, "KRIS MUSSHORN" <mu...@comcast.net> wrote:

Ok basic knowledge deficit.

Looks like parser.html.whitelist settings will not prevent sections of the html from being indexed into solr.

How can i accomplish my goal of preventing header, footer and a few divs from being indexed into the content field of solr?

Kris


----- Original Message -----

From: "KRIS MUSSHORN" <mu...@comcast.net>
To: user@nutch.apache.org
Sent: Friday, September 30, 2016 11:54:40 AM
Subject: Re: control order of operations

would a better option be to use this property?

indexer.max.content.length = 32765

----- Original Message -----

From: "KRIS MUSSHORN" <mu...@comcast.net>
To: user@nutch.apache.org
Sent: Friday, September 30, 2016 9:25:17 AM
Subject: control order of operations

I've got nutch-site.xml set to http.content.limit = 32765 ( 1 short of solr max ).

I also have parser.html.whitelist set to ignore a bunch of irrelevant tags.

Can I set nutch so that whitelist applies before truncation?

Kris


Re: control order of operations

Posted by BlackIce <bl...@gmail.com>.
Try these. I don't remember which one I used, and I don't have access to my
setup right now (there used to be a whitelist/blacklist plugin, but I can't
seem to find it on Google right now).

https://github.com/BayanGroup/nutch-custom-search

On Sep 30, 2016 7:35 PM, "KRIS MUSSHORN" <mu...@comcast.net> wrote:

Ok basic knowledge deficit.

Looks like parser.html.whitelist settings will not prevent sections of the
html from being indexed into solr.

How can i accomplish my goal of preventing header, footer and a few divs
from being indexed into the content field of solr?

Kris


----- Original Message -----

From: "KRIS MUSSHORN" <mu...@comcast.net>
To: user@nutch.apache.org
Sent: Friday, September 30, 2016 11:54:40 AM
Subject: Re: control order of operations

would a better option be to use this property?

indexer.max.content.length = 32765

----- Original Message -----

From: "KRIS MUSSHORN" <mu...@comcast.net>
To: user@nutch.apache.org
Sent: Friday, September 30, 2016 9:25:17 AM
Subject: control order of operations

I've got nutch-site.xml set to http.content.limit = 32765 ( 1 short of solr
max ).

I also have parser.html.whitelist set to ignore a bunch of irrelevant tags.

Can I set nutch so that whitelist applies before truncation?

Kris

Re: control order of operations

Posted by KRIS MUSSHORN <mu...@comcast.net>.
OK, this is a basic knowledge deficit on my part.

It looks like the parser.html.whitelist settings will not prevent sections of the HTML from being indexed into Solr.

How can I accomplish my goal of preventing the header, the footer and a few divs from being indexed into the content field of Solr?

Kris 


----- Original Message -----

From: "KRIS MUSSHORN" <mu...@comcast.net> 
To: user@nutch.apache.org 
Sent: Friday, September 30, 2016 11:54:40 AM 
Subject: Re: control order of operations 

would a better option be to use this property? 

indexer.max.content.length = 32765 

----- Original Message ----- 

From: "KRIS MUSSHORN" <mu...@comcast.net> 
To: user@nutch.apache.org 
Sent: Friday, September 30, 2016 9:25:17 AM 
Subject: control order of operations 

I've got nutch-site.xml set to http.content.limit = 32765 ( 1 short of solr max ). 

I also have parser.html.whitelist set to ignore a bunch of irrelevant tags. 

Can I set nutch so that whitelist applies before truncation? 

Kris 



Re: control order of operations

Posted by KRIS MUSSHORN <mu...@comcast.net>.
Thanks, Markus.
----- Original Message -----

From: "Markus Jelsma" <ma...@openindex.io> 
To: user@nutch.apache.org 
Sent: Tuesday, October 4, 2016 7:01:57 AM 
Subject: RE: control order of operations 

Hello - this is not Solr's maximum for a field at all. But it is Java's maximum for String. Just don't use string when indexing. 
Markus 

-----Original message----- 
> From:KRIS MUSSHORN <mu...@comcast.net> 
> Sent: Friday 30th September 2016 17:54 
> To: user@nutch.apache.org 
> Subject: Re: control order of operations 
> 
> would a better option be to use this property? 
> 
> indexer.max.content.length = 32765 
> 
> ----- Original Message ----- 
> 
> From: "KRIS MUSSHORN" <mu...@comcast.net> 
> To: user@nutch.apache.org 
> Sent: Friday, September 30, 2016 9:25:17 AM 
> Subject: control order of operations 
> 
> I've got nutch-site.xml set to http.content.limit = 32765 ( 1 short of solr max ). 
> 
> I also have parser.html.whitelist set to ignore a bunch of irrelevant tags. 
> 
> Can I set nutch so that whitelist applies before truncation? 
> 
> Kris 
> 
> 


RE: control order of operations

Posted by Markus Jelsma <ma...@openindex.io>.
Hello - this is not Solr's maximum for a field at all. But it is Java's maximum for String. Just don't use the string field type when indexing.
Markus 
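
To illustrate the suggestion above, here is a hedged sketch of a Solr schema entry that uses a tokenized text type for the content field instead of `string`. The field name `content` comes from this thread; the type name `text_general` is an assumption based on the stock Solr example schemas and may be named differently in your own schema.

```xml
<!-- Solr schema fragment (sketch). "text_general" is an assumed type name. -->
<!-- Instead of: <field name="content" type="string" indexed="true" stored="true"/> -->
<field name="content" type="text_general" indexed="true" stored="true"/>
```

A string field indexes the whole value as a single term, which is where a per-term length cap applies. A tokenized type splits the text into many short terms, so the overall length of the page text stops being the limiting factor.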
 
-----Original message-----
> From:KRIS MUSSHORN <mu...@comcast.net>
> Sent: Friday 30th September 2016 17:54
> To: user@nutch.apache.org
> Subject: Re: control order of operations
> 
> would a better option be to use this property? 
> 
> indexer.max.content.length = 32765 
> 
> ----- Original Message -----
> 
> From: "KRIS MUSSHORN" <mu...@comcast.net> 
> To: user@nutch.apache.org 
> Sent: Friday, September 30, 2016 9:25:17 AM 
> Subject: control order of operations 
> 
> I've got nutch-site.xml set to http.content.limit = 32765 ( 1 short of solr max ). 
> 
> I also have parser.html.whitelist set to ignore a bunch of irrelevant tags. 
> 
> Can I set nutch so that whitelist applies before truncation? 
> 
> Kris 
> 
> 

Re: control order of operations

Posted by KRIS MUSSHORN <mu...@comcast.net>.
Would a better option be to use this property?

indexer.max.content.length = 32765 
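
A sketch of what this would look like in nutch-site.xml, assuming the index-basic plugin is enabled (to my knowledge it is the plugin that reads this property). Unlike http.content.limit, this cap should apply to the already parsed text at indexing time, counted in characters rather than bytes of raw HTML.

```xml
<!-- nutch-site.xml fragment (sketch; assumes the index-basic plugin is active) -->
<property>
  <name>indexer.max.content.length</name>
  <value>32765</value>
</property>
```

Under that assumption the full page is still fetched and parsed, and only the indexed content field is shortened, so unwanted markup does not count against the limit.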

----- Original Message -----

From: "KRIS MUSSHORN" <mu...@comcast.net> 
To: user@nutch.apache.org 
Sent: Friday, September 30, 2016 9:25:17 AM 
Subject: control order of operations 

I've got nutch-site.xml set to http.content.limit = 32765 ( 1 short of solr max ). 

I also have parser.html.whitelist set to ignore a bunch of irrelevant tags. 

Can I set nutch so that whitelist applies before truncation? 

Kris