Posted to user@nutch.apache.org by "Musshorn, Kris T CTR USARMY RDECOM ARL (US)" <kr...@mail.mil> on 2016/09/09 13:41:49 UTC

RE: [Non-DoD Source] Re: indexing metatags with Nutch 1.12 (UNCLASSIFIED)

CLASSIFICATION: UNCLASSIFIED

Are you suggesting I should remove the index.metadata property completely or just supply no value?

Thanks,
Kris

~~~~~~~~~~~~~~~~~~~~~~~~~~
Kris T. Musshorn
FileMaker Developer - Contractor – Catapult Technology Inc.      
US Army Research Lab 
Aberdeen Proving Ground 
Application Management & Development Branch 
410-278-7251
kris.t.musshorn.ctr@mail.mil
~~~~~~~~~~~~~~~~~~~~~~~~~~

-----Original Message-----
From: BlackIce [mailto:blackice2k4@gmail.com] 
Sent: Friday, September 09, 2016 9:31 AM
To: user@nutch.apache.org
Subject: [Non-DoD Source] Re: indexing metatags with Nutch 1.12

I had a similar problem, and it took me days to figure out. I can't remember exactly what was going on, but it was some sort of conflict between parameters in nutch-site.xml. I think you need to leave this BLANK:

<property>
  <name>
    index.metadata
  </name>
  <value>
    description,keywords
  </value>
</property>


My Set-up (Nutch 1.11):

nutch-site.xml:

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|headings|language-identifier|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

<!-- index-metadata plugin properties -->

<property>
  <name>index.parse.md</name>
  <value>metatag.description,metatag.keywords,h1,h2,h3,h4,h5,h6,metatag.title</value>
</property>



<!-- parse-metatags plugin properties -->

<property>
  <name>metatags.names</name>
  <value>description,keywords,title,h1,h2,h3,h4,h5,h6</value>
</property>
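
For the page in the original question, the same pattern might reduce to something like the sketch below. It is adapted from the setup above and has not been tested on Nutch 1.12. Two things in it are assumptions: the `date` entry is included only because the quoted Solr schema defines a metatag.date field (the posted page source shows no such meta tag), and the headings and language-identifier plugins are left out of plugin.includes purely to keep the example short.

```xml
<!-- Sketch only: values adapted from the setup above, not tested on Nutch 1.12 -->
<configuration>
  <!-- parse-metatags extracts the tags; index-metadata copies them into the index -->
  <property>
    <name>plugin.includes</name>
    <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

  <!-- parse-metatags: names of the meta tags to pull from each page -->
  <property>
    <name>metatags.names</name>
    <value>description,keywords,date</value>
  </property>

  <!-- index-metadata: parse-metadata keys to index, written with the "metatag." prefix -->
  <property>
    <name>index.parse.md</name>
    <value>metatag.description,metatag.keywords,metatag.date</value>
  </property>
</configuration>
```

As in the setup above, each index.parse.md key is a metatags.names entry with a "metatag." prefix, and those prefixed names are the ones the Solr schema fields use.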

On Fri, Sep 9, 2016 at 3:00 PM, BlackIce <bl...@gmail.com> wrote:

> I had a similar problem once.. it was some stupid syntax thing, lemme 
> check my setup....
>
> On Fri, Sep 9, 2016 at 2:46 PM, KRIS MUSSHORN <mu...@comcast.net>
> wrote:
>
>> Looks like this is NOT in fact working.
>>
>> How do I get the metatags into Solr?
>>
>> I have a webpage at
>> https://snip/inside/directorates/cisd/asset.cfm that has this in its source:
>> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
>>   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
>> <html xmlns="http://www.w3.org/1999/xhtml">
>> <head>
>> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
>> <title>Asset Control and Behavior Branch</title>
>> <meta name="keywords" content="Computational and Information Sciences, CISD, Tokarcik, research, data fusion, knowledge management, battlespace weather, environmental effects, computational science and engineering, battlefield communications and networks ">
>> <meta name="description" content="This page explains the CISD mission and hosts the biographies of the CISD Director and Deputy Director.">
>>
>> The parse-metatags plugin is set up in nutch-site.xml as
>> parse-(html|tika|metatags)
>>
>> Solr schema.xml is correctly set up to receive the metatags:
>> <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>>   <analyzer type="index">
>>     <tokenizer class="solr.StandardTokenizerFactory" />
>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false" />
>>     <filter class="solr.LowerCaseFilterFactory" />
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="solr.StandardTokenizerFactory" />
>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
>>     <filter class="solr.LowerCaseFilterFactory" />
>>   </analyzer>
>> </fieldType>
>>
>> <field name="metatag.description" type="text_general" stored="true" indexed="true" default="none" />
>> <field name="metatag.keywords" type="text_general" stored="true" indexed="true" default="none" />
>> <field name="metatag.date" type="text_general" stored="true" indexed="true" default="none" />
>>
>> After indexing the document, Solr shows:
>> "title": "Asset Control and Behavior Branch",
>> "metatag.date": "none",
>> "metatag.description": "none",
>> "metatag.keywords": "none"
>>
>> How do I get a Solr result of:
>> "title": "Asset Control and Behavior Branch",
>> "metatag.date": "none",
>> "metatag.description": "This page explains the CISD mission and hosts the biographies of the CISD Director and Deputy Director.",
>> "metatag.keywords": "Computational and Information Sciences, CISD, Tokarcik, research, data fusion, knowledge management, battlespace weather, environmental effects, computational science and engineering, battlefield communications and networks"
>>
>> Kris
>>
>
>

RE: [Non-DoD Source] Re: indexing metatags with Nutch 1.12 (UNCLASSIFIED)

Posted by BlackIce <bl...@gmail.com>.
I don't have that property in my nutch-site.xml at all.
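
In config terms, the two options from the question would look roughly like the sketch below. The property name and the original value come from the earlier message; nothing in this thread confirms how Nutch treats the empty-value form, so that part is an assumption.

```xml
<configuration>
  <!-- Option 1 (what the reply describes): no index.metadata entry at all.
       The block from the earlier message is simply deleted. -->

  <!-- Option 2 (the alternative asked about): keep the entry, empty the value.
       Whether Nutch treats this the same as option 1 is an untested assumption. -->
  <property>
    <name>index.metadata</name>
    <value></value>
  </property>
</configuration>
```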

On Sep 9, 2016 3:42 PM, "Musshorn, Kris T CTR USARMY RDECOM ARL (US)" <
kris.t.musshorn.ctr@mail.mil> wrote:

> CLASSIFICATION: UNCLASSIFIED
>
> Are you suggesting I should remove the index.metadata property completely
> or just supply no value?
>