
Posted to user@nutch.apache.org by KRIS MUSSHORN <mu...@comcast.net> on 2016/10/04 14:52:43 UTC

parsing issue - content and title fields combined

Nutch 1.12 
Solr 5.4.1 

I have a simple web page that I am crawling with Nutch (attached).

Nutch detects it as application/xhtml+xml, according to the document type definition.

In parse-plugins.xml I am explicitly telling Nutch to use parse-html:

<mimeType name="application/xhtml+xml"> 
        <plugin id="parse-html" />
        <!-- <plugin id="parse-tika" /> -->
</mimeType> 
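
A quick way to double-check which parser a mime type is mapped to is to load the mapping and print it. The sketch below is only an illustration in Python and is not part of Nutch: the function name plugins_by_mime_type is made up, and the fragment is wrapped in a <parse-plugins> root element on the assumption that this matches the layout of conf/parse-plugins.xml.

```python
import xml.etree.ElementTree as ET

# Fragment from this message, wrapped in an assumed <parse-plugins> root element.
PARSE_PLUGINS_FRAGMENT = """
<parse-plugins>
  <mimeType name="application/xhtml+xml">
    <plugin id="parse-html" />
    <!-- <plugin id="parse-tika" /> -->
  </mimeType>
</parse-plugins>
"""


def plugins_by_mime_type(xml_text):
    """Map each mimeType name to the list of plugin ids enabled for it.

    Commented-out <plugin> entries are skipped, because the default
    ElementTree parser discards XML comments.
    """
    root = ET.fromstring(xml_text)
    mapping = {}
    for mime in root.iter("mimeType"):
        mapping[mime.get("name")] = [p.get("id") for p in mime.findall("plugin")]
    return mapping


if __name__ == "__main__":
    for mime_type, plugin_ids in plugins_by_mime_type(PARSE_PLUGINS_FRAGMENT).items():
        print(mime_type, "->", plugin_ids)
```

If parse-tika shows up in the list for application/xhtml+xml, the comment markers around it are not where they appear to be.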

I am using parse-(html|tika|metatags) to extract the description, keywords, and date into Solr.

This all works fine, except that the content field in Solr contains both the title and the body text.

I want just the body text in the content field.

The Solr schema.xml does NOT copy any other field into content.

Solr schema.xml defines content as: 

<field name="content" type="text" indexed="true" stored="true" termVectors="true"/> 

I have attached the Nutch dump; its parseText:: section shows both the title and the body.

How do I get the result I need?

I have also tried parse-tika with boilerpipe (the default, article, and canola extractors) instead of parse-html, but parsing with Tika does not produce the desired result either.

In fact, parsing with Tika produces duplicate entries in the metadata fields.

TIA for any assistance.

Re: parsing issue - content and title fields combined

Posted by Comcast <mu...@comcast.net>.

I was not complaining.

> On Oct 4, 2016, at 2:29 PM, Markus Jelsma <ma...@openindex.io> wrote:
> 
> That doesn't mean a thing. If you need it, patch the sources and compile it yourself.
> Markus

RE: parsing issue - content and title fields combined

Posted by Markus Jelsma <ma...@openindex.io>.

That doesn't mean a thing. If you need it, patch the sources and compile it yourself.
Markus
 
-----Original message-----
> From:KRIS MUSSHORN <mu...@comcast.net>
> Sent: Tuesday 4th October 2016 18:51
> To: user@nutch.apache.org
> Subject: Re: parsing issue - content and title fields combined
> 
> this is slated for fix in v1.13. 
> Great. 
> K 

Re: parsing issue - content and title fields combined

Posted by KRIS MUSSHORN <mu...@comcast.net>.

This is slated to be fixed in v1.13.
Great. 
K 

----- Original Message -----

From: "Markus Jelsma" <ma...@openindex.io> 
To: user@nutch.apache.org 
Sent: Tuesday, October 4, 2016 12:34:33 PM 
Subject: RE: parsing issue - content and title fields combined 

Hi - this is a known and open issue, but it has a patch: 
https://issues.apache.org/jira/browse/NUTCH-1749

RE: parsing issue - content and title fields combined

Posted by Markus Jelsma <ma...@openindex.io>.

Hi - this is a known and open issue, but it has a patch:
https://issues.apache.org/jira/browse/NUTCH-1749
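
Until a patched build is in place, one possible stopgap is to strip the leading title from the parsed text in a post-processing step before the document reaches Solr. The Python sketch below is only an illustration of that idea, not the approach taken by the patch: the function name strip_leading_title is hypothetical, and it assumes the title appears at the very start of the text, which should be checked against the parseText:: section of the dump.

```python
def strip_leading_title(title, content):
    """Return content without a leading copy of title.

    Runs of whitespace in both strings are collapsed to single spaces
    first, so the returned text is whitespace-normalized as a side effect.
    The title is removed only when it is followed by a word boundary, so a
    title such as "My" does not eat into a body that starts with "Mystery".
    """
    norm_title = " ".join(title.split())
    norm_content = " ".join(content.split())
    if not norm_title:
        return norm_content
    if norm_content == norm_title:
        return ""
    if norm_content.startswith(norm_title + " "):
        return norm_content[len(norm_title) + 1:]
    return norm_content


if __name__ == "__main__":
    # Made-up sample values, standing in for the title and parseText of a page.
    print(strip_leading_title("Sample Page", "Sample Page\nBody text of the page."))
```

Text that does not start with the title is returned unchanged apart from the whitespace normalization, so the step is safe to run on documents where the title was never prepended.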

 
 
