Posted to user@nutch.apache.org by KRIS MUSSHORN <mu...@comcast.net> on 2016/10/04 14:52:43 UTC
parsing issue - content and title fields combined
Nutch 1.12
Solr 5.4.1
I have a simple webpage that I am crawling with Nutch (attached).
Nutch detects it as application/xhtml+xml, according to the doctype declaration.
In parse-plugins.xml I explicitly tell Nutch to use parse-html:
<mimeType name="application/xhtml+xml">
<plugin id="parse-html" />
<!-- <plugin id="parse-tika" /> -->
</mimeType>
I am using parse-(html|tika|metatags) to extract the description, keywords, and date into Solr.
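For readers following along, a setup like the one described is usually expressed in nutch-site.xml. The fragment below is only a sketch under assumptions: the poster's actual configuration is not shown in this thread, the plugin list is a typical one rather than theirs, and the date meta tag is left out because its name is not given.

```xml
<!-- Hypothetical nutch-site.xml fragment; the values are illustrative assumptions. -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- parse-metatags extracts meta tags; index-metadata copies them into the index -->
    <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <property>
    <name>metatags.names</name>
    <!-- meta tag names to pull out of each page -->
    <value>description,keywords</value>
  </property>
  <property>
    <name>index.parse.md</name>
    <!-- parse metadata keys to send to the indexer, prefixed with "metatag." -->
    <value>metatag.description,metatag.keywords</value>
  </property>
</configuration>
```

Each indexed key also needs a matching field in the Solr schema.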
This all works fine, except that the content field in Solr contains both the title and the body text.
I want just the body text in the content field.
The Solr schema.xml does NOT perform any kind of copy into content.
The Solr schema.xml defines content as:
<field name="content" type="text" indexed="true" stored="true" termVectors="true"/>
I have attached the Nutch dump; its parseText:: section shows both the title and the body.
How do I get the result I need?
I have tried parse-tika with boilerpipe (default, article, and canola) instead of parse-html, but parsing with Tika does not produce the desired result either.
In fact, parsing with Tika produces duplicate entries in the metadata fields.
TIA for any assistance.
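For reference, the boilerpipe attempt mentioned above is normally switched on through two Tika-related properties in nutch-site.xml. This is a sketch of what that configuration might look like, not the poster's actual file. It assumes application/xhtml+xml is mapped to parse-tika in parse-plugins.xml, and the algorithm value shown is just one of the three variants that were tried.

```xml
<!-- Hypothetical nutch-site.xml fragment for the parse-tika + boilerpipe attempt. -->
<configuration>
  <property>
    <name>tika.extractor</name>
    <!-- "none" is the default; "boilerpipe" enables boilerplate removal -->
    <value>boilerpipe</value>
  </property>
  <property>
    <name>tika.extractor.boilerpipe.algorithm</name>
    <!-- one of DefaultExtractor, ArticleExtractor, CanolaExtractor -->
    <value>ArticleExtractor</value>
  </property>
</configuration>
```

As reported above, this route did not give the desired result and also duplicated the metadata entries.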
Re: parsing issue - content and title fields combined
Posted by Comcast <mu...@comcast.net>.
I was not complaining.
RE: parsing issue - content and title fields combined
Posted by Markus Jelsma <ma...@openindex.io>.
That doesn't mean a thing. If you need it, patch the sources and compile it yourself.
Markus
Re: parsing issue - content and title fields combined
Posted by KRIS MUSSHORN <mu...@comcast.net>.
This is slated to be fixed in v1.13.
Great.
K
RE: parsing issue - content and title fields combined
Posted by Markus Jelsma <ma...@openindex.io>.
Hi - this is a known and open issue, but it has a patch:
https://issues.apache.org/jira/browse/NUTCH-1749