Posted to user@nutch.apache.org by Kieran Munday <k....@slcyber.io> on 2021/06/01 12:37:19 UTC

Re: Adding html field to NutchDocument

Hi Sebastian,

Thank you for your response. It was a great help.
I didn't realise that it is intended for users to edit the bin/crawl file.
Although looking at it now it's clear.

This makes it easier for me to access the html content within my plugin,
thanks again

On Fri, May 28, 2021 at 8:36 PM Sebastian Nagel
<wa...@googlemail.com.invalid> wrote:

> Hi Kieran,
>
> see the command-line options
>
>          -addBinaryContent
>            index raw/binary content in field `binaryContent`
>          -base64
>             use Base64 encoding for binary content
>
> of the Nutch index job [1]. Note that the content may indeed be
> binary, e.g. for PDF documents, but also for HTML pages that use
> an encoding other than UTF-8.
>
> Best,
> Sebastian
>
> [1]
> https://wiki.apache.org/confluence/pages/viewpage.action?pageId=122916842
>
>
> On 5/28/21 5:28 PM, Kieran Munday wrote:
> > Hi users@,
> >
> > I am new to Nutch (v.1.17) and my current project requires the indexing
> of
> > the html of crawled pages. It also requires fields that can be derived
> from
> > the raw html such as image count, and charset.
> >
> > I have looked on StackOverflow for how to achieve this and most people
> from
> > my understanding seem to be recommending processing the segments to
> extract
> > the html and modify the documents post-crawl. This doesn't fit my use
> case
> > as I need to calculate these fields at crawl time before they are indexed
> > into Elasticsearch.
> >
> > The other recommendations I have seen mention creating a plugin to
> override
> > the parse-html plugin. However, I have found rather limited documentation
> > on how to do this correctly and am not sure on how to return from the
> > plugin in a way that the field propagates into the NutchDocument which
> will
> > be processed in the indexer's write method.
> >
> > Do any of you have any advice or links to documentation that explains how
> > to modify what gets set in the NutchDocument?
> >
> > Thank you in advance
> >
>
>

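[Editor's note: the `-addBinaryContent`/`-base64` options above store the raw page bytes Base64-encoded in the `binaryContent` field. As a rough illustration using only the JDK (no Nutch classes; the sample value below is fabricated, real values come from the indexed document), the field can be decoded back to bytes and then interpreted with the page's declared charset:]

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Sketch: turning a Base64-encoded binaryContent field value back into text.
// The charset must come from elsewhere (HTTP headers, meta tag), since the
// whole point of -base64 is that the bytes may not be UTF-8.
public class DecodeBinaryContent {
    static String decode(String base64Field, Charset charset) {
        byte[] raw = Base64.getDecoder().decode(base64Field);
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Fabricated field value, standing in for what the indexer would store.
        String field = Base64.getEncoder().encodeToString(
            "<html>caf\u00e9</html>".getBytes(Charset.forName("ISO-8859-1")));
        System.out.println(decode(field, Charset.forName("ISO-8859-1")));
    }
}
```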
Re: Adding html field to NutchDocument

Posted by Sebastian Nagel <wa...@googlemail.com.INVALID>.
Hi Kieran,

thanks for the feedback!

 > I didn't realise that it is intended for users to edit the bin/crawl file.

Maybe we should add a comment to encourage users to adapt the shell scripts
to their needs.  Almost 10 years ago, the Java "Crawl" class was replaced
by the scripts because a shell script is easy to modify and deploy, see
   https://issues.apache.org/jira/browse/NUTCH-1087

Best,
Sebastian


On 6/1/21 2:37 PM, Kieran Munday wrote:
> Hi Sebastian,
> 
> Thank you for your response. It was a great help.
> I didn't realise that it is intended for users to edit the bin/crawl file.
> Although looking at it now it's clear.
> 
> This makes it easier for me to access the html content within my plugin,
> thanks again
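[Editor's note: returning to the original question, the derived fields themselves (image count, charset) need nothing beyond the JDK. A minimal standalone sketch follows; the class and method names are illustrative, not Nutch APIs. In Nutch 1.x, values like these would typically be computed in an indexing-filter plugin and added to the NutchDocument there, so that they reach the indexer at crawl time:]

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helpers deriving simple fields from raw HTML: the number of
// <img> tags and the charset declared in a meta tag (with a fallback).
public class HtmlFields {
    private static final Pattern IMG =
        Pattern.compile("<img\\b", Pattern.CASE_INSENSITIVE);
    private static final Pattern META_CHARSET =
        Pattern.compile("charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    static int imageCount(String html) {
        Matcher m = IMG.matcher(html);
        int n = 0;
        while (m.find()) n++;
        return n;
    }

    static Charset declaredCharset(String html, Charset fallback) {
        Matcher m = META_CHARSET.matcher(html);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (Exception e) {
                // Unknown or illegal charset name: use the fallback.
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        String html = "<html><head><meta charset=\"iso-8859-1\"></head>"
                    + "<body><img src=a.png><IMG src=b.png></body></html>";
        System.out.println(imageCount(html));                              // 2
        System.out.println(declaredCharset(html, StandardCharsets.UTF_8)); // ISO-8859-1
    }
}
```

[A real parse would be more robust than these regexes (e.g. a proper HTML parser), but the sketch shows that the field values can be computed at crawl time, before indexing into Elasticsearch.]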