Posted to user@nutch.apache.org by Tanguy Moal <ta...@gmail.com> on 2013/05/30 11:01:12 UTC

Altering webpage ?

Dear list,
I'd like to store additional data in the webpage rows (something like all distinct anchor texts for each inlink).

I'm wondering what's the best way to do this, and would appreciate any suggestion before digging in the wrong direction.

From what I understand, I have at least two options, maybe more:

1/ Write a custom job that iterates over the webpage db and writes the desired output to a dedicated db for my own use. It would work much like the updatedb command, but produce a second db as a result, which seems to lower the risk of messing up Nutch's internals. The only downsides I see are the need to run yet another full pass over the db to populate the required fields, and the duplication of information in another db.

2/ Extend the webpage structure with my additions, and extend the DbUpdaterJob with customized DbUpdateMapper and DbUpdateReducer classes that do what I want with the inlinks' anchor texts. I see several downsides to this choice, which may stem from my misunderstanding of how Nutch works. In practice, I don't think I can simply extend the WebPage storage class; I'd have to copy and modify it, because of the way Gora persists things.

3/ A better idea?

I run Nutch 2.1 and rely on HBase for storage.

Thanks in advance for your insights.

Tanguy

Re: Altering webpage ?

Posted by Tanguy Moal <ta...@gmail.com>.
Thank you Lewis for completing Ferdy's answer.

I admit that it simplifies the problem in some ways.

I'll stick to the metadata option for now as it's satisfying so far.

BTW, do you think data already stored with the original WebPage storage
class could be read with an alternative CustomWebPage if I only add new
fields *after* the existing, unmodified ones? (I use HBase as the storage
backend.)

Thanks for your advice

Tanguy


Re: Altering webpage ?

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi,
Just a heads-up: in the event that you do need to add to the nested Metadata
structure within WebPage.avsc, you can simply write your changes and
use the ant 'generate-gora-src' target from the build script. The
GoraCompiler will then compile everything in /src/gora to the path you
specify, along with whichever license header you specify (ASLv2 by default
now).

-- 
*Lewis*

Re: Altering webpage ?

Posted by Tanguy Moal <ta...@gmail.com>.
Hi Ferdy,

Thank you for the fast response. I'll try what you suggested and come back here if I face another issue ;-)

--
Tanguy


Re: Altering webpage ?

Posted by Ferdy Galema <fe...@kalooga.com>.
Hi,

I would certainly not extend WebPage, since that would require a lot of work.
(Simple Java subclassing won't work because it is a generated class; you'd
have to modify the Avro schema, regenerate the classes, and so on.) Putting
the data in a separate table is also not great, because of the duplication
and separation of the data.

In my opinion the best way to add extra data is to use the Metadata field,
since it is a freeform map already provided with WebPage. If you have
several pieces of data, you can prefix the keys to indicate what data it
is. You can write a separate Job (or extend/modify DbUpdaterJob) to work on
this data.
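
A minimal sketch of the prefixed-key idea, not taken from Nutch itself: a
plain java.util.Map stands in for the WebPage metadata map so the example
runs without Nutch or Gora on the classpath. The class name
PrefixedMetadata, the "anchor_" prefix, and both helper methods are
hypothetical. The values are stored as ByteBuffer on the assumption that the
metadata map holds Avro bytes; check the generated WebPage class in your
version for the exact key and value types.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Nutch. A plain Map stands in for the
// WebPage metadata map (assumption: values are raw bytes).
class PrefixedMetadata {
    // Assumed prefix marking the entries that belong to one custom job.
    static final String ANCHOR_PREFIX = "anchor_";

    // Store one UTF-8 encoded value under a prefixed key.
    static void put(Map<String, ByteBuffer> metadata, String prefix,
                    String name, String value) {
        metadata.put(prefix + name,
                ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8)));
    }

    // Collect the decoded values of every entry whose key starts with the prefix.
    static Set<String> valuesFor(Map<String, ByteBuffer> metadata, String prefix) {
        Set<String> values = new TreeSet<>();
        for (Map.Entry<String, ByteBuffer> entry : metadata.entrySet()) {
            if (entry.getKey().startsWith(prefix)) {
                // duplicate() leaves the stored buffer's position untouched.
                ByteBuffer copy = entry.getValue().duplicate();
                values.add(StandardCharsets.UTF_8.decode(copy).toString());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        Map<String, ByteBuffer> metadata = new HashMap<>();
        put(metadata, ANCHOR_PREFIX, "0", "Apache Nutch");
        put(metadata, ANCHOR_PREFIX, "1", "crawler");
        put(metadata, "other_", "0", "unrelated"); // belongs to some other feature
        System.out.println(valuesFor(metadata, ANCHOR_PREFIX));
    }
}
```

Keeping one entry per value under a shared prefix lets a separate job read
or rewrite only its own entries and leave the rest of the metadata alone.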


-- 
*Ferdy Galema*
Kalooga Development

-- 

*Kalooga* | Visual Relevance
Check out our Visual Gallery Layer now!
<http://www.independent.co.uk/arts-entertainment/music/news/david-cameron-gets-teenage-kicks-starring-in-one-direction-music-video-8499282.html#!kalooga-10369/%22One%20Direction%22>

Kalooga

Helperpark 288
9723 ZA Groningen
The Netherlands
+31 50 2103400

www.kalooga.com
info@kalooga.com

Kalooga EMEA

53 Davies Street
W1K 5JH London
United Kingdom
+44 20 7129 1430

Kalooga Spain and LatAM

Maria de Sevilla Diago No 3
28022 Madrid - Madrid
Spain
+34 670 580 872