Posted to dev@nutch.apache.org by Julien Nioche <li...@gmail.com> on 2010/07/02 12:42:20 UTC

Nutch 2.0 : Design issue

Hi guys,

You've probably seen that there has been some progress on 2.0 lately. We've
updated the nutchbase svn branch with the latest developments done on
Dogacan's GitHub, i.e. using GORA as a storage layer.
One of the main issues [1] I raised after using nutchbase was that:

> NutchBase currently marks entries in the table to be fetched | parsed |
> etc... and needs to go through the whole table at every step. As the table
> gets bigger it takes more and more time to read through the entries and
> check their marks which is not a viable option. NutchBase is currently
> slower than Nutch 1.1 (might be issues with Gora but still...)
> I suggest instead that we create fetchlists in separate tables, fetch &
> parse in these tables then merge the entries back to the main table. The
> segment tables could then be deleted if necessary. We would then have a
> linear processing time for fetching + parsing + updating depending on the
> size of the segments and NOT on the size of the main table. This would be an
> improvement compared to 1.1 where the processing time in the updates is
> relative to the size of the crawldb.

Doing this requires being able to separate the name of a schema from the
name of a table in Gora [2], which should not be a big problem.
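
To make the difference concrete, here is a minimal in-memory sketch in Java. It does not use the Gora API: plain maps stand in for tables, and the `Row` class, the `"fetch"` mark and both method names are made up for illustration. It only shows how many rows each approach has to read.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only: plain maps stand in for Gora-backed tables. */
class BatchTableSketch {

    /** One webtable row; "mark" is null unless the row was selected for a step. */
    static final class Row {
        final String url;
        String mark; // e.g. "fetch"; a made-up marker, not the real nutchbase field

        Row(String url) { this.url = url; }
    }

    /** Mark-based approach: a step reads every row and checks its mark. */
    static int rowsReadByMarkScan(Map<String, Row> webtable, String mark, List<Row> out) {
        int rowsRead = 0;
        for (Row row : webtable.values()) {
            rowsRead++;
            if (mark.equals(row.mark)) {
                out.add(row);
            }
        }
        return rowsRead;
    }

    /** Separate-table approach: copy up to batchSize rows into their own table. */
    static Map<String, Row> generateBatch(Map<String, Row> webtable, int batchSize) {
        Map<String, Row> batch = new TreeMap<>();
        for (Row row : webtable.values()) {
            if (batch.size() >= batchSize) {
                break;
            }
            batch.put(row.url, new Row(row.url));
        }
        return batch;
    }

    public static void main(String[] args) {
        Map<String, Row> webtable = new TreeMap<>();
        for (int i = 0; i < 1000; i++) {
            Row row = new Row("http://example.com/page" + i);
            if (i < 10) {
                row.mark = "fetch";
            }
            webtable.put(row.url, row);
        }

        List<Row> selected = new ArrayList<>();
        int scanned = rowsReadByMarkScan(webtable, "fetch", selected);
        Map<String, Row> batch = generateBatch(webtable, 10);

        System.out.println("mark scan read " + scanned + " rows to find " + selected.size());
        System.out.println("batch table holds " + batch.size() + " rows for fetch and parse");
    }
}
```

In a real setup the generate step would presumably still have to read the main table to pick its rows; the saving would come from fetch and parse reading only the batch table afterwards.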

On second thought, I was wondering whether it would also make sense to
actually keep the segments as they currently are, i.e. stored as
NutchWritables in HDFS. The advantages would be that we'd keep exactly the
same code for fetching and parsing, would only need to modify the generate
and update steps, and would be able to easily port pre-2.0 segments to the
webtable. The drawbacks are that there would be dual storage (GORA / HDFS)
and we'd need to keep the legacy Nutch Writable objects.

Note that this would not change the content of the main webtable or the
operations done on it. Maybe it would make sense to do that anyway, at least
as a transition while we make the webtable and GORA operations stable, and
then see if there is an advantage in storing the segments as GORA tables as
well.

I am pretty confident that we need to address the point raised in [1]
anyway. What do you guys think?

[1] http://github.com/dogacan/nutchbase/issues#issue/8
[2] http://github.com/enis/gora/issues#issue/30

Julien

-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com

Re: Nutch 2.0 : Design issue

Posted by Julien Nioche <li...@gmail.com>.
On 2 July 2010 12:22, Andrzej Bialecki <ab...@getopt.org> wrote:

> On 2010-07-02 12:42, Julien Nioche wrote:
>
>> Hi guys,
>>
>> You've probably seen that there has been some progress on 2.0 lately. We've
>> updated the nutchbase svn branch with the latest developments done on
>> Dogacan's Github i.e. using GORA as a storage layer.
>> One of the main issues [1] I raised after using nutchbase was that:
>>
>>> NutchBase currently marks entries in the table to be fetched | parsed |
>>> etc... and needs to go through the whole table at every step. As the table
>>> gets bigger it takes more and more time to read through the entries and
>>> check their marks which is not a viable option. NutchBase is currently
>>> slower than Nutch 1.1 (might be issues with Gora but still...)
>>> I suggest instead that we create fetchlists in separate tables, fetch &
>>> parse in these tables then merge the entries back to the main table. The
>>> segment tables could then be deleted if necessary. We would then have a
>>> linear processing time for fetching + parsing + updating depending on the
>>> size of the segments and NOT on the size of the main table. This would be an
>>> improvement compared to 1.1 where the processing time in the updates is
>>> relative to the size of the crawldb.
>>>
>> Doing this requires to be able to separate the name of a schema from the
>> name of a table in Gora [2], which should not be a big problem.
>>
>
> I think this is a good idea - this model is conceptually close to the
> current model, and I bet it will be easier to debug problems when changes
> are limited to a separate table... we could create 1 table per segment.
>
> (Oh, and let's stop calling them segments, please - maybe call them a batch
> or "crawl cycle" or something. The name "segments" caused a lot of confusion
> already, and it doesn't convey any useful meaning..)
>

Makes sense


>
> As for the time savings .. this remains to be seen. At the end of the
> fetching/parsing job we need to merge this data back into the main table,
> which is a massive update that also takes time.


True


>
>
>
>> On a second thought I was wondering whether it would also make sense to
>> actually keep the segments as they currently are i.e. stored as
>> NutchWritables in HDFS. The advantages of doing this would be that we'd keep
>> exactly the same code for the fetching + parsing + would only need to modify
>> the generations and update steps + would be able to easily port pre-2.0
>> segments to the webtable. The drawbacks being that there would be a dual
>> storage GORA / HDFS and we'd need to keep the legacy Nutch Writable objects.
>>
>
> The fetcher code is already ported in nutchbase not to use the plain files.
> I doubt there would be many users who want to jump to Nutch 2.0 and still
> want to hold on to their old segments... so I think this is not useful. Dual
> storage .. *shudder* that's asking for trouble.
>

Right, and I am not too keen on keeping the legacy objects. Another advantage
of having the GORA-based tables for the segments (or fetch_cycles ;-) ) is
that it makes it easier to restart an interrupted fetch or parse.

Forget about the HDFS-based storage, let's just do it with GORA.
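
As a rough illustration of the restart point, here is a small Java sketch. It is not Gora or nutchbase code: a `LinkedHashMap` stands in for one batch table, a null value means "not fetched yet", and the `failAfter` parameter is a made-up way to simulate an interruption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only: a map stands in for one batch table. */
class ResumableFetchSketch {

    /**
     * Fetches every row that has no content yet and returns how many were fetched.
     * Stops early once failAfter rows are done, to simulate an interrupted job.
     */
    static int fetchPending(Map<String, String> batch, int failAfter) {
        int fetched = 0;
        for (Map.Entry<String, String> entry : batch.entrySet()) {
            if (entry.getValue() != null) {
                continue; // already fetched by an earlier run
            }
            if (fetched == failAfter) {
                break; // simulated interruption
            }
            entry.setValue("content of " + entry.getKey()); // stand-in for a real fetch
            fetched++;
        }
        return fetched;
    }

    public static void main(String[] args) {
        Map<String, String> batch = new LinkedHashMap<>();
        for (int i = 0; i < 5; i++) {
            batch.put("http://example.com/page" + i, null);
        }

        int firstRun = fetchPending(batch, 2);                  // interrupted
        int secondRun = fetchPending(batch, Integer.MAX_VALUE); // restart

        System.out.println("first run fetched " + firstRun + ", restart fetched " + secondRun);
    }
}
```

Because the state of each row lives in the batch table itself, the restart only has to look at that table to see what is left.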



>
>> Note that it would not change anything to the content of the main webtable
>> nor the operations done on them. Maybe it would make sense to do that anyway
>> at least as a transition while we make the webtable and GORA operations
>> stable and then see if there is an advantage in storing the segments as GORA
>> tables as well.
>>
>> I am pretty confident that we need to address the point raised in [1]
>> anyway. What do you guys think?
>>
>> [1] http://github.com/dogacan/nutchbase/issues#issue/8
>> [2] http://github.com/enis/gora/issues#issue/30
>>
>
> +1 to both points, -1 to the dual storage.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>



Re: Nutch 2.0 : Design issue

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-07-02 12:42, Julien Nioche wrote:
> Hi guys,
>
> You've probably seen that there has been some progress on 2.0 lately. We've
> updated the nutchbase svn branch with the latest developments done on
> Dogacan's Github i.e. using GORA as a storage layer.
> One of the main issues [1] I raised after using nutchbase was that :
>
>> NutchBase currently marks entries in the table to be fetched | parsed |
>> etc... and needs to go through the whole table at every step. As the table
>> gets bigger it takes more and more time to read through the entries and
>> check their marks which is not a viable option. NutchBase is currently
>> slower than Nutch 1.1 (might be issues with Gora but still...)
>> I suggest instead that we create fetchlists in separate tables, fetch &
>> parse in these tables then merge the entries back to the main table. The
>> segment tables could then be deleted if necessary. We would then have a
>> linear processing time for fetching + parsing + updating depending on the
>> size of the segments and NOT on the size of the main table. This would be an
>> improvement compared to 1.1 where the processing time in the updates is
>> relative to the size of the crawldb.
>>
>
> Doing this requires to be able to separate the name of a schema from the
> name of a table in Gora [2], which should not be a big problem.

I think this is a good idea - this model is conceptually close to the 
current model, and I bet it will be easier to debug problems when 
changes are limited to a separate table... we could create 1 table per 
segment.

(Oh, and let's stop calling them segments, please - maybe call them a 
batch or "crawl cycle" or something. The name "segments" caused a lot of 
confusion already, and it doesn't convey any useful meaning..)

As for the time savings .. this remains to be seen. At the end of the 
fetching/parsing job we need to merge this data back into the main 
table, which is a massive update that also takes time.
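
A rough sketch of that merge step, in Java, under the same kind of assumption: plain maps stand in for the main table and one batch table, and `mergeBack` is a made-up name, not a Gora or nutchbase call. The number of writes equals the batch size, but every one of them lands on the main table.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only: maps stand in for the main table and a batch table. */
class MergeBackSketch {

    /** Writes every batch row into the main table and returns the number of writes. */
    static int mergeBack(Map<String, String> webtable, Map<String, String> batch) {
        int writes = 0;
        for (Map.Entry<String, String> entry : batch.entrySet()) {
            webtable.put(entry.getKey(), entry.getValue());
            writes++;
        }
        return writes;
    }

    public static void main(String[] args) {
        Map<String, String> webtable = new HashMap<>();
        for (int i = 0; i < 1000; i++) {
            webtable.put("http://example.com/page" + i, "unfetched");
        }

        Map<String, String> batch = new HashMap<>();
        for (int i = 0; i < 10; i++) {
            batch.put("http://example.com/page" + i, "fetched and parsed");
        }

        int writes = mergeBack(webtable, batch);
        System.out.println(writes + " writes into a table of " + webtable.size() + " rows");
    }
}
```

How expensive those keyed writes are in practice depends on the backing store, which is why the saving still has to be measured.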

>
> On a second thought I was wondering whether it would also make sense to
> actually keep the segments as they currently are i.e. stored as
> NutchWritables in HDFS. The advantages of doing this would be that we'd keep
> exactly the same code for the fetching + parsing + would only need to modify
> the generations and update steps + would be able to easily port pre-2.0
> segments to the webtable. The drawbacks being that there would be a dual
> storage GORA / HDFS and we'd need to keep the legacy Nutch Writable objects.

The fetcher code in nutchbase has already been ported so that it no longer
uses plain files. I doubt there would be many users who want to jump to
Nutch 2.0 and still want to hold on to their old segments... so I think this
is not useful. Dual storage .. *shudder* that's asking for trouble.

>
> Note that it would not change anything to the content of the main webtable
> nor the operations done on them. Maybe it would make sense to do that anyway
> at least as a transition while we make the webtable and GORA operations
> stable and then see if there is an advantage in storing the segments as GORA
> tables as well.
>
> I am pretty confident that we need to address the point raised in [1]
> anyway. What do you guys think?
>
> [1] http://github.com/dogacan/nutchbase/issues#issue/8
> [2] http://github.com/enis/gora/issues#issue/30

+1 to both points, -1 to the dual storage.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com