Posted to dev@nutch.apache.org by Stefan Groschupf <sg...@media-style.com> on 2006/01/30 03:12:52 UTC

where we need meta data?

Hi,

Some thoughts about meta data.
We agree that we should minimize the use of meta data to keep
performance high.
Since we decided to keep meta data separate, I was thinking of a
meta data db, similar to the crawl db we have today.

I am asking myself where we will need meta data, to see whether it
makes sense to keep it separate or not.

My personal list:

+ generation // having meta data here to decide whether a page should be
fetched or not
+ fetching // here I'm not sure whether we need meta data for fetching,
but it might be great to store session or authentication information
that can be used during fetching.
However, during fetching and parsing, meta data for a URL can be created.
+ updating // during updating I was planning to overwrite the old meta
data with the new data. I had the idea to store System.currentTimeMillis()
as a timestamp to identify the newer meta data, but I have no idea
whether the current millis are fast enough for the job. Any thoughts?
+ indexing // to add URL meta data to the index.
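
To make the updating idea concrete, here is a minimal sketch of picking the newer of two meta data copies by a stored millisecond timestamp. The class StampedMeta and its methods are made up for illustration and are not part of Nutch. One thing it shows: two writes within the same millisecond get equal timestamps, so a tie rule is needed. Keeping the incoming copy on a tie is just an assumption here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical timestamped meta data record; NOT an existing Nutch class.
class StampedMeta {
    final long timestamp;              // e.g. from System.currentTimeMillis()
    final Map<String, String> values;

    StampedMeta(long timestamp, Map<String, String> values) {
        this.timestamp = timestamp;
        this.values = new HashMap<>(values); // defensive copy
    }

    // Convenience factory that stamps the record with the current time.
    static StampedMeta now(Map<String, String> values) {
        return new StampedMeta(System.currentTimeMillis(), values);
    }

    // Returns the copy to keep during an update.
    // Assumption: on equal timestamps the incoming copy wins.
    static StampedMeta newer(StampedMeta old, StampedMeta incoming) {
        return incoming.timestamp >= old.timestamp ? incoming : old;
    }
}
```

With this shape the update step would simply call StampedMeta.newer(old, incoming) for each URL and store the result.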

Well, looking at this list, I believe more and more that it would
be a better idea to store the meta data in the CrawlDatum object
directly. It saves a lot of code changes, and we need meta data
everywhere anyway.
I understand that for performance reasons people do not like meta
data, but please let me repeat: we can add meta data to the crawldb in a
way that does not slow down CrawlDatum processing when no meta data
is used.
Also, I believe that meta data support is one of the most important
features for our users, since most users run a small, special-interest
search engine. In any case, getting extra data into the index
is the most frequently asked Nutch customization question on the user list.
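
As a rough illustration of "no slowdown when no meta data is used", here is a minimal sketch. DatumSketch is a hypothetical stand-in, not the real CrawlDatum, and the field layout is assumed. The map is only allocated on the first set, and a datum without meta data pays for a single extra count field when serialized.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for CrawlDatum; NOT the actual Nutch class.
class DatumSketch {
    private final long fetchTime;
    private Map<String, String> metaData; // stays null until the first setMeta()

    DatumSketch(long fetchTime) {
        this.fetchTime = fetchTime;
    }

    long getFetchTime() {
        return fetchTime;
    }

    void setMeta(String key, String value) {
        if (metaData == null) {
            metaData = new HashMap<>(); // allocated only when actually needed
        }
        metaData.put(key, value);
    }

    String getMeta(String key) {
        return metaData == null ? null : metaData.get(key);
    }

    // Fixed fields first, then an entry count, then the entries.
    // Without meta data the only overhead is the count field itself.
    byte[] toBytes() {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeLong(fetchTime);
            out.writeInt(metaData == null ? 0 : metaData.size());
            if (metaData != null) {
                for (Map.Entry<String, String> e : metaData.entrySet()) {
                    out.writeUTF(e.getKey());
                    out.writeUTF(e.getValue());
                }
            }
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static DatumSketch fromBytes(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            DatumSketch d = new DatumSketch(in.readLong());
            int n = in.readInt();
            for (int i = 0; i < n; i++) {
                String key = in.readUTF();   // key is written before value
                String value = in.readUTF();
                d.setMeta(key, value);
            }
            return d;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch a datum without meta data serializes to the fixed fields plus one int, and reading it back never touches the map.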

So why not add meta data directly to CrawlDatum?

Stefan

  

Re: where we need meta data?

Posted by Stefan Groschupf <sg...@media-style.com>.
> Do we need versioning or timestamping of metadata? I can't imagine  
> why... we already store the last fetch time.
If we add meta data directly to the CrawlDatum, we don't need
that at all.

Would people prefer
crawlDatum.getMetaDatum().set(key, value)
or:
crawlDatum.set(key, value)

?
I may already have a basic implementation later today.
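
To compare the two call styles side by side, here is a minimal sketch. Both classes are hypothetical stand-ins, not the actual Nutch code, and String keys and values are an assumption. The first style exposes the container, the second delegates to it, so a class could even offer both.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical meta data container, named after the getMetaDatum() accessor.
class MetaDatumSketch {
    private final Map<String, String> entries = new HashMap<>();

    void set(String key, String value) {
        entries.put(key, value);
    }

    String get(String key) {
        return entries.get(key);
    }
}

// Hypothetical stand-in for CrawlDatum that offers both call styles.
class CrawlDatumSketch {
    private final MetaDatumSketch metaDatum = new MetaDatumSketch();

    // Style 1: crawlDatum.getMetaDatum().set(key, value)
    MetaDatumSketch getMetaDatum() {
        return metaDatum;
    }

    // Style 2: crawlDatum.set(key, value), delegating to the container
    void set(String key, String value) {
        metaDatum.set(key, value);
    }

    String get(String key) {
        return metaDatum.get(key);
    }
}
```

Style 1 keeps the CrawlDatum interface small, while style 2 hides how the meta data is stored. Which one reads better is exactly the question above.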

Stefan


Re: where we need meta data?

Posted by Andrzej Bialecki <ab...@getopt.org>.
Stefan Groschupf wrote:
> Hi,
>
> some thoughts about meta data.
> We agree that we should minimize the use of meta data to keep
> performance high.
> Since we decided to keep meta data separate, I was thinking of a meta
> data db, similar to the crawl db we have today.
>
> I am asking myself where we will need meta data, to see whether it
> makes sense to keep it separate or not.
>
> My personal list:
>
[...]

As you point out, in many cases the additional metadata is needed 
throughout most of the workflow. So, it would make more sense to keep it 
together with CrawlDatum.

> + generation // having meta data here to decide whether a page should be
> fetched or not
> + fetching // here I'm not sure whether we need meta data for fetching,
> but it might be great to store session or authentication information
> that can be used during fetching.

Yes, that's a perfect example. Also, last modification time is required 
to detect modified content.

> However, during fetching and parsing, meta data for a URL can be created.
> + updating // during updating I was planning to overwrite the old meta
> data with the new data. I had the idea to store System.currentTimeMillis()
> as a timestamp to identify the newer meta data, but I have no idea
> whether the current millis are fast enough for the job. Any thoughts?

Do we need versioning or timestamping of metadata? I can't imagine 
why... we already store the last fetch time.

> + indexing // to add URL meta data to the index.
>
> Well, looking at this list, I believe more and more that it would be
> a better idea to store the meta data in the CrawlDatum object
> directly. It saves a lot of code changes, and we need meta data
> everywhere anyway.

[...]

> So why not add meta data directly to CrawlDatum?

I thought it was already decided ;-) . Yes, we need to do just that.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com