Posted to dev@nutch.apache.org by "Otis Gospodnetic (JIRA)" <ji...@apache.org> on 2013/12/23 05:33:50 UTC

[jira] [Commented] (NUTCH-1686) Optimize UpdateDb to load less field from Store

    [ https://issues.apache.org/jira/browse/NUTCH-1686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855370#comment-13855370 ] 

Otis Gospodnetic commented on NUTCH-1686:
-----------------------------------------

{code}
-  private final static Utf8 CASH_KEY = new Utf8("_csh_");
-
+  public static final Utf8 CASH_KEY = new Utf8("c");
{code}

Is this going to cause any backwards compatibility issues by any chance?
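
To make the concern concrete: rows written before the patch would still carry the cash value under "_csh_", so a reader that only looks up "c" would not find it. Below is a minimal sketch of a fallback read, not the patch's actual code. The class name, the readCash helper, the plain String keys (standing in for Utf8) and the 4-byte float encoding are all assumptions made for illustration, and it ignores the fact that the patch also moves the value out of METADATA into a new field.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

public class CashKeyCompat {
    // Key names taken from the diff above; String stands in for Utf8 (assumption).
    static final String OLD_CASH_KEY = "_csh_";
    static final String NEW_CASH_KEY = "c";

    // Hypothetical helper: prefer the new key, fall back to the old one,
    // and treat a missing value as zero cash.
    static float readCash(Map<String, ByteBuffer> metadata) {
        ByteBuffer raw = metadata.get(NEW_CASH_KEY);
        if (raw == null) {
            raw = metadata.get(OLD_CASH_KEY);
        }
        if (raw == null) {
            return 0.0f;
        }
        // Absolute read, so the buffer's position is left untouched.
        // Assumes the value was stored as a 4-byte float.
        return raw.getFloat(raw.position());
    }

    public static void main(String[] args) {
        // A row written before the key was renamed.
        Map<String, ByteBuffer> oldRow = new HashMap<>();
        oldRow.put(OLD_CASH_KEY, ByteBuffer.allocate(4).putFloat(0, 1.5f));
        System.out.println(readCash(oldRow));

        // A row with no cash stored under either key.
        System.out.println(readCash(new HashMap<>()));
    }
}
```

Without a fallback like this (or a one-off migration of existing rows), cash accumulated under the old key would presumably be ignored after an upgrade.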

> Optimize UpdateDb to load less field from Store
> -----------------------------------------------
>
>                 Key: NUTCH-1686
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1686
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 2.3
>            Reporter: Nguyen Manh Tien
>             Fix For: 2.3
>
>         Attachments: NUTCH-1686.patch
>
>
> While running a large crawl I found that updatedb runs very slowly, especially the Map task, which loads data from the store.
> We can't filter by batchId to load fewer URLs because of the bug in NUTCH-1679, so we must always update the whole table.
> After checking the fields loaded in UpdateDbJob I found that it loads many fields from the store (at least 15 of 25), which makes updatedb slow.
> I think UpdateDbJob only needs to load a few fields: SCORE, OUTLINKS, and METADATA, which are used to compute the link score and distance. I think that is the main purpose of this job.
> The other fields are used to compute the URL schedule for the parser and fetcher. We can move that code to the Parser or Fetcher without loading many new fields, because many of those fields are generated by the parser. We can also use a Gora filter for the Fetcher or Parser, so loading new fields is not a problem.
> I also added a new field, SCOREMETA, to WebPage to store CASH and DISTANCE, which are currently stored in METADATA. CASH is used in OPICScoring, which is used only in UpdateDb, and DISTANCE is used only in the Generator and Updater, so moving both to a new metadata field avoids reading METADATA in the Generator and Updater. METADATA contains a lot of data that is used only by the Parser and Indexer.
> So with the new change:
> UpdateDb only loads SCORE, SCOREMETA (CASH, DISTANCE), OUTLINKS, and MARKERS; we don't need to load the big Fetch family or INLINKS.
> Generator only loads SCOREMETA (which is smaller than the current METADATA).