Posted to dev@nutch.apache.org by "Lewis John McGibbney (JIRA)" <ji...@apache.org> on 2014/05/01 07:55:15 UTC

[jira] [Commented] (NUTCH-1714) Nutch 2.x upgrade to use GORA_94 branch

    [ https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986370#comment-13986370 ] 

Lewis John McGibbney commented on NUTCH-1714:
---------------------------------------------

Hey [~jnioche], thanks for looking into the patch. Answers below.
bq. •There is no progression of the complete status of mappers: they go from 0% to 100% for the tasks taking the input from GORA, i.e. not the injection
Honestly, I have no idea here... we need to find out WTF is wrong 
bq. •The whole content of the webtable seems to be taken as input for mapreduce. I assumed it wouldn't be the case for GORA-119 and that the fetch step for instance would get only the entries marked by the Generator. There is NUTCH-1674 but this should only add the batchID to the filters according to its title.
OK, so I wonder whether this patch _just_ upgrades to use 0.4, or whether it upgrades to 0.4 _and_ upgrades to use the new *filter* API [0]. I think the former is the case. I need to look into the patch... which unfortunately I cannot do right now :( If so, then we need to open a separate issue and upgrade to use the filter API as well. This should not be difficult, as we know which tools use the existing Query API.
bq. •./nutch readdb -crawlId MYCRAWLIDHERE -stats gets 0 docs but I can see the corresponding table in HBase.
OK, so when we read the XML mappings (e.g. gora-hbase-mapping.xml) and *initialize* a Gora datastore, the table is created regardless of whether any data is written or read. Are you expecting to see Records? Or are you just surprised that the table is there but contains no Records?

[0] https://svn.apache.org/repos/asf/gora/trunk/gora-core/src/main/java/org/apache/gora/filter/
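
To make the filter question above concrete, here is a rough sketch of why it matters whether the batch ID check runs inside the store or in the mapper. It is plain Java and deliberately does not use the Gora filter or Query API: the Row and Store types, the batchIdEquals helper and the sample URLs are all hypothetical stand-ins, so treat it as an illustration of the idea rather than what the patch does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class BatchFilterSketch {

    /** Hypothetical stand-in for a webtable row; not a Nutch or Gora class. */
    static final class Row {
        final String url;
        final String batchId; // null if the Generator has not marked this row

        Row(String url, String batchId) {
            this.url = url;
            this.batchId = batchId;
        }
    }

    /** Hypothetical in-memory store that counts the rows it hands to a job. */
    static final class Store {
        private final List<Row> rows = new ArrayList<>();
        int rowsDelivered = 0;

        void put(Row row) {
            rows.add(row);
        }

        /** The filter is evaluated inside the store, so only matches are delivered. */
        List<Row> scan(Predicate<Row> filter) {
            List<Row> out = new ArrayList<>();
            for (Row r : rows) {
                if (filter.test(r)) {
                    out.add(r);
                    rowsDelivered++;
                }
            }
            return out;
        }
    }

    /** Hypothetical helper: matches rows marked with the given batch ID. */
    static Predicate<Row> batchIdEquals(String batchId) {
        return r -> batchId.equals(r.batchId);
    }

    /** Four sample rows, two of which belong to "batch-1". */
    static Store sampleStore() {
        Store store = new Store();
        store.put(new Row("http://a.example/", "batch-1"));
        store.put(new Row("http://b.example/", "batch-1"));
        store.put(new Row("http://c.example/", "batch-2"));
        store.put(new Row("http://d.example/", null));
        return store;
    }

    public static void main(String[] args) {
        // No store-side filter: every row becomes job input, and the mapper
        // has to throw away the ones that are not in the batch.
        Store unfiltered = sampleStore();
        long kept = unfiltered.scan(r -> true).stream()
                .filter(batchIdEquals("batch-1"))
                .count();
        System.out.println("mapper-side: delivered=" + unfiltered.rowsDelivered
                + " kept=" + kept);

        // Store-side filter: only the rows in the batch are delivered at all.
        Store filtered = sampleStore();
        int matched = filtered.scan(batchIdEquals("batch-1")).size();
        System.out.println("store-side: delivered=" + filtered.rowsDelivered
                + " kept=" + matched);
    }
}
```

Both runs keep the same rows; the difference is only in how many rows are delivered as input, which is the behaviour being asked about for the fetch step.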

> Nutch 2.x upgrade to use GORA_94 branch
> ---------------------------------------
>
>                 Key: NUTCH-1714
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1714
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Alparslan Avcı
>            Assignee: Alparslan Avcı
>         Attachments: NUTCH-1714.patch, NUTCH-1714_NUTCH-1714_v2_v3.patch, NUTCH-1714v2.patch, NUTCH-1714v4.patch
>
>
> Nutch upgrade for GORA_94 branch has to be implemented. We can discuss the details in this issue.


