Posted to dev@nutch.apache.org by "Aaron Cosand (JIRA)" <ji...@apache.org> on 2016/03/23 15:26:25 UTC

[jira] [Commented] (NUTCH-2230) Nutch doesn't index all URLs found

    [ https://issues.apache.org/jira/browse/NUTCH-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208481#comment-15208481 ] 

Aaron Cosand commented on NUTCH-2230:
-------------------------------------

The MongoDB implementation of Gora assumes that data will be received in sorted order by the _id (primary key) field. On versions of MongoDB using the MMAPv1 storage engine this assumption holds, but with WiredTiger (and presumably other storage engines) it does not. While the best fix is a correction to the Gora MongoDB implementation, the modification below to org.apache.nutch.storage.StorageUtils should cause current versions of MongoDB to process the query using an index scan, so that the order of the data matches the assumption Gora makes. The single insertion is 'query.setStartKey("");'. The Gora MongoDB implementation converts this into {_id:{$gte:""}}, which yields all records in the collection, in sorted order.

  public static <K, V> void initMapperJob(Job job,
      Collection<WebPage.Field> fields, Class<K> outKeyClass,
      Class<V> outValueClass,
      Class<? extends GoraMapper<String, WebPage, K, V>> mapperClass,
      Class<? extends Partitioner<K, V>> partitionerClass,
      Filter<String, WebPage> filter, boolean reuseObjects)
      throws ClassNotFoundException, IOException {
    DataStore<String, WebPage> store = createWebStore(job.getConfiguration(),
        String.class, WebPage.class);
    if (store == null)
      throw new RuntimeException("Could not create datastore");
    Query<String, WebPage> query = store.newQuery();
    query.setFields(toStringArray(fields));
    if (filter != null) {
      query.setFilter(filter);
    }
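    // Inserted line: an explicit (empty) start key makes the Mongo query filter
    // on _id, so results come back from an _id index scan, in sorted order.
    query.setStartKey("");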
    GoraMapper.initMapperJob(job, query, store, outKeyClass, outValueClass,
        mapperClass, partitionerClass, reuseObjects);
    GoraOutputFormat.setOutput(job, store, true);
  }
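
To illustrate why the ordering matters, here is a toy model of the failure, not Gora's actual split logic. It assumes (hypothetically) that range boundaries are chosen by reading every Nth key from the initial cursor, and that each later query then scans the half-open range between two consecutive boundaries. The class and method names are made up for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class SplitOrderSketch {

    /** Takes every chunkSize-th key from the cursor as a range start, trusting cursor order. */
    static List<String> boundaries(List<String> cursorOrder, int chunkSize) {
        List<String> starts = new ArrayList<>();
        for (int i = 0; i < cursorOrder.size(); i += chunkSize) {
            starts.add(cursorOrder.get(i));
        }
        return starts;
    }

    /** Runs one range query [start, nextStart) per boundary; the last range is open-ended. */
    static TreeSet<String> scan(List<String> allKeys, List<String> starts) {
        TreeSet<String> seen = new TreeSet<>();
        for (int i = 0; i < starts.size(); i++) {
            String start = starts.get(i);
            String end = (i + 1 < starts.size()) ? starts.get(i + 1) : null;
            for (String key : allKeys) {
                boolean afterStart = key.compareTo(start) >= 0;
                boolean beforeEnd = (end == null) || key.compareTo(end) < 0;
                if (afterStart && beforeEnd) {
                    seen.add(key);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // Same six keys, once in _id order and once in an arbitrary storage order.
        List<String> sorted = Arrays.asList("a", "b", "c", "d", "e", "f");
        List<String> unsorted = Arrays.asList("d", "a", "e", "b", "c", "f");

        System.out.println("sorted cursor:   "
            + scan(sorted, boundaries(sorted, 2)).size() + " of 6 keys visited");
        System.out.println("unsorted cursor: "
            + scan(unsorted, boundaries(unsorted, 2)).size() + " of 6 keys visited");
    }
}
```

With the sorted cursor the boundaries are a, c, e and all six keys are visited. With the unsorted cursor the boundaries are d, e, c, so the ranges are [d, e), [e, c) and [c, ...): the middle range is empty and keys a and b are never visited, which matches the symptom of URLs being skipped.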


> Nutch doesn't index all URLs found
> ----------------------------------
>
>                 Key: NUTCH-2230
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2230
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 2.3.1
>         Environment: MongoDB with WiredTiger storage engine (3.2 but probably affects other versions as well)
>            Reporter: Aaron Cosand
>
> The initial query run by the generator task against MongoDB doesn't force ordering by _id. This causes an incorrect selection of ranges for the successive map-reduce related queries. The successive queries do appear to run in the correct order, since _id is always indexed, but they should also explicitly specify a sort, since a particular order is not guaranteed otherwise. I didn't dig deep enough to see whether the root of the problem is in Nutch or Gora, or whether it only affects MongoDB or could affect other databases as well.


