Posted to dev@nutch.apache.org by "Brian (JIRA)" <ji...@apache.org> on 2014/12/10 17:30:15 UTC

[jira] [Updated] (NUTCH-1896) SolrDeleteDuplicates does not use the mapped Solr field names from solrindex-mapping.xml

     [ https://issues.apache.org/jira/browse/NUTCH-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brian updated NUTCH-1896:
-------------------------
    Description: 
SolrDeleteDuplicates uses the field names hard-coded in SolrConstants.java to fetch all the fields (id, content, etc.) from Solr when deleting duplicates.

However, this ignores the mappings specified in solrindex-mapping.xml: those fields may have been mapped to different Solr field names at index time.

For example:
At index time, "id" is mapped to "asset_id".
At dedup time, SolrDeleteDuplicates asks Solr for "id" and fails, because no field with that name exists in the Solr schema.
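
For reference, a mapping in conf/solrindex-mapping.xml that triggers this would look roughly like the sketch below ("asset_id" is just the illustrative rename from the example above):

{code:xml|borderStyle=solid}
<mapping>
  <fields>
    <!-- rename Nutch's internal "id" field to "asset_id" in Solr -->
    <field dest="asset_id" source="id"/>
    <field dest="content" source="content"/>
    <field dest="title" source="title"/>
  </fields>
  <uniqueKey>asset_id</uniqueKey>
</mapping>
{code}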

SolrDeleteDuplicates should use the same mappings defined for indexing; otherwise it cannot be used with any setup that renames the internal Nutch fields involved in deduplication.

The way I fixed it was to instantiate SolrMappingReader during initialization and store the mapped field names in the Hadoop configuration, e.g.:

{code:java|borderStyle=solid}
  public void setSolrFieldMappings() throws IOException {
    // Read the index-time field mappings from solrindex-mapping.xml
    SolrMappingReader solrMapping = SolrMappingReader.getInstance(getConf());

    // Store the mapped Solr name for each internal Nutch field, so the
    // dedup job queries the names that actually exist in the Solr index.
    getConf().set(SolrConstants.ID_FIELD,
        solrMapping.mapKey(SolrConstants.ID_FIELD_DEFAULT));
    getConf().set(SolrConstants.BOOST_FIELD,
        solrMapping.mapKey(SolrConstants.BOOST_FIELD_DEFAULT));
    getConf().set(SolrConstants.TIMESTAMP_FIELD,
        solrMapping.mapKey(SolrConstants.TIMESTAMP_FIELD_DEFAULT));
    getConf().set(SolrConstants.TITLE_FIELD,
        solrMapping.mapKey(SolrConstants.TITLE_FIELD_DEFAULT));
    getConf().set(SolrConstants.CONTENT_FIELD,
        solrMapping.mapKey(SolrConstants.CONTENT_FIELD_DEFAULT));
  }
{code}
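
Note this approach assumes each constant in SolrConstants.java is split into a configuration key plus a default (Nutch-internal) field name. Only the ID_FIELD/ID_FIELD_DEFAULT naming comes from the snippet above; the key string below is my assumption, sketched for illustration:

{code:java|borderStyle=solid}
  // Sketch only - the patch above implies pairs of constants like these.
  // The "solr.index.id.field" key string is an assumption, not a real value.
  public static final String ID_FIELD = "solr.index.id.field"; // conf key holding the mapped name
  public static final String ID_FIELD_DEFAULT = "id";          // Nutch-internal default name
{code}

Code that reads from Solr can then look up the mapped name from the configuration, falling back to the default:

{code:java|borderStyle=solid}
  // e.g. on the reading side (SolrInputFormat/SolrRecord, sketch):
  String idField = conf.get(SolrConstants.ID_FIELD, SolrConstants.ID_FIELD_DEFAULT);
{code}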


This is called from the dedup() method:
{code:java|borderStyle=solid}
  public boolean dedup(String solrUrl)
      throws IOException, InterruptedException, ClassNotFoundException {
    LOG.info("SolrDeleteDuplicates: starting...");
    LOG.info("SolrDeleteDuplicates: Solr url: " + solrUrl);

    getConf().set(SolrConstants.SERVER_URL, solrUrl);

    // Resolve the mapped field names before the job snapshots the config
    setSolrFieldMappings();

    Job job = new Job(getConf(), "solrdedup");

    job.setInputFormatClass(SolrInputFormat.class);
    job.setOutputFormatClass(NullOutputFormat.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(SolrRecord.class);
    job.setMapperClass(Mapper.class);
    job.setReducerClass(SolrDeleteDuplicates.class);

    return job.waitForCompletion(true);
  }
{code}
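
With these changes in place, running {{bin/nutch solrdedup <solr url>}} should work against an index whose fields were renamed via solrindex-mapping.xml, since the job now queries the mapped names instead of the hard-coded defaults.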




> SolrDeleteDuplicates does not use the mapped Solr field names from solrindex-mapping.xml
> ----------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1896
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1896
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>            Reporter: Brian
>


