Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2012/06/21 00:33:43 UTC

[jira] [Commented] (NUTCH-1406) Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags

    [ https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13397954#comment-13397954 ] 

Markus Jelsma commented on NUTCH-1406:
--------------------------------------

Thanks for contributing. Can you provide a patch against trunk?
                
> Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags
> --------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1406
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1406
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>            Reporter: Kristof 
>            Priority: Minor
>              Labels: conversion, date
>         Attachments: index-metatags.jar
>
>
> This improvement to the index-metatags plugin (sometimes also referred to as the parse-metatags plugin) converts selected fields to the Solr date format and prevents parsing/indexing of metatags that have no content.
> To convert the values of selected metatags to the Solr date format, you must list them in the metatags.convert property in nutch-site.xml. The example below uses a simple Dublin Core element, dc.date. It must also be defined in the metatags.names property.
> {code}
> <property>
> 	<name>metatags.convert</name>
> 	<value>dc.date</value>
> 	<description>For plugin index-metatags: names of the HTML meta tags whose values should be converted to date format, separated by semicolons.
> 	</description>
> </property>
> {code}
> I have read that SimpleDateFormat is not a robust solution, so this improvement might have some problems.
> So far it has worked well for me. More details about the changes are below.
> Changes made to MetaTagsIndexer.java between lines 41 and 71:
> {code}
> 	if (tagEntry != null && tagEntry.trim().length() > 0)
> 	{	
> 		if (checkDateConversion(metatag)) {
> 			
> 			Date date = null;
> 			
> 			try {
> 				date = new SimpleDateFormat("yyyy-MM-dd").parse(tagEntry);
> 				doc.add(metatag, date);
> 			} catch (ParseException e) {
> 				e.printStackTrace();
> 					
> 				if (LOG.isTraceEnabled()) {
> 					LOG.trace(url.toString() + " : date conversion failed for " + tagEntry + " in " + metatag + " field");
> 				}
> 			}
> 		}
> 		else {
> 			doc.add(metatag, tagEntry);
> 		}
> 			      
> 		if (LOG.isTraceEnabled()) {
> 			LOG.trace(url.toString() + " : successfully added " + tagEntry + " to the " + metatag + " field");
> 		}
> 	}
> 	else {
> 					
> 		if (LOG.isTraceEnabled()) {
> 			LOG.trace(url.toString() + " : " + metatag + " and " + tagEntry + " not added as Metatag does not have any content");
> 		}	
> 	}
> {code}
> Method added to MetaTagsIndexer.java:
> {code}
> 	public boolean checkDateConversion (String metatag){
> 		String convertToDate = conf.get("metatags.convert", "*");	
> 		String[] fieldsToConvert = convertToDate.split(";");
> 		boolean convert = false; 
> 			   
> 		for (String check : fieldsToConvert)
> 			if (check.equals(metatag)) convert = true;			   
> 		
> 		return convert;
> 	}
> {code}
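
On the SimpleDateFormat concern raised in the description: by default SimpleDateFormat is lenient (it rolls "2012-02-30" over to 1 March), it ignores trailing text after the date, and it parses in the JVM's default time zone. The following is only a minimal sketch of a stricter alternative, not the attached patch. It assumes Java 8 or later (java.time did not exist when this issue was filed) and produces a Solr-style UTC string rather than a java.util.Date. The class and method names (MetatagDateSketch, shouldConvert, toSolrDate) are hypothetical; the ";" separator is taken from checkDateConversion above.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;
import java.util.Arrays;
import java.util.Optional;

// Hypothetical helper, not part of Nutch or of the attached index-metatags.jar.
final class MetatagDateSketch {

    // "uuuu" (proleptic year) is used because STRICT resolution rejects
    // "yyyy" (year-of-era) when no era is present in the input.
    private static final DateTimeFormatter INPUT =
        DateTimeFormatter.ofPattern("uuuu-MM-dd").withResolverStyle(ResolverStyle.STRICT);

    // Solr's canonical date form: ISO 8601 in UTC, e.g. 2012-06-21T00:00:00Z.
    private static final DateTimeFormatter SOLR =
        DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ss'Z'");

    // True if metatag is one of the ';'-separated names in the configured value.
    static boolean shouldConvert(String configured, String metatag) {
        if (configured == null || metatag == null) {
            return false;
        }
        return Arrays.stream(configured.split(";"))
            .map(String::trim)
            .anyMatch(metatag::equals);
    }

    // Returns the Solr-style date string, or empty if the value is blank
    // or is not a valid yyyy-MM-dd calendar date.
    static Optional<String> toSolrDate(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return Optional.empty();
        }
        try {
            LocalDate day = LocalDate.parse(raw.trim(), INPUT);
            // Assumption: a date-only value is indexed as midnight UTC.
            return Optional.of(day.atStartOfDay(ZoneOffset.UTC).format(SOLR));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(shouldConvert("dc.date;dc.modified", "dc.date"));
        System.out.println(toSolrDate("2012-06-21"));
        System.out.println(toSolrDate("2012-02-30"));
    }
}
```

Returning an empty Optional lets the caller skip the field and log the failure in one place, so a value that failed conversion is never reported as successfully added.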
