Posted to dev@nutch.apache.org by "Kristof (JIRA)" <ji...@apache.org> on 2012/06/20 23:35:42 UTC

[jira] [Created] (NUTCH-1406) Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags

Kristof  created NUTCH-1406:
-------------------------------

             Summary: Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags
                 Key: NUTCH-1406
                 URL: https://issues.apache.org/jira/browse/NUTCH-1406
             Project: Nutch
          Issue Type: Improvement
          Components: indexer, parser
            Reporter: Kristof 
            Priority: Minor


This improvement to the index-metatags plugin (sometimes also referred to as the parse-metatags plugin) allows for conversion of selected fields to the Solr date format and prevents parsing/indexing of metatags that do not contain any content.

In order to convert the values of selected metatags to Solr date format, you must specify them in nutch-site.xml using the metatags.convert property. The example used is the simple Dublin Core element dc.date. It must also be defined in the metatags.names property.
{code}
	<property>
		<name>metatags.convert</name>
		<value>dc.date</value>
		<description>For plugin index-metatags: Indicate here the name of the
		html meta tag that should be converted to date format.
		</description>
	</property>
{code}

I have read that SimpleDateFormat is not a robust solution, so this improvement might have some problems.
So far it has worked well for me. More details about the changes are below.
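
One concrete weakness of SimpleDateFormat is that it parses leniently by default, so an invalid value such as 2012-13-45 is silently rolled over into a different date instead of being rejected. A minimal sketch of stricter parsing follows; the class and method names (StrictDateParse, parseStrict) are hypothetical and not part of the patch.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

final class StrictDateParse {

  // Parses a yyyy-MM-dd value strictly; returns null when the value is not a valid date.
  static Date parseStrict(String value) {
    // SimpleDateFormat is not thread-safe, so a new instance is created per call.
    SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd", Locale.ROOT);
    // With the default lenient mode, "2012-13-45" would roll over into 2013.
    format.setLenient(false);
    try {
      return format.parse(value.trim());
    } catch (ParseException e) {
      return null;
    }
  }

  public static void main(String[] args) {
    System.out.println(parseStrict("2012-06-20") != null); // true
    System.out.println(parseStrict("2012-13-45") != null); // false
    System.out.println(parseStrict("not a date") != null); // false
  }
}
```

Returning null keeps the sketch short; a real indexing filter would more likely log the failure and skip the field.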

Changes made to MetaTagsIndexer.java between lines 41 and 71:
{code}
  // Skip metatags that are missing or contain only whitespace.
  if (tagEntry != null && tagEntry.trim().length() > 0) {
    boolean added = false;

    if (checkDateConversion(metatag)) {
      try {
        // Needs java.text.SimpleDateFormat, java.text.ParseException and java.util.Date.
        Date date = new SimpleDateFormat("yyyy-MM-dd").parse(tagEntry.trim());
        doc.add(metatag, date);
        added = true;
      } catch (ParseException e) {
        // The field is left out of the document when the value is not a yyyy-MM-dd date.
        if (LOG.isTraceEnabled()) {
          LOG.trace(url.toString() + " : date conversion failed for " + tagEntry + " in " + metatag + " field");
        }
      }
    } else {
      doc.add(metatag, tagEntry);
      added = true;
    }

    // Report success only when a value was actually added.
    if (added && LOG.isTraceEnabled()) {
      LOG.trace(url.toString() + " : successfully added " + tagEntry + " to the " + metatag + " field");
    }
  } else {
    if (LOG.isTraceEnabled()) {
      LOG.trace(url.toString() + " : " + metatag + " and " + tagEntry + " not added as Metatag does not have any content");
    }
  }
{code}

Method added to MetaTagsIndexer.java:
{code}
  // Returns true if the given metatag is listed in metatags.convert (names separated by ";").
  public boolean checkDateConversion(String metatag) {
    // The default "*" is not treated as a wildcard; it only matches a tag literally named "*".
    String convertToDate = conf.get("metatags.convert", "*");

    for (String check : convertToDate.split(";")) {
      if (check.trim().equals(metatag)) {
        return true;
      }
    }

    return false;
  }
{code}
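
Since the method above re-reads and re-splits the property for every metatag, one possible variation is to parse the ";"-separated list once and keep it in a set. The sketch below illustrates that idea on its own, without any Nutch classes; DateFieldMatcher and shouldConvert are hypothetical names, not part of the patch.

```java
import java.util.HashSet;
import java.util.Set;

final class DateFieldMatcher {

  // Names of the metatags whose values should be converted to dates.
  private final Set<String> fields = new HashSet<String>();

  // Expects the raw property value, e.g. "dc.date;dcterms.modified".
  DateFieldMatcher(String configured) {
    for (String name : configured.split(";")) {
      String trimmed = name.trim();
      if (!trimmed.isEmpty()) {
        fields.add(trimmed);
      }
    }
  }

  boolean shouldConvert(String metatag) {
    return fields.contains(metatag);
  }

  public static void main(String[] args) {
    DateFieldMatcher matcher = new DateFieldMatcher("dc.date; dcterms.modified");
    System.out.println(matcher.shouldConvert("dc.date"));          // true
    System.out.println(matcher.shouldConvert("dcterms.modified")); // true
    System.out.println(matcher.shouldConvert("description"));      // false
  }
}
```

In a plugin, such a matcher would typically be built once when the configuration is set, rather than per document.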

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (NUTCH-1406) metadata-index plugin: conversion to Solr date format

Posted by "Kristof (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kristof  updated NUTCH-1406:
----------------------------

    Attachment:     (was: index-metadata.patch)
    
> metadata-index plugin: conversion to Solr date format
> -----------------------------------------------------
>
>                 Key: NUTCH-1406
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1406
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>            Reporter: Kristof 
>            Priority: Minor
>              Labels: conversion, date
>         Attachments: index-metadata_formatted.patch
>
>
> This improvement to the index-metatags plugin (sometimes also referred to as the parse-metatags plugin) allows for conversion of selected fields to the Solr date format. The main benefit of this conversion is the possibility to create range facets.
> In order to convert the values of selected metatags to Solr date format, you must specify them in nutch-site.xml using the index.dateconvert.md property. This can be used, for example, with Dublin Core elements. A subdomain whose pages carry the dcterms.modified meta tag is cic.gc.ca. dcterms.modified must also be defined in the metatags.names and index.parse.md properties.
>  
> {code}
> <property>
> 	<name>index.dateconvert.md</name>
> 	<value>metatag.dcterms.modified</value>
> 	<description>For plugin index-metadata: Indicate here the name of the html meta tag that should be converted to Solr date format.
> 	</description>
> </property>
> {code}


        

[jira] [Updated] (NUTCH-1406) Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags

Posted by "Kristof (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kristof  updated NUTCH-1406:
----------------------------

    Description: 
This improvement to the index-metatags plugin (sometimes also referred to as the parse-metatags plugin) allows for conversion of selected fields to the Solr date format and prevents parsing/indexing of metatags that do not contain any content.

In order to convert the values of selected metatags to Solr date format, you must specify them in nutch-site.xml using the metatags.convert property. The example used is the simple Dublin Core element dc.date. It must also be defined in the metatags.names property.
{code}
<property>
	<name>metatags.convert</name>
	<value>dc.date</value>
	<description>For plugin index-metatags: Indicate here the name of the html meta tag that should be converted to date format.
	</description>
</property>
{code}


    
> Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags
> --------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1406
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1406
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>            Reporter: Kristof 
>            Priority: Minor
>              Labels: conversion, date
>         Attachments: index-metatags.jar
>
>


        

[jira] [Commented] (NUTCH-1406) metadata-index plugin: conversion to Solr date format

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399228#comment-13399228 ] 

Markus Jelsma commented on NUTCH-1406:
--------------------------------------

Hello, a few notes on your patch:
* Nutch uses double space for a single indentation, not tabs;
* convertIndicatior seems to be misspelled;
* yyyy-MM-dd doesn't look like Solr's supported DateField as it's missing time and timezone Z.
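
For reference, Solr's DateField expects UTC timestamps of the form yyyy-MM-dd'T'HH:mm:ss'Z'. The sketch below shows one way a yyyy-MM-dd metatag value could be turned into that form, assuming a date-only value should map to midnight UTC. The class and method names are hypothetical, and whether the indexing backend already serializes a java.util.Date this way is not checked here.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

final class SolrDateSketch {

  // Converts "yyyy-MM-dd" into "yyyy-MM-dd'T'HH:mm:ss'Z'" (UTC); returns null if parsing fails.
  static String toSolrDate(String dayValue) {
    TimeZone utc = TimeZone.getTimeZone("UTC");

    // Interpret the date-only value as midnight UTC (an assumption of this sketch).
    SimpleDateFormat input = new SimpleDateFormat("yyyy-MM-dd", Locale.ROOT);
    input.setTimeZone(utc);
    input.setLenient(false);

    // Format with an explicit time part and a literal trailing Z, also in UTC.
    SimpleDateFormat output = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.ROOT);
    output.setTimeZone(utc);

    try {
      return output.format(input.parse(dayValue.trim()));
    } catch (ParseException e) {
      return null;
    }
  }

  public static void main(String[] args) {
    System.out.println(toSolrDate("2012-06-20")); // 2012-06-20T00:00:00Z
  }
}
```

Setting the time zone on both formatters matters: without it, the JVM's default zone would shift the resulting timestamp.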
                
> metadata-index plugin: conversion to Solr date format
> -----------------------------------------------------
>
>                 Key: NUTCH-1406
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1406
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>            Reporter: Kristof 
>            Priority: Minor
>              Labels: conversion, date
>         Attachments: index-metadata.patch
>
>


        

[jira] [Updated] (NUTCH-1406) metadata-index plugin: conversion to Solr date format

Posted by "Kristof (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kristof  updated NUTCH-1406:
----------------------------

    Description: 
This improvement to the index-metatags plugin (sometimes also referred to as the parse-metatags plugin) allows for conversion of selected fields to the Solr date format.

In order to convert the values of selected metatags to Solr date format, you must specify them in nutch-site.xml using the index.dateconvert.md property. This can be used, for example, with Dublin Core elements such as dcterms.modified, with the seed URL http://www.cic.gc.ca. dcterms.modified must also be defined in the metatags.names and index.parse.md properties.
{code}
<property>
	<name>index.dateconvert.md</name>
	<value>metatag.dcterms.modified</value>
	<description>For plugin index-metadata: Indicate here the name of the html meta tag that should be converted to Solr date format.
	</description>
</property>
{code}


  was:
This improvement to the index-metatags plugin (sometimes also referred to as the parse-metatags plugin) allows for conversion of selected fields to the Solr date format.

In order to convert the values of selected metatags to Solr date format, you must specify them in nutch-site.xml using the index.dateconvert.md property. The example used is the extended Dublin Core element dcterms.modified, with the seed URL http://www.cic.gc.ca/. dcterms.modified must also be defined in the metatags.names property.
{code}
<property>
	<name>index.dateconvert.md</name>
	<value>dcterms.modified</value>
	<description>For plugin index-metadata: Indicate here the name of the html meta tag that should be converted to Solr date format.
	</description>
</property>
{code}


    
> metadata-index plugin: conversion to Solr date format
> -----------------------------------------------------
>
>                 Key: NUTCH-1406
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1406
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>            Reporter: Kristof 
>            Priority: Minor
>              Labels: conversion, date
>         Attachments: index-metadata.patch
>
>


        

[jira] [Updated] (NUTCH-1406) metadata-index plugin: conversion to Solr date format

Posted by "Kristof (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kristof  updated NUTCH-1406:
----------------------------

    Description: 
This improvement to the index-metatags plugin (sometimes also referred to as the parse-metatags plugin) allows for conversion of selected fields to the Solr date format.

In order to convert the values of selected metatags to Solr date format, you must specify them in nutch-site.xml using the index.dateconvert.md property. The example used is the extended Dublin Core element dcterms.modified, with the seed URL http://www.cic.gc.ca/. dcterms.modified must also be defined in the metatags.names property.
{code}
<property>
	<name>index.dateconvert.md</name>
	<value>dcterms.modified</value>
	<description>For plugin index-metadata: Indicate here the name of the html meta tag that should be converted to Solr date format.
	</description>
</property>
{code}


  was:
This improvement to the index-metatags plugin (sometimes also referred to as the parse-metatags plugin) allows for conversion of selected fields to the Solr date format and prevents parsing/indexing of metatags that do not contain any content.

In order to convert the values of selected metatags to Solr date format, you must specify them in nutch-site.xml using the index.dateconvert.md property. The example used is the extended Dublin Core element dcterms.modified, with the seed URL http://www.cic.gc.ca/. dcterms.modified must also be defined in the metatags.names property.
{code}
<property>
	<name>index.dateconvert.md</name>
	<value>dcterms.modified</value>
	<description>For plugin index-metadata: Indicate here the name of the html meta tag that should be converted to Solr date format.
	</description>
</property>
{code}


        Summary: metadata-index plugin: conversion to Solr date format  (was: Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags)
    
> metadata-index plugin: conversion to Solr date format
> -----------------------------------------------------
>
>                 Key: NUTCH-1406
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1406
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>            Reporter: Kristof 
>            Priority: Minor
>              Labels: conversion, date
>         Attachments: index-metadata.patch
>
>


        

[jira] [Updated] (NUTCH-1406) Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags

Posted by "Kristof (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kristof  updated NUTCH-1406:
----------------------------

    Description: 
This improvement to the index-metatags plugin (sometimes also referred to as the parse-metatags plugin) allows for conversion of selected fields to the Solr date format and prevents parsing/indexing of metatags that do not contain any content.

In order to convert the values of selected metatags to Solr date format, you must specify them in nutch-site.xml using the metatags.convert property. The example used is the extended Dublin Core element dcterms.modified, with the seed URL http://www.cic.gc.ca/. dcterms.modified must also be defined in the metatags.names property.
{code}
<property>
	<name>metatags.convert</name>
	<value>dcterms.modified</value>
	<description>For plugin index-metadata: Indicate here the name of the html meta tag that should be converted to Solr date format.
	</description>
</property>
{code}

I have read that SimpleDateFormat is not a robust solution, so this improvement might have some problems.
So far it has worked well for me.

    
> Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags
> --------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1406
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1406
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>            Reporter: Kristof 
>            Priority: Minor
>              Labels: conversion, date
>         Attachments: index-metadata-plugin.patch, index-metatags.jar
>
>


        

[jira] [Updated] (NUTCH-1406) metadata-index plugin: conversion to Solr date format

Posted by "Kristof (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kristof  updated NUTCH-1406:
----------------------------

    Description: 
This improvement to the index-metatags plugin (sometimes also referred to as the parse-metatags plugin) allows for conversion of selected fields to the Solr date format.

In order to convert the values of selected metatags to Solr date format, you must specify them in nutch-site.xml using the index.dateconvert.md property. This can be used, for example, with Dublin Core elements. A subdomain whose pages carry the dcterms.modified meta tag is cic.gc.ca. dcterms.modified must also be defined in the metatags.names and index.parse.md properties.
 
{code}
<property>
	<name>index.dateconvert.md</name>
	<value>metatag.dcterms.modified</value>
	<description>For plugin index-metadata: Indicate here the name of the html meta tag that should be converted to Solr date format.
	</description>
</property>
{code}


    
> metadata-index plugin: conversion to Solr date format
> -----------------------------------------------------
>
>                 Key: NUTCH-1406
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1406
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>            Reporter: Kristof 
>            Priority: Minor
>              Labels: conversion, date
>         Attachments: index-metadata.patch
>
>


        

[jira] [Updated] (NUTCH-1406) Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags

Posted by "Kristof (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kristof  updated NUTCH-1406:
----------------------------

    Description: 
This improvement to the index-metatags plugin (sometimes also referred to as the parse-metatags plugin) allows for conversion of selected fields to the Solr date format and prevents parsing/indexing of metatags that do not contain any content.

In order to convert the values of selected metatags to Solr date format, you must specify them in nutch-site.xml using the metatags.convert property. The example used is the extended Dublin Core element dcterms.modified, with the seed URL http://www.cic.gc.ca/. dcterms.modified must also be defined in the metatags.names property.
{code}
<property>
	<name>metatags.convert</name>
	<value>dcterms.modified</value>
	<description>For plugin index-metadata: Indicate here the name of the html meta tag that should be converted to Solr date format.
	</description>
</property>
{code}

I have read that SimpleDateFormat is not a robust solution, so this improvement might have some problems.
So far it has worked well for me.

Please note:
The attached jar-file was originally taken from NUTCH-809 (https://issues.apache.org/jira/browse/NUTCH-809). The plugin and tutorial there do not necessarily match the index-metadata plugin in subversion.

    
> Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags
> --------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1406
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1406
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>            Reporter: Kristof 
>            Priority: Minor
>              Labels: conversion, date
>         Attachments: index-metadata-plugin.patch, index-metatags.jar
>
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1406) Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags

Posted by "Kristof (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13398296#comment-13398296 ] 

Kristof  commented on NUTCH-1406:
---------------------------------

Markus, I will provide the patch against trunk. However, since I used the metatags-plugin+tutorial.zip provided under NUTCH-809, I need to transfer the adjustments to the trunk files. I am having some problems building the classes with ant and will come back to this after the weekend, once I have more time to look into it.
Julien, thanks for the link.
                
> Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags
> --------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1406
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1406
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>            Reporter: Kristof 
>            Priority: Minor
>              Labels: conversion, date
>         Attachments: index-metatags.jar
>
>
> This improvement to the index-metatags plugin (sometimes also referred to as the parse-metatags plugin) allows selected fields to be converted to the Solr date format and prevents parsing/indexing of metatags that do not contain any content.
> To convert the values of selected metatags to the Solr date format, you must list them in the metatags.convert property in nutch-site.xml. The example used is a simple Dublin Core element, dc.date. It must also be defined in the metatags.names property.
> {code}
> <property>
> 	<name>metatags.convert</name>
> 	<value>dc.date</value>
> 	<description>For plugin index-metatags: Indicate here the name of the html meta tag that should be converted to date format.
> 	</description>
> </property>
> {code}
> I read that SimpleDateFormat is not a robust solution, so this improvement might have some problems.
> So far it has worked well for me. More details about the changes are below.
> Changes made to MetaTagsIndexer.java between lines 41 and 71:
> {code}
> 	// Only index metatags that actually carry content.
> 	if (tagEntry != null && tagEntry.trim().length() > 0) {
> 		if (checkDateConversion(metatag)) {
> 			try {
> 				// Parses the leading yyyy-MM-dd portion of the value, e.g. 2012-06-20.
> 				Date date = new SimpleDateFormat("yyyy-MM-dd").parse(tagEntry.trim());
> 				doc.add(metatag, date);
> 				if (LOG.isTraceEnabled()) {
> 					LOG.trace(url.toString() + " : successfully added " + tagEntry + " to the " + metatag + " field");
> 				}
> 			} catch (ParseException e) {
> 				// The field is skipped for this document; log the failure
> 				// through the plugin logger instead of printing a stack trace.
> 				LOG.warn(url.toString() + " : date conversion failed for " + tagEntry + " in " + metatag + " field", e);
> 			}
> 		} else {
> 			doc.add(metatag, tagEntry);
> 			if (LOG.isTraceEnabled()) {
> 				LOG.trace(url.toString() + " : successfully added " + tagEntry + " to the " + metatag + " field");
> 			}
> 		}
> 	} else if (LOG.isTraceEnabled()) {
> 		LOG.trace(url.toString() + " : " + metatag + " and " + tagEntry + " not added as Metatag does not have any content");
> 	}
> {code}
> Method added to MetaTagsIndexer.java:
> {code}
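> 	public boolean checkDateConversion(String metatag) {
> 		// Semicolon-separated list of metatag names to convert.
> 		// The default "*" is compared literally, so nothing is converted unless it is listed.
> 		String convertToDate = conf.get("metatags.convert", "*");
> 		for (String check : convertToDate.split(";")) {
> 			// Tolerate whitespace around the names and stop at the first match.
> 			if (check.trim().equals(metatag)) {
> 				return true;
> 			}
> 		}
> 		return false;
> 	}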
> {code}
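
A self-contained sketch of the two pieces quoted above, for anyone who wants to try the idea outside Nutch. It is not the code from the attached patch: the class and method names are hypothetical, it uses java.time rather than SimpleDateFormat so that malformed values such as 2012-02-30 are rejected instead of being rolled over, and it assumes the semicolon-separated list format used by checkDateConversion. The resulting string follows Solr's date syntax (UTC, for example 2012-06-20T00:00:00Z).

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.LinkedHashSet;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper, not part of Nutch or of the attached patch.
class MetatagDateSketch {

    // Splits a semicolon-separated property value into a set of tag names,
    // so the list is parsed once instead of on every lookup.
    static Set<String> parseFieldList(String propertyValue) {
        Set<String> fields = new LinkedHashSet<>();
        if (propertyValue == null) {
            return fields;
        }
        for (String name : propertyValue.split(";")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                fields.add(trimmed);
            }
        }
        return fields;
    }

    // Converts a strict yyyy-MM-dd value to Solr's UTC date syntax.
    // Returns an empty Optional for null, blank, or malformed input.
    static Optional<String> toSolrDate(String tagEntry) {
        if (tagEntry == null || tagEntry.trim().isEmpty()) {
            return Optional.empty();
        }
        try {
            LocalDate day = LocalDate.parse(tagEntry.trim(), DateTimeFormatter.ISO_LOCAL_DATE);
            // Midnight UTC; Instant.toString() yields the ISO-8601 form ending in 'Z'.
            return Optional.of(day.atStartOfDay(ZoneOffset.UTC).toInstant().toString());
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        Set<String> convert = parseFieldList("dc.date; dcterms.modified");
        String metatag = "dcterms.modified";
        if (convert.contains(metatag)) {
            System.out.println(toSolrDate("2012-06-20").orElse("(not a valid date)"));
        }
    }
}
```

Returning an Optional leaves the caller to decide whether an unparseable value is skipped, logged, or indexed as plain text, which is the decision the catch block in the quoted snippet makes.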


        

[jira] [Commented] (NUTCH-1406) Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13397954#comment-13397954 ] 

Markus Jelsma commented on NUTCH-1406:
--------------------------------------

Thanks for contributing. Can you provide a patch against trunk?
                
> Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags
> --------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1406
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1406
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>            Reporter: Kristof 
>            Priority: Minor
>              Labels: conversion, date
>         Attachments: index-metatags.jar
>
>


        

[jira] [Updated] (NUTCH-1406) metadata-index plugin: conversion to Solr date format

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1406:
---------------------------------

    Fix Version/s: 1.6
    
> metadata-index plugin: conversion to Solr date format
> -----------------------------------------------------
>
>                 Key: NUTCH-1406
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1406
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>            Reporter: Kristof 
>            Priority: Minor
>              Labels: conversion, date
>             Fix For: 1.6
>
>         Attachments: index-metadata_formatted.patch
>
>
> This improvement to the index-metatags plugin (sometimes also referred to as the parse-metatags plugin) allows selected fields to be converted to the Solr date format. The main benefit of this conversion is that it makes range facets possible.
> To convert the values of selected metatags to the Solr date format, you must list them in the index.dateconvert.md property in nutch-site.xml. This can be used, for example, with Dublin Core elements. One site whose pages carry the dcterms.modified meta tag is cic.gc.ca. dcterms.modified must also be defined in the metatags.names and index.parse.md properties.
>  
> {code}
> <property>
> 	<name>index.dateconvert.md</name>
> 	<value>metatag.dcterms.modified</value>
> 	<description>For plugin index-metadata: Indicate here the name of the html meta tag that should be converted to Solr date format.
> 	</description>
> </property>
> {code}


        

[jira] [Updated] (NUTCH-1406) Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags

Posted by "Kristof (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kristof  updated NUTCH-1406:
----------------------------

    Attachment: index-metatags.jar

jar-file containing plugin
                
> Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags
> --------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1406
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1406
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>            Reporter: Kristof 
>            Priority: Minor
>              Labels: conversion, date
>         Attachments: index-metatags.jar
>
>


        

[jira] [Updated] (NUTCH-1406) Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags

Posted by "Kristof (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kristof  updated NUTCH-1406:
----------------------------

    Attachment:     (was: index-metadata-plugin.patch)
    
> Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags
> --------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1406
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1406
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>            Reporter: Kristof 
>            Priority: Minor
>              Labels: conversion, date
>         Attachments: index-metadata.patch
>
>


        

[jira] [Commented] (NUTCH-1406) metadata-index plugin: conversion to Solr date format

Posted by "Kristof (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399222#comment-13399222 ] 

Kristof  commented on NUTCH-1406:
---------------------------------

Thank you for the clarification. When I originally looked for a plugin to index metadata early this year, the index-metatags was the one available. Hence I developed based on this, only realizing after trying to get it working with trunk that something did not add up. Obviously building on the committed index-metadata version is the way to go. I attached the hopefully correct way to patch it, and removed the wrong version and any information that might be misleading. I was not able to make extensive tests though as this was done using the version initially posted in NUTCH-809.
                
> metadata-index plugin: conversion to Solr date format
> -----------------------------------------------------
>
>                 Key: NUTCH-1406
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1406
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>            Reporter: Kristof 
>            Priority: Minor
>              Labels: conversion, date
>         Attachments: index-metadata.patch
>
>
> This improvement to the index-metatags plugin (sometimes also referred to as the parse-metatags plugin) allows selected fields to be converted to the Solr date format.
> To convert the values of selected metatags to the Solr date format, you must list them in the index.dateconvert.md property in nutch-site.xml. The example used is an extended Dublin Core element, dcterms.modified, with the seed URL http://www.cic.gc.ca/. dcterms.modified must also be defined in the metatags.names property.
> {code}
> <property>
> 	<name>index.dateconvert.md</name>
> 	<value>dcterms.modified</value>
> 	<description>For plugin index-metadata: Indicate here the name of the html meta tag that should be converted to Solr date format.
> 	</description>
> </property>
> {code}


        

[jira] [Commented] (NUTCH-1406) Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13398247#comment-13398247 ] 

Julien Nioche commented on NUTCH-1406:
--------------------------------------

See http://wiki.apache.org/nutch/HowToContribute
                
> Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags
> --------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1406
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1406
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>            Reporter: Kristof 
>            Priority: Minor
>              Labels: conversion, date
>         Attachments: index-metatags.jar
>
>


        

[jira] [Updated] (NUTCH-1406) Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags

Posted by "Kristof (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kristof  updated NUTCH-1406:
----------------------------

    Attachment: index-metadata.patch
    
> Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags
> --------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1406
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1406
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>            Reporter: Kristof 
>            Priority: Minor
>              Labels: conversion, date
>         Attachments: index-metadata.patch
>
>


        

[jira] [Updated] (NUTCH-1406) Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags

Posted by "Kristof (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kristof  updated NUTCH-1406:
----------------------------

    Attachment:     (was: index-metatags.jar)
    
> Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags
> --------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1406
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1406
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>            Reporter: Kristof 
>            Priority: Minor
>              Labels: conversion, date
>         Attachments: index-metadata.patch
>
>


        

[jira] [Updated] (NUTCH-1406) Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags

Posted by "Kristof (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kristof  updated NUTCH-1406:
----------------------------

    Description: 
This improvement to the index-metatags plugin (sometimes also referred to as the parse-metatags plugin) allows selected fields to be converted to the Solr date format and prevents parsing/indexing of metatags that do not contain any content.

To convert the values of selected metatags to the Solr date format, you must list them in the index.dateconvert.md property in nutch-site.xml. The example used is an extended Dublin Core element, dcterms.modified, with the seed URL http://www.cic.gc.ca/. dcterms.modified must also be defined in the metatags.names property.
{code}
<property>
	<name>index.dateconvert.md</name>
	<value>dcterms.modified</value>
	<description>For plugin index-metadata: Indicate here the name of the html meta tag that should be converted to Solr date format.
	</description>
</property>
{code}


  was:
This improvement to the index-metatags plugin (sometimes also refered to parse-metatags plugin) allows for conversion of selected fields to the Solr date format and prevents parsing/indexing of metatags that do not contain any content.

In order to convert the values of selected metatags to Solr date format, you must specify in nutch-site.xml. The example used is an extended Dublin Core element dcterms.modified with the seed url http://www.cic.gc.ca/. dcterms.modified must also be defined in the metatags.names property.
{code}
<property>
	<name>metatags.convert</name>
	<value>dcterms.modified</value>
	<description>For plugin index-metadata: Indicate here the name of the html meta tag that should be converted to Solr date format.
	</description>
</property>
{code}

I read that SimpleDateFormat format is not a robust solution, so this improvement might have some problems.
So far it worked well for me. Below more details about the changes.

Please note:
The attached jar-file was originally taken from NUTCH-809 (https://issues.apache.org/jira/browse/NUTCH-809). The plugin and tutorial there do not necessarily match the index-metadata plugin in subversion.

    
> Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags
> --------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1406
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1406
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>            Reporter: Kristof 
>            Priority: Minor
>              Labels: conversion, date
>         Attachments: index-metadata.patch
>
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (NUTCH-1406) metadata-index plugin: conversion to Solr date format

Posted by "Kristof (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kristof  updated NUTCH-1406:
----------------------------

    Description: 
This improvement to the index-metatags plugin (sometimes also referred to as the parse-metatags plugin) allows for conversion of selected fields to the Solr date format. The main benefit of this conversion is the possibility to create range facets.

In order to convert the values of selected metatags to Solr date format, you must specify them in nutch-site.xml. This can be used, for example, with Dublin Core elements. One site whose pages carry the dcterms.modified meta tag is cic.gc.ca. dcterms.modified must also be defined in the metatags.names and index.parse.md properties.
 
{code}
<property>
	<name>index.dateconvert.md</name>
	<value>metatag.dcterms.modified</value>
	<description>For plugin index-metadata: Indicate here the name of the html meta tag that should be converted to Solr date format.
	</description>
</property>
{code}


  was:
This improvement to the index-metatags plugin (sometimes also referred to as the parse-metatags plugin) allows for conversion of selected fields to the Solr date format.

In order to convert the values of selected metatags to Solr date format, you must specify them in nutch-site.xml. This can be used, for example, with Dublin Core elements. One site whose pages carry the dcterms.modified meta tag is cic.gc.ca. dcterms.modified must also be defined in the metatags.names and index.parse.md properties.
 
{code}
<property>
	<name>index.dateconvert.md</name>
	<value>metatag.dcterms.modified</value>
	<description>For plugin index-metadata: Indicate here the name of the html meta tag that should be converted to Solr date format.
	</description>
</property>
{code}


    
> metadata-index plugin: conversion to Solr date format
> -----------------------------------------------------
>
>                 Key: NUTCH-1406
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1406
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>            Reporter: Kristof 
>            Priority: Minor
>              Labels: conversion, date
>         Attachments: index-metadata.patch
>
>
> This improvement to the index-metatags plugin (sometimes also referred to as the parse-metatags plugin) allows for conversion of selected fields to the Solr date format. The main benefit of this conversion is the possibility to create range facets.
> In order to convert the values of selected metatags to Solr date format, you must specify them in nutch-site.xml. This can be used, for example, with Dublin Core elements. One site whose pages carry the dcterms.modified meta tag is cic.gc.ca. dcterms.modified must also be defined in the metatags.names and index.parse.md properties.
>  
> {code}
> <property>
> 	<name>index.dateconvert.md</name>
> 	<value>metatag.dcterms.modified</value>
> 	<description>For plugin index-metadata: Indicate here the name of the html meta tag that should be converted to Solr date format.
> 	</description>
> </property>
> {code}


        

[jira] [Updated] (NUTCH-1406) Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags

Posted by "Kristof (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kristof  updated NUTCH-1406:
----------------------------

    Comment: was deleted

(was: jar-file containing plugin)
    
> Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags
> --------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1406
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1406
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>            Reporter: Kristof 
>            Priority: Minor
>              Labels: conversion, date
>         Attachments: index-metatags.jar
>
>
> This improvement to the index-metatags plugin (sometimes also referred to as the parse-metatags plugin) allows for conversion of selected fields to the Solr date format and prevents parsing/indexing of metatags that do not contain any content.
> In order to convert the values of selected metatags to Solr date format, you must specify them in nutch-site.xml. The example used is a simple Dublin Core element, dc.date. It must also be defined in the metatags.names property.
> {code}
> <property>
> 	<name>metatags.convert</name>
> 	<value>dc.date</value>
> 	<description>For plugin index-metatags: Indicate here the name of the html meta tag that should be converted to date format.
> 	</description>
> </property>
> {code}
> I read that SimpleDateFormat is not a robust solution, so this improvement might have some problems.
> So far it has worked well for me. More details about the changes are below.
> Changes made to MetaTagsIndexer.java between lines 41 and 71:
> {code}
> 	// Only index metatags that actually have content.
> 	if (tagEntry != null && tagEntry.trim().length() > 0) {
> 		// Tracks whether the value really reached the document.
> 		boolean added = false;
> 
> 		if (checkDateConversion(metatag)) {
> 			try {
> 				// Handles date-only values such as 2012-06-20.
> 				Date date = new SimpleDateFormat("yyyy-MM-dd").parse(tagEntry);
> 				doc.add(metatag, date);
> 				added = true;
> 			} catch (ParseException e) {
> 				// Log the failure instead of printing a stack trace.
> 				if (LOG.isTraceEnabled()) {
> 					LOG.trace(url.toString() + " : date conversion failed for " + tagEntry + " in " + metatag + " field");
> 				}
> 			}
> 		} else {
> 			doc.add(metatag, tagEntry);
> 			added = true;
> 		}
> 
> 		// Report success only when the value was added.
> 		if (added && LOG.isTraceEnabled()) {
> 			LOG.trace(url.toString() + " : successfully added " + tagEntry + " to the " + metatag + " field");
> 		}
> 	} else {
> 		if (LOG.isTraceEnabled()) {
> 			LOG.trace(url.toString() + " : " + metatag + " and " + tagEntry + " not added as Metatag does not have any content");
> 		}
> 	}
> {code}
> Method added to MetaTagsIndexer.java:
> {code}
> 	public boolean checkDateConversion (String metatag){
> 		String convertToDate = conf.get("metatags.convert", "*");	
> 		String[] fieldsToConvert = convertToDate.split(";");
> 		boolean convert = false; 
> 			   
> 		for (String check : fieldsToConvert)
> 			if (check.equals(metatag)) convert = true;			   
> 		
> 		return convert;
> 	}
> {code}


        

[jira] [Updated] (NUTCH-1406) index-metadata plugin: conversion to Solr date format

Posted by "Kristof (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kristof  updated NUTCH-1406:
----------------------------

    Description: 
This improvement to the index-metadata plugin allows for conversion of selected fields to the Solr date format. The main benefit of this conversion is the possibility to create range facets.

In order to convert the values of selected metatags to Solr date format, you must specify them in nutch-site.xml. This can be used, for example, with Dublin Core elements. One site whose pages carry the dcterms.modified meta tag is cic.gc.ca. dcterms.modified must also be defined in the metatags.names and index.parse.md properties.
 
{code}
<property>
	<name>index.dateconvert.md</name>
	<value>metatag.dcterms.modified</value>
	<description>For plugin index-metadata: Indicate here the name of the html meta tag that should be converted to Solr date format.
	</description>
</property>
{code}


  was:
This improvement to the index-metatags plugin (sometimes also referred to as the parse-metatags plugin) allows for conversion of selected fields to the Solr date format. The main benefit of this conversion is the possibility to create range facets.

In order to convert the values of selected metatags to Solr date format, you must specify them in nutch-site.xml. This can be used, for example, with Dublin Core elements. One site whose pages carry the dcterms.modified meta tag is cic.gc.ca. dcterms.modified must also be defined in the metatags.names and index.parse.md properties.
 
{code}
<property>
	<name>index.dateconvert.md</name>
	<value>metatag.dcterms.modified</value>
	<description>For plugin index-metadata: Indicate here the name of the html meta tag that should be converted to Solr date format.
	</description>
</property>
{code}


        Summary: index-metadata plugin: conversion to Solr date format  (was: metadata-index plugin: conversion to Solr date format)
    
> index-metadata plugin: conversion to Solr date format
> -----------------------------------------------------
>
>                 Key: NUTCH-1406
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1406
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>            Reporter: Kristof 
>            Priority: Minor
>              Labels: conversion, date
>             Fix For: 1.6
>
>         Attachments: index-metadata_formatted.patch
>
>
> This improvement to the index-metadata plugin allows for conversion of selected fields to the Solr date format. The main benefit of this conversion is the possibility to create range facets.
> In order to convert the values of selected metatags to Solr date format, you must specify them in nutch-site.xml. This can be used, for example, with Dublin Core elements. One site whose pages carry the dcterms.modified meta tag is cic.gc.ca. dcterms.modified must also be defined in the metatags.names and index.parse.md properties.
>  
> {code}
> <property>
> 	<name>index.dateconvert.md</name>
> 	<value>metatag.dcterms.modified</value>
> 	<description>For plugin index-metadata: Indicate here the name of the html meta tag that should be converted to Solr date format.
> 	</description>
> </property>
> {code}


        

[jira] [Updated] (NUTCH-1406) Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags

Posted by "Kristof (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kristof  updated NUTCH-1406:
----------------------------

    Description: 
This improvement to the index-metatags plugin (sometimes also referred to as the parse-metatags plugin) allows for conversion of selected fields to the Solr date format and prevents parsing/indexing of metatags that do not contain any content.

In order to convert the values of selected metatags to Solr date format, you must specify them in nutch-site.xml. The example used is a simple Dublin Core element, dc.date. It must also be defined in the metatags.names property.
{code}
<property>
	<name>metatags.convert</name>
	<value>dc.date</value>
	<description>For plugin index-metatags: Indicate here the name of the html meta tag that should be converted to date format.
	</description>
</property>
{code}

I read that SimpleDateFormat is not a robust solution, so this improvement might have some problems.
So far it has worked well for me. More details about the changes are below.

  was:
This improvement to the index-metatags plugin (sometimes also referred to as the parse-metatags plugin) allows for conversion of selected fields to the Solr date format and prevents parsing/indexing of metatags that do not contain any content.

In order to convert the values of selected metatags to Solr date format, you must specify them in nutch-site.xml. The example used is a simple Dublin Core element, dc.date. It must also be defined in the metatags.names property.
{code}
<property>
	<name>metatags.convert</name>
	<value>dc.date</value>
	<description>For plugin index-metatags: Indicate here the name of the html meta tag that should be converted to date format.
	</description>
</property>
{code}

I read that SimpleDateFormat is not a robust solution, so this improvement might have some problems.
So far it has worked well for me. More details about the changes are below.

Changes made to MetaTagsIndexer.java between lines 41 and 71:
{code}
	// Only index metatags that actually have content.
	if (tagEntry != null && tagEntry.trim().length() > 0) {
		// Tracks whether the value really reached the document.
		boolean added = false;

		if (checkDateConversion(metatag)) {
			try {
				// Handles date-only values such as 2012-06-20.
				Date date = new SimpleDateFormat("yyyy-MM-dd").parse(tagEntry);
				doc.add(metatag, date);
				added = true;
			} catch (ParseException e) {
				// Log the failure instead of printing a stack trace.
				if (LOG.isTraceEnabled()) {
					LOG.trace(url.toString() + " : date conversion failed for " + tagEntry + " in " + metatag + " field");
				}
			}
		} else {
			doc.add(metatag, tagEntry);
			added = true;
		}

		// Report success only when the value was added.
		if (added && LOG.isTraceEnabled()) {
			LOG.trace(url.toString() + " : successfully added " + tagEntry + " to the " + metatag + " field");
		}
	} else {
		if (LOG.isTraceEnabled()) {
			LOG.trace(url.toString() + " : " + metatag + " and " + tagEntry + " not added as Metatag does not have any content");
		}
	}
{code}

Method added to MetaTagsIndexer.java:
{code}
	public boolean checkDateConversion (String metatag){
		String convertToDate = conf.get("metatags.convert", "*");	
		String[] fieldsToConvert = convertToDate.split(";");
		boolean convert = false; 
			   
		for (String check : fieldsToConvert)
			if (check.equals(metatag)) convert = true;			   
		
		return convert;
	}
{code}

    
> Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags
> --------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1406
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1406
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>            Reporter: Kristof 
>            Priority: Minor
>              Labels: conversion, date
>         Attachments: index-metadata-plugin.patch, index-metatags.jar
>
>
> This improvement to the index-metatags plugin (sometimes also referred to as the parse-metatags plugin) allows for conversion of selected fields to the Solr date format and prevents parsing/indexing of metatags that do not contain any content.
> In order to convert the values of selected metatags to Solr date format, you must specify them in nutch-site.xml. The example used is a simple Dublin Core element, dc.date. It must also be defined in the metatags.names property.
> {code}
> <property>
> 	<name>metatags.convert</name>
> 	<value>dc.date</value>
> 	<description>For plugin index-metatags: Indicate here the name of the html meta tag that should be converted to date format.
> 	</description>
> </property>
> {code}
> I read that SimpleDateFormat is not a robust solution, so this improvement might have some problems.
> So far it has worked well for me. More details about the changes are below.


        

[jira] [Commented] (NUTCH-1406) Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13398420#comment-13398420 ] 

Julien Nioche commented on NUTCH-1406:
--------------------------------------

bq. index-metatags plugin (sometimes also referred to as the parse-metatags plugin)

For the sake of clarification, this patch is about index-metadata, not parse-metatags (which was index-metatags at one point). This confusion explains why this patch is definitely wrong. You're basically replacing a more advanced version with the older and more primitive index-metatags (with the added twist of date conversion). What you could do instead would be to keep the existing MetadataIndexer but specify via configuration the field names that should be converted, e.g. index.md.date, with the value being a comma-separated list of field names.
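
To make the configuration idea above concrete, here is a minimal sketch of reading a comma-separated list of field names and checking whether a given field should be converted. Only the property name index.md.date comes from the comment; the class name, method name, and the sample field names are hypothetical, and the sketch uses a plain String rather than the Nutch/Hadoop configuration object so that it stays self-contained.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class DateFieldConfig {

    /**
     * Splits a comma-separated property value (e.g. the value of a
     * hypothetical index.md.date property) into trimmed, non-empty names.
     */
    static Set<String> parseFieldNames(String propertyValue) {
        Set<String> names = new LinkedHashSet<>();
        if (propertyValue == null) {
            return names; // property not set: convert nothing
        }
        for (String part : propertyValue.split(",")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Sample value as it might appear in nutch-site.xml (assumed).
        Set<String> dateFields =
            parseFieldNames("metatag.dcterms.modified, metatag.dc.date");

        System.out.println(dateFields.contains("metatag.dcterms.modified")); // true
        System.out.println(dateFields.contains("metatag.description"));      // false
    }
}
```

Parsing the list once and keeping it in a Set avoids re-splitting the property for every document, which the per-call lookup in the original patch would do.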

                
> Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags
> --------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1406
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1406
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>            Reporter: Kristof 
>            Priority: Minor
>              Labels: conversion, date
>         Attachments: index-metadata-plugin.patch, index-metatags.jar
>
>
> This improvement to the index-metatags plugin (sometimes also referred to as the parse-metatags plugin) allows for conversion of selected fields to the Solr date format and prevents parsing/indexing of metatags that do not contain any content.
> In order to convert the values of selected metatags to Solr date format, you must specify them in nutch-site.xml. The example used is an extended Dublin Core element, dcterms.modified, with the seed URL http://www.cic.gc.ca/. dcterms.modified must also be defined in the metatags.names property.
> {code}
> <property>
> 	<name>metatags.convert</name>
> 	<value>dcterms.modified</value>
> 	<description>For plugin index-metadata: Indicate here the name of the html meta tag that should be converted to Solr date format.
> 	</description>
> </property>
> {code}
> I read that SimpleDateFormat is not a robust solution, so this improvement might have some problems.
> So far it has worked well for me. More details about the changes are below.
> Please note:
> The attached jar-file was originally taken from NUTCH-809 (https://issues.apache.org/jira/browse/NUTCH-809). The plugin and tutorial there do not necessarily match the index-metadata plugin in subversion.


        

[jira] [Updated] (NUTCH-1406) Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags

Posted by "Kristof (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kristof  updated NUTCH-1406:
----------------------------

    Attachment: index-metadata-plugin.patch
    
> Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags
> --------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1406
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1406
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>            Reporter: Kristof 
>            Priority: Minor
>              Labels: conversion, date
>         Attachments: index-metadata-plugin.patch, index-metatags.jar
>
>
> This improvement to the index-metatags plugin (sometimes also referred to as the parse-metatags plugin) allows for conversion of selected fields to the Solr date format and prevents parsing/indexing of metatags that do not contain any content.
> In order to convert the values of selected metatags to Solr date format, you must specify them in nutch-site.xml. The example used is a simple Dublin Core element, dc.date. It must also be defined in the metatags.names property.
> {code}
> <property>
> 	<name>metatags.convert</name>
> 	<value>dc.date</value>
> 	<description>For plugin index-metatags: Indicate here the name of the html meta tag that should be converted to date format.
> 	</description>
> </property>
> {code}
> I read that SimpleDateFormat is not a robust solution, so this improvement might have some problems.
> So far it has worked well for me. More details about the changes are below.
> Changes made to MetaTagsIndexer.java between lines 41 and 71:
> {code}
> 	// Only index metatags that actually have content.
> 	if (tagEntry != null && tagEntry.trim().length() > 0) {
> 		// Tracks whether the value really reached the document.
> 		boolean added = false;
> 
> 		if (checkDateConversion(metatag)) {
> 			try {
> 				// Handles date-only values such as 2012-06-20.
> 				Date date = new SimpleDateFormat("yyyy-MM-dd").parse(tagEntry);
> 				doc.add(metatag, date);
> 				added = true;
> 			} catch (ParseException e) {
> 				// Log the failure instead of printing a stack trace.
> 				if (LOG.isTraceEnabled()) {
> 					LOG.trace(url.toString() + " : date conversion failed for " + tagEntry + " in " + metatag + " field");
> 				}
> 			}
> 		} else {
> 			doc.add(metatag, tagEntry);
> 			added = true;
> 		}
> 
> 		// Report success only when the value was added.
> 		if (added && LOG.isTraceEnabled()) {
> 			LOG.trace(url.toString() + " : successfully added " + tagEntry + " to the " + metatag + " field");
> 		}
> 	} else {
> 		if (LOG.isTraceEnabled()) {
> 			LOG.trace(url.toString() + " : " + metatag + " and " + tagEntry + " not added as Metatag does not have any content");
> 		}
> 	}
> {code}
> Method added to MetaTagsIndexer.java:
> {code}
> 	public boolean checkDateConversion (String metatag){
> 		String convertToDate = conf.get("metatags.convert", "*");	
> 		String[] fieldsToConvert = convertToDate.split(";");
> 		boolean convert = false; 
> 			   
> 		for (String check : fieldsToConvert)
> 			if (check.equals(metatag)) convert = true;			   
> 		
> 		return convert;
> 	}
> {code}


        

[jira] [Updated] (NUTCH-1406) metadata-index plugin: conversion to Solr date format

Posted by "Kristof (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kristof  updated NUTCH-1406:
----------------------------

    Attachment: index-metadata_formatted.patch

Formatting done (correct?), spelling error corrected. Regarding the format: you are right that Solr uses the date format yyyy-mm-ddThh:mm:ss.mmmZ. The SimpleDateFormat pattern yyyy-MM-dd converts correctly to yyyy-mm-ddThh:mm:ss.mmmZ, but only for values that contain just a date. I did not consider time because the fields I am looking at only contain dates. The conversion basically adds time information by interpreting the missing time as 00:00:00 and converting it to UTC based on the time zone settings of the machine used in the process. I just tested with some altered files in which I included time information, trying several SimpleDateFormat patterns to find one that works. So far I have not found any. With a pattern that goes beyond yyyy-MM-dd, the original field values that only contain a date are not converted. So it seems this solution is limited to dates.
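
As an illustration of the time zone and pattern issues described above, here is a minimal sketch that tries a date-time pattern first and falls back to the date-only pattern, with both parsers pinned to UTC so the result does not depend on the machine's time zone. This is not the code from the attached patch: the class name, method names, and pattern list are hypothetical, and it assumes that source values without an explicit zone are meant as UTC.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class SolrDateSketch {
    private static final TimeZone UTC = TimeZone.getTimeZone("UTC");

    // Hypothetical pattern list. The most specific pattern comes first,
    // because SimpleDateFormat.parse(String) accepts a matching prefix and
    // "yyyy-MM-dd" alone would silently drop a trailing time part.
    private static final String[] PATTERNS = {
        "yyyy-MM-dd'T'HH:mm:ss",
        "yyyy-MM-dd"
    };

    /** Parses the value with the first pattern that matches, in UTC. */
    static Date parseUtc(String value) {
        for (String pattern : PATTERNS) {
            SimpleDateFormat fmt = new SimpleDateFormat(pattern, Locale.ROOT);
            fmt.setTimeZone(UTC);
            fmt.setLenient(false);
            try {
                return fmt.parse(value.trim());
            } catch (ParseException e) {
                // Fall through and try the next pattern.
            }
        }
        throw new IllegalArgumentException("Unparseable date: " + value);
    }

    /** Formats a Date in the UTC form Solr expects for date fields. */
    static String toSolrDate(Date date) {
        SimpleDateFormat out =
            new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.ROOT);
        out.setTimeZone(UTC);
        return out.format(date);
    }

    public static void main(String[] args) {
        System.out.println(toSolrDate(parseUtc("2012-06-20")));          // 2012-06-20T00:00:00Z
        System.out.println(toSolrDate(parseUtc("2012-06-20T10:15:30"))); // 2012-06-20T10:15:30Z
    }
}
```

A new SimpleDateFormat is created per call because the class is not thread-safe; in an indexing filter that runs in several threads, sharing one instance would need synchronization.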
                
> metadata-index plugin: conversion to Solr date format
> -----------------------------------------------------
>
>                 Key: NUTCH-1406
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1406
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>            Reporter: Kristof 
>            Priority: Minor
>              Labels: conversion, date
>         Attachments: index-metadata_formatted.patch
>
>
> This improvement to the index-metatags plugin (sometimes also referred to as the parse-metatags plugin) allows for conversion of selected fields to the Solr date format. The main benefit of this conversion is the possibility to create range facets.
> In order to convert the values of selected metatags to Solr date format, you must specify them in nutch-site.xml. This can be used, for example, with Dublin Core elements. One site whose pages carry the dcterms.modified meta tag is cic.gc.ca. dcterms.modified must also be defined in the metatags.names and index.parse.md properties.
>  
> {code}
> <property>
> 	<name>index.dateconvert.md</name>
> 	<value>metatag.dcterms.modified</value>
> 	<description>For plugin index-metadata: Indicate here the name of the html meta tag that should be converted to Solr date format.
> 	</description>
> </property>
> {code}


        

[jira] [Commented] (NUTCH-1406) metadata-index plugin: conversion to Solr date format

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399250#comment-13399250 ] 

Julien Nioche commented on NUTCH-1406:
--------------------------------------

BTW, we have formatting rules for Eclipse in the NutchGora branch (see eclipse-codeformat.xml). We could add this to the trunk as well.
                
> metadata-index plugin: conversion to Solr date format
> -----------------------------------------------------
>
>                 Key: NUTCH-1406
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1406
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>            Reporter: Kristof 
>            Priority: Minor
>              Labels: conversion, date
>         Attachments: index-metadata.patch
>
>
> This improvement to the index-metatags plugin (sometimes also referred to as the parse-metatags plugin) allows for conversion of selected fields to the Solr date format. The main benefit of this conversion is the possibility to create range facets.
> In order to convert the values of selected metatags to Solr date format, you must specify them in nutch-site.xml. This can be used, for example, with Dublin Core elements. One site whose pages carry the dcterms.modified meta tag is cic.gc.ca. dcterms.modified must also be defined in the metatags.names and index.parse.md properties.
>  
> {code}
> <property>
> 	<name>index.dateconvert.md</name>
> 	<value>metatag.dcterms.modified</value>
> 	<description>For plugin index-metadata: Indicate here the name of the html meta tag that should be converted to Solr date format.
> 	</description>
> </property>
> {code}


        

[jira] [Commented] (NUTCH-1406) Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags

Posted by "Kristof (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13398367#comment-13398367 ] 

Kristof  commented on NUTCH-1406:
---------------------------------

I found a way to do it, but it involved replacing MetadataIndexer.java completely. The date conversion works. I tested it with the seed URL stated in the description. A patch against trunk is attached.
                
> Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags
> --------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1406
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1406
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>            Reporter: Kristof 
>            Priority: Minor
>              Labels: conversion, date
>         Attachments: index-metadata-plugin.patch, index-metatags.jar
>
>
> This improvement to the index-metatags plugin (sometimes also referred to as the parse-metatags plugin) allows for conversion of selected fields to the Solr date format and prevents parsing/indexing of metatags that do not contain any content.
> In order to convert the values of selected metatags to Solr date format, you must specify them in nutch-site.xml. The example used is an extended Dublin Core element, dcterms.modified, with the seed URL http://www.cic.gc.ca/. dcterms.modified must also be defined in the metatags.names property.
> {code}
> <property>
> 	<name>metatags.convert</name>
> 	<value>dcterms.modified</value>
> 	<description>For plugin index-metadata: Indicate here the name of the html meta tag that should be converted to Solr date format.
> 	</description>
> </property>
> {code}
> I read that SimpleDateFormat is not a robust solution, so this improvement might have some problems.
> So far it has worked well for me. More details about the changes are below.
