Posted to user@nutch.apache.org by Max S <ma...@googlemail.com> on 2009/08/16 00:39:45 UTC

XML Parser not extracting links

Hello all,

I have installed the XML Parser plugin in Nutch 0.9 and it is working correctly.
Running the plugin from the command line, it displays both parsed text and parsed
data. However, the parser did not manage to extract any outlinks.
The outlinks are extracted from the parsed text using the following code,
which pulls the links out of the text extracted from the XML.

	Outlink[] outlinks = OutlinkExtractor.getOutlinks(text, getConf());

From my test, the parsed text is displayed with a few links, all of them
separated by spaces. I have verified that the String variable text contains
the extracted contents. Since these links are on a different domain, I have
made sure db.ignore.external.links in the Nutch config is set to false.
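
For illustration, here is a minimal self-contained sketch of regex-based link
extraction over space-separated plain text. It is not Nutch's
OutlinkExtractor: the class name PlainTextLinkSketch, the method extractUrls,
and the simplified URL pattern are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlainTextLinkSketch {
    // Simplified, illustrative pattern; NOT the pattern Nutch itself uses.
    private static final Pattern URL =
            Pattern.compile("\\b(?:https?|ftp)://[^\\s<>\"]+");

    /** Returns every URL-looking token found in the text, in order. */
    static List<String> extractUrls(String text) {
        List<String> urls = new ArrayList<>();
        if (text == null) {
            return urls; // nothing to scan
        }
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical parsed text with two space-separated links.
        String text = "Some title http://example.org/a.xml more https://example.com/b";
        for (String url : extractUrls(text)) {
            System.out.println(url);
        }
    }
}
```

Running the same input through a sketch like this is a quick way to confirm
that the text itself contains recognisable links before looking at the
extractor or the configuration.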

I cannot see anything else that would prevent this code from extracting the
links. Does anyone have any idea, or has anyone managed to resolve this issue?

Ta. 


RE: XML Parser not extracting links

Posted by Max S <ma...@googlemail.com>.
After enabling verbose logging, I found that getConf() is returning null (log
below). I believe this is the cause of the issue, since OutlinkExtractor
loops through the whole chunk of text using a regex (see
http://jakarta.apache.org/oro/api/org/apache/oro/text/regex/PatternMatcherInput.html).

I am not entirely sure what is wrong with the configuration, though.
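
To make the suspected failure mode concrete, here is a small self-contained
sketch of a configurable object whose getConf() returns null until setConf()
has been called. Everything in it is a stand-in: ConfSketch and SketchParser
are hypothetical names, and java.util.Properties plays the role of the real
configuration class. It only illustrates the pattern, not the plugin's code.

```java
import java.util.Properties;

class ConfSketch {
    /** Hypothetical stand-in for a parser that is handed its config by a framework. */
    static class SketchParser {
        private Properties conf; // stays null until setConf() is called

        void setConf(Properties conf) {
            this.conf = conf;
        }

        Properties getConf() {
            return conf;
        }

        /** Fails fast with a clear message instead of an NPE deep inside a callee. */
        String describe() {
            if (getConf() == null) {
                throw new IllegalStateException("setConf() was never called");
            }
            return getConf().getProperty("db.ignore.external.links", "unset");
        }
    }

    public static void main(String[] args) {
        SketchParser parser = new SketchParser();
        Properties conf = new Properties();
        conf.setProperty("db.ignore.external.links", "false");
        // When run standalone, nothing else injects the config,
        // so main() has to do it before the parser is used.
        parser.setConf(conf);
        System.out.println(parser.describe());
    }
}
```

If the plugin's main() builds the parser itself, the analogous check would be
whether the parser is ever given a configuration before getParse() runs; that
is an assumption about the plugin, not something the log confirms.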



2009-08-23 01:31:29,145 DEBUG conf.Configuration - java.io.IOException: config()
	at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:93)
	at org.apache.nutch.util.NutchConfiguration.create(NutchConfiguration.java:51)
	at org.apache.nutch.parse.xml.XMLParser.main(XMLParser.java:391)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.nutch.plugin.PluginRepository.main(PluginRepository.java:417)

2009-08-23 01:31:29,322 DEBUG conf.Configuration - java.io.IOException: config()
	at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:93)
	at org.apache.nutch.util.NutchConfiguration.create(NutchConfiguration.java:51)
	at org.apache.nutch.parse.xml.config.XMLParserConfig.getInstance(XMLParserConfig.java:57)
	at org.apache.nutch.parse.xml.XMLParser.getParse(XMLParser.java:88)
	at org.apache.nutch.parse.xml.XMLParser.main(XMLParser.java:391)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.nutch.plugin.PluginRepository.main(PluginRepository.java:417)

2009-08-23 01:31:29,323 INFO  conf.Configuration - found resource xmlparser-conf.xml at file:/usr/local/nutch/conf/xmlparser-conf.xml
2009-08-23 01:31:29,323 DEBUG conf.Configuration - java.io.IOException: config()
	at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:93)
	at org.apache.nutch.util.NutchConfiguration.create(NutchConfiguration.java:51)
	at org.apache.nutch.parse.xml.config.XMLParserConfig.getInstance(XMLParserConfig.java:60)
	at org.apache.nutch.parse.xml.XMLParser.getParse(XMLParser.java:88)
	at org.apache.nutch.parse.xml.XMLParser.main(XMLParser.java:391)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.nutch.plugin.PluginRepository.main(PluginRepository.java:417)

2009-08-23 01:31:29,335 INFO  parse.xml - XMLParser config path : null
2009-08-23 01:31:29,439 ERROR parse.OutlinkExtractor - getOutlinks
java.lang.NullPointerException
	at org.apache.nutch.plugin.PluginRepository.<init>(PluginRepository.java:68)
	at org.apache.nutch.plugin.PluginRepository.get(PluginRepository.java:95)
	at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:116)
	at org.apache.nutch.parse.Outlink.<init>(Outlink.java:36)
	at org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:134)
	at org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:72)
	at org.apache.nutch.parse.xml.XMLParser.getParse(XMLParser.java:110)
	at org.apache.nutch.parse.xml.XMLParser.main(XMLParser.java:391)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.nutch.plugin.PluginRepository.main(PluginRepository.java:417)

 
