Posted to user@nutch.apache.org by Anarus <as...@gmail.com> on 2007/11/03 10:13:45 UTC

Is there any plugin for data extraction using XPath, XQuery or regex for Nutch?

Problem:
I want to develop a web service that extracts certain data from one site
and different data from another site, and stores the results in the
index.

Case study: Let's say I am developing a real estate search site. I know all
the seed URLs and need to extract data only from those URLs, in a
pre-defined way. For every seed URL's website I will extract known fields
such as location, price, zip, bedrooms and description, and add these
fields to the index. The extraction rules will differ from site to site,
so I need a separate set of XPath, XQuery or regex expressions for each
site. In short, I want something like Web-Harvest
(http://web-harvest.sourceforge.net) integrated into Nutch. Can anyone
suggest such a plugin, or another way to do this?
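
A rough sketch of the kind of per-site rule table described above, using
only the JDK's built-in XPath and regex support. Everything specific in it
is an assumption made for illustration: the host name site-a.example, the
class names in the markup, the XPath expressions, and the five-digit zip
pattern. It also assumes the page is already well-formed XML; real-world
HTML would first need to go through a tolerant HTML parser. This is plain
Java, not a Nutch plugin, and how it would be wired into Nutch is left open.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class SiteFieldExtractor {

    // Hypothetical rule table: host -> (field name -> XPath expression).
    static final Map<String, Map<String, String>> XPATH_RULES = new LinkedHashMap<>();

    // Assumed zip format: a standalone run of five digits.
    static final Pattern ZIP = Pattern.compile("\\b(\\d{5})\\b");

    static {
        // Rules for one made-up site; each real site would get its own entry.
        Map<String, String> siteA = new LinkedHashMap<>();
        siteA.put("location", "//div[@class='listing']/span[@class='loc']");
        siteA.put("price", "//div[@class='listing']/span[@class='price']");
        siteA.put("bedrooms", "//div[@class='listing']/span[@class='beds']");
        XPATH_RULES.put("site-a.example", siteA);
    }

    // Applies the rules registered for `host` to a well-formed XHTML page.
    // Returns an empty map when no rules are registered for the host.
    static Map<String, String> extract(String host, String xhtml) throws Exception {
        Map<String, String> rules =
                XPATH_RULES.getOrDefault(host, Collections.<String, String>emptyMap());
        Map<String, String> fields = new LinkedHashMap<>();
        if (rules.isEmpty()) {
            return fields;
        }

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // XPath step: the string value of the first matching node per field.
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            fields.put(rule.getKey(), xpath.evaluate(rule.getValue(), doc).trim());
        }

        // Regex step: pull the zip out of the free-text location, if present.
        Matcher m = ZIP.matcher(fields.getOrDefault("location", ""));
        if (m.find()) {
            fields.put("zip", m.group(1));
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><div class='listing'>"
                + "<span class='loc'>Austin, TX 78701</span>"
                + "<span class='price'>$350,000</span>"
                + "<span class='beds'>3</span>"
                + "</div></body></html>";
        System.out.println(extract("site-a.example", page));
    }
}
```

The resulting field map is what would then be added to the index for each
page; supporting another site would mean adding one more entry to the rule
table rather than writing new extraction code.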

Thanks
Anarus