Posted to user@nutch.apache.org by "Rasheed, Salman" <Ra...@gsicommerce.com> on 2009/02/12 19:46:08 UTC

URL Transformation

I posted a similar question earlier but got no response. I'm running
Nutch 0.9 and crawling an internal site to build the index. However,
the index and search results are accessed externally, so the URLs in
the results need to be transformed. Ideally the transformation would
happen when the index and its artifacts are built.
 
What is the procedure for transforming URLs?
 
Do I write a custom plugin that extends the URL normalizer? If so, what
scope does it need to be associated with?
 
Appreciate your suggestions.
 
Thanks,
Salman

Re: URL Transformation

Posted by KSY <ks...@yahoo.com>.
Implement your own IndexingFilter, and register it via "nutch-site.xml"
configuration file.

http://lucene.apache.org/nutch/apidocs-0.9/index.html

An IndexingFilter is invoked before each Lucene Document is persisted to
the file system, so you have a chance to modify the fields of the
in-memory Lucene Document before it is finally written out.

-- 
View this message in context: http://www.nabble.com/URL-Transformation-tp21982403p22461342.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: URL Transformation

Posted by dmcole <dm...@colegroup.com>.

salmanrs wrote:
> 
> Adding some more detail to what I'm trying to do.
> 
> Crawl site:  http://abcd.com/home/index.jsp
> 
> Replace the domain abcd.com with www.xyz.com, thereby transforming the link
> to http://www.xyz.com/home/index.jsp 
> 
> The url needs to be transformed only for links returned during search.
> 
> 

I'm brand new to Nutch -- three days and counting -- so maybe I don't fully
understand your problem, but maybe you should consider using the OpenSearch
option, which returns XML. You could then use PHP (or another scripting
language) to parse and transform the results before displaying.

You can google "nutch opensearch" for more detail.
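
The post-processing idea above can be sketched roughly as follows. The
suggestion was PHP; Java is used here only because it is the language Nutch
itself is written in. The sample XML, the class name OpenSearchLinkRewriter
and the method rewrittenLinks are all assumptions made for illustration. The
real OpenSearch response from Nutch will have more elements, so treat this as
a starting point rather than a working integration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch: reads an RSS-style result document
// and returns the text of every <link> element with the internal prefix swapped.
final class OpenSearchLinkRewriter {

    static List<String> rewrittenLinks(String xml, String internalPrefix, String externalPrefix) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // Note: this matches every <link>, including a channel-level one if present.
            NodeList links = doc.getElementsByTagName("link");
            List<String> out = new ArrayList<>();
            for (int i = 0; i < links.getLength(); i++) {
                String link = links.item(i).getTextContent().trim();
                if (link.startsWith(internalPrefix)) {
                    link = externalPrefix + link.substring(internalPrefix.length());
                }
                out.add(link);
            }
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse result XML", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample data, shaped like a minimal RSS 2.0 result feed.
        String sample = "<rss version=\"2.0\"><channel>"
                + "<item><title>Home</title>"
                + "<link>http://abcd.com/home/index.jsp</link></item>"
                + "</channel></rss>";
        System.out.println(rewrittenLinks(sample, "http://abcd.com/", "http://www.xyz.com/"));
    }
}
```

Doing the rewrite at display time leaves the index untouched, which may be
convenient if the same index also has to serve internal users.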

\dmc

-- 
View this message in context: http://www.nabble.com/URL-Transformation-tp21982403p22015028.html


Re: URL Transformation

Posted by salmanrs <sa...@hotmail.com>.
Adding some more detail to what I'm trying to do.

Crawl site:  http://abcd.com/home/index.jsp

Replace the domain abcd.com with www.xyz.com, thereby transforming the link to
http://www.xyz.com/home/index.jsp 

The url needs to be transformed only for links returned during search.
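
To make the intended transformation concrete, here is a minimal sketch of the
host swap using only java.net.URI. The class name HostRewriter and the method
rewriteHost are hypothetical. Where this would be called from (an indexing
plugin, or code that post-processes search results) is left open, since that
depends on the Nutch setup.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of Nutch.
final class HostRewriter {

    // Replaces the host when it matches internalHost; scheme, port, path,
    // query and fragment are carried over. Anything else is returned unchanged.
    static String rewriteHost(String url, String internalHost, String externalHost) {
        try {
            URI u = new URI(url);
            if (u.getHost() == null || !u.getHost().equalsIgnoreCase(internalHost)) {
                return url;
            }
            return new URI(u.getScheme(), u.getUserInfo(), externalHost, u.getPort(),
                    u.getPath(), u.getQuery(), u.getFragment()).toString();
        } catch (URISyntaxException e) {
            // Leave URLs that cannot be parsed as they are.
            return url;
        }
    }

    public static void main(String[] args) {
        System.out.println(
                rewriteHost("http://abcd.com/home/index.jsp", "abcd.com", "www.xyz.com"));
        // prints http://www.xyz.com/home/index.jsp
    }
}
```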

-- 
View this message in context: http://www.nabble.com/URL-Transformation-tp21982403p21982907.html