
Posted to user@nutch.apache.org by 丛云牙之主 <ya...@qq.com> on 2011/07/03 17:43:54 UTC

Problems when crawling a .nsf site

Hello, I am using nutch-1.2 and have encountered a problem. The site is written with Lotus Domino. When I open it in a browser and click the links that appear, the site URL does not change, unlike some sites whose URLs have many suffixes. The site is buptoa.bupt.edu.cn/student_broad.nsf, and I want to crawl the .nsf file. But Nutch does not support crawling .nsf files. Should I write my own plugin, or should I approach this problem from another direction?
I would be extremely grateful for your help.

Re: Problems when crawling a .nsf site

Posted by lewis john mcgibbney <le...@gmail.com>.

Absolutely...

There is a short (old) thread on this topic [1]; from what I can see, this
issue has not been addressed. It therefore looks like implementing your own
parser plugin is what's required.

[1] http://www.lucidimagination.com/search/document/a8d53fac1caa578c/nutch_with_nsf_files

2011/7/3 Alexander Aristov <al...@gmail.com>

> Hi
>
> If it is a text file, then you can simply associate the extension with the
> text parser. But if I understand you correctly and it is a Lotus database
> file, then I suspect you have no choice other than implementing your own
> parser. I haven't heard of Lotus file support in Nutch.
>
> Best Regards
> Alexander Aristov

-- 
*Lewis*

Re: Problems when crawling a .nsf site

Posted by Alexander Aristov <al...@gmail.com>.

Hi

If it is a text file, then you can simply associate the extension with the
text parser. But if I understand you correctly and it is a Lotus database
file, then I suspect you have no choice other than implementing your own
parser. I haven't heard of Lotus file support in Nutch.

Best Regards
Alexander Aristov
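
To make the two options above concrete, here is a minimal, self-contained sketch of what "associating an extension with a parser" amounts to. It is only an illustration and does not use the Nutch API: ParserRegistry, register, extensionOf and parse are hypothetical names, and Nutch itself makes this association through its plugin configuration rather than through code like this.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: none of these names come from the Nutch API.
class ParserRegistry {
    // Maps a lowercase file extension (without the dot) to a parse function.
    private final Map<String, Function<byte[], String>> parsers = new HashMap<>();

    // Associates an extension with a parser; the extension is case-insensitive.
    void register(String extension, Function<byte[], String> parser) {
        parsers.put(extension.toLowerCase(Locale.ROOT), parser);
    }

    // Returns the extension of the last path segment of the URL, or "" if none.
    static String extensionOf(String url) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return "";
        }
        String name = path.substring(path.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Parses the content with the parser registered for the URL's extension.
    String parse(String url, byte[] content) {
        Function<byte[], String> parser = parsers.get(extensionOf(url));
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for: " + url);
        }
        return parser.apply(content);
    }
}

class ParserRegistryDemo {
    public static void main(String[] args) {
        ParserRegistry registry = new ParserRegistry();
        // A plain-text "parser" that just decodes the bytes as UTF-8.
        Function<byte[], String> textParser =
                bytes -> new String(bytes, StandardCharsets.UTF_8);
        registry.register("txt", textParser);
        // Option 1: reuse the text parser, which only helps if the content is text.
        registry.register("nsf", textParser);

        byte[] body = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(
                registry.parse("http://buptoa.bupt.edu.cn/student_broad.nsf", body));
        // prints: hello
    }
}
```

If the .nsf content turns out to be a binary Lotus database rather than text, the second option corresponds to registering a different function for "nsf" that understands that format, which is the part that would have to be written as a custom parser.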

