Posted to user@nutch.apache.org by Jason Manfield <ra...@yahoo.com> on 2005/05/02 21:31:46 UTC

using nutch just for crawling, not indexing?

We would like to use Nutch just for crawling, and then index the crawled data into our proprietary datastore/index. How do we go about this? I see that nutch is driven by a shell script, so it should be possible to just crawl. Once it crawls, I suppose the crawled data is dumped into the webdb. Are there exposed APIs to extract the data from the webdb?
 
One more catch -- our company is a .NET shop :((, so we would like to use C# to read the data of the fetched/crawled pages for further indexing.
 
Ideas/suggestions?
 
Any plans to have nutch for .NET (like dotLucene)?

__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

Re: [Nutch-general] using nutch just for crawling, not indexing?

Posted by EM <em...@cpuedge.com>.
I don't recall the exact command, but you can use the 'inject' command
to inject a URL into the webdb as a starting point.
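From memory, the sequence looks roughly like the sketch below (tool names and flags shift between Nutch 0.x releases, so run bin/nutch with no arguments to see what your copy actually supports):

```shell
# Illustrative whole-web flow seeded from a URL file instead of DMOZ;
# flags are from memory and may differ in your Nutch version.
echo "http://example.com/" > seeds.txt

bin/nutch admin db -create               # create an empty webdb
bin/nutch inject db -urlfile seeds.txt   # inject the seed URLs as starting points
bin/nutch generate db segments           # generate a fetchlist segment
s=`ls -d segments/2* | tail -1`          # newest segment directory
bin/nutch fetch $s                       # fetch the pages
bin/nutch updatedb db $s                 # fold newly discovered links into the webdb
```

There's also the one-shot `bin/nutch crawl` tool, which wraps the same steps for intranet-style crawls.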

Zhou LiBing wrote:

>hi
>  I have a question about the Nutch crawler: how can I crawl the web
>starting from one or several specified URLs? I don't want to use the
>DMOZ data.
>

Re: [Nutch-general] using nutch just for crawling, not indexing?

Posted by Zhou LiBing <zh...@gmail.com>.
hi
  I have a question about the Nutch crawler: how can I crawl the web
starting from one or several specified URLs? I don't want to use the
DMOZ data.




-- 
---Letter From your friend Blue at HUST CGCL---

Re: [Nutch-general] using nutch just for crawling, not indexing?

Posted by Jeff Bowden <jl...@houseofdistraction.com>.
Jason Manfield wrote:

>One more catch -- our company is a .NET shop :((, so we would like to use C# to read the data of the fetched/crawled pages for further indexing.
>

http://www.ikvm.net/
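To expand on that: ikvmc can compile Java jars into a .NET assembly that C# code references like any other DLL. Something like the line below (the jar names are illustrative; use the ones from your Nutch distribution):

```shell
# Compile the Nutch and Lucene jars into one .NET assembly (jar names illustrative).
ikvmc -target:library -out:nutch.dll nutch.jar lucene.jar
```

The resulting nutch.dll (plus IKVM's runtime assemblies) can then be referenced from a C# project.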


RE: using nutch just for crawling, not indexing?

Posted by Chirag Chaman <de...@filangy.com>.
Jason,

The data is technically not stored in the WebDB -- only the links (and
related information) are.
The pages (content) are stored within the segments.

Nutch is nothing more than a search-engine-specific implementation of
Lucene (with added hooks). It uses Lucene for indexing out of the box. Thus
you could in theory run the entire Nutch process and then do your search
using dotLucene on the segments created by Nutch (repeat: "in theory" --
you'll need to do some hacking to make it work). Make sure the index file
versions are the same/compatible.

Look at the WebDB API, and you'll get a better picture of what's available.
You should also look at the SegmentReader, as this is what you'll need to
read the actual content of each page.
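One low-friction way across the Java/.NET boundary: write a small Java driver around SegmentReader that dumps each page as plain text, then parse the dump from your .NET side. A sketch of the parsing side (in Python for brevity; the "Key:: value" record layout is purely illustrative -- the format is whatever your driver writes):

```python
def parse_segment_dump(text):
    """Parse a plain-text dump of segment records into dicts.

    Assumes one "Key:: value" field per line and a blank line between
    records -- an illustrative format, not a fixed Nutch one.
    """
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # blank line ends the current record
            if current:
                records.append(current)
                current = {}
        elif "::" in line:
            key, _, value = line.partition("::")
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


# Hypothetical output of a small Java driver built on SegmentReader.
sample = """\
Recno:: 0
URL:: http://example.com/
ContentType:: text/html

Recno:: 1
URL:: http://example.org/index.html
ContentType:: text/html
"""

for page in parse_segment_dump(sample):
    print(page["URL"])
```

The same record-splitting logic ports directly to C#; the point is that a flat text dump keeps your indexer decoupled from the Java classes.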

CC-

--------------------------------------------
Filangy, Inc.
Interested in Improving Search? Join our Team!
http://filangy.com/jointheteam.jsp 
