Posted to dev@nutch.apache.org by "Goldschmidt, Dave" <dg...@globalspec.com> on 2005/09/26 20:32:12 UTC

API for injecting content into Nutch?

Hello,

Is there an API of some sort for injecting content into Nutch *without*
using Nutch's crawler?  Or does anyone have ideas as to how to approach
this problem?  I.e., given a URL, a page of content, metadata about the
page, links, etc., how can I inject this into Nutch without Nutch
performing the crawl?

Thanks in advance for your ideas and insights,

DaveG

Re: API for injecting content into Nutch?

Posted by Piotr Kosiorowski <pk...@gmail.com>.
Hi,
I am not sure what you mean by "injecting content into Nutch", but to
create a segment you can use the SegmentWriter class. To update the WebDB,
the IWebDBWriter interface might be useful. The best place to learn
what kind of data is stored in a segment is probably the fetcher code.
Regards
Piotr


Re: API for injecting content into Nutch?

Posted by Matt Kangas <ka...@gmail.com>.
Dave, you don't want to "inject" anything per se, at least according
to Nutch terminology. Instead, you'll want to create your own synthetic
crawler. Nutch's crawler outputs one "segment file" (actually a directory
of files) per crawler pass. It is this segment that is
processed by the "nutch index" stage.

So, create a program that iterates through your content and writes it  
to a segment file, simulating the crawler's output. Just read the  
source for Fetcher.java to see how it uses  
org.apache.nutch.segment.SegmentWriter and mimic that. Then follow  
the rest of the tutorial as if your segment files had fallen out of  
the real crawler.
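
A minimal sketch of that loop in Java, under stated assumptions. The names
Page, SegmentSink, InMemorySink and writeSegment are hypothetical and are
not part of Nutch; the sink is only a stand-in so the example runs without
Nutch on the classpath. A real version would presumably replace it with
calls into org.apache.nutch.segment.SegmentWriter, copied from the way
Fetcher.java uses it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class SyntheticCrawler {

    /** One document already in hand: URL, raw content, metadata, outlinks. */
    static final class Page {
        final String url;
        final String content;
        final Map<String, String> metadata;
        final List<String> outlinks;

        Page(String url, String content,
             Map<String, String> metadata, List<String> outlinks) {
            this.url = url;
            this.content = content;
            this.metadata = metadata;
            this.outlinks = outlinks;
        }
    }

    /** Hypothetical stand-in for a segment writer; NOT the Nutch API. */
    interface SegmentSink {
        void append(Page page);
        void close();
    }

    /** In-memory sink so the sketch runs without Nutch on the classpath. */
    static final class InMemorySink implements SegmentSink {
        final List<Page> records = new ArrayList<Page>();
        boolean closed = false;

        public void append(Page page) {
            if (closed) {
                throw new IllegalStateException("sink already closed");
            }
            records.add(page);
        }

        public void close() {
            closed = true;
        }
    }

    /** Walks the content and appends one record per page, as a fetcher pass would. */
    static int writeSegment(Iterable<Page> pages, SegmentSink sink) {
        int written = 0;
        try {
            for (Page page : pages) {
                sink.append(page);
                written++;
            }
        } finally {
            sink.close(); // always finish the segment, even on failure
        }
        return written;
    }

    public static void main(String[] args) {
        List<Page> pages = Arrays.asList(
            new Page("http://example.com/a", "<html>A</html>",
                     Collections.singletonMap("Content-Type", "text/html"),
                     Arrays.asList("http://example.com/b")),
            new Page("http://example.com/b", "<html>B</html>",
                     Collections.singletonMap("Content-Type", "text/html"),
                     Collections.<String>emptyList()));

        InMemorySink sink = new InMemorySink();
        int written = writeSegment(pages, sink);
        System.out.println("wrote " + written + " records, closed=" + sink.closed);
    }
}
```

Keeping the writer behind a small interface means the iteration over your
own content can be tested on its own, and only the sink has to change once
the real segment-writing calls are worked out from the Fetcher source.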

--Matt

--
Matt Kangas / kangas@gmail.com



Re: API for injecting content into Nutch?

Posted by Jon Shoberg <jo...@shoberg.net>.
You may want to open the source of Fetcher.java and look at
handleFetch.  You'll see how content is parsed and how it is written to a
segment.  From there you can discern how to use the API and how it fits
your needs.

-j