Posted to dev@nutch.apache.org by "Goldschmidt, Dave" <dg...@globalspec.com> on 2005/09/26 20:32:12 UTC
API for injecting content into Nutch?
Hello,
Is there an API of some sort for injecting content into Nutch *without*
using Nutch's crawler? Or does anyone have ideas as to how to approach
this problem? I.e. given a URL, a page of content, metadata about the
page, links, etc., how can I inject this into Nutch without Nutch
performing the crawl?
Thanks in advance for your ideas and insights,
DaveG
Re: API for injecting content into Nutch?
Posted by Piotr Kosiorowski <pk...@gmail.com>.
Hi,
I am not sure what you mean by "injecting content into Nutch", but to
create a segment you can use the SegmentWriter class. To update the
WebDB, the IWebDBWriter interface might be useful. The best place to
learn what kind of data is stored in a segment is probably the fetcher
code.
Regards
Piotr
Re: API for injecting content into Nutch?
Posted by Matt Kangas <ka...@gmail.com>.
Dave, you don't want to "inject" anything per se, at least according
to Nutch terminology. Instead, you'll want to create your own synthetic
crawler. Nutch's crawler outputs one "segment file" (directory of
files, actually) per crawler pass. It is this segment that is
processed by the "nutch index" stage.
So, create a program that iterates through your content and writes it
to a segment file, simulating the crawler's output. Just read the
source for Fetcher.java to see how it uses
org.apache.nutch.segment.SegmentWriter and mimic that. Then follow
the rest of the tutorial as if your segment files had fallen out of
the real crawler.
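
The loop described above can be sketched roughly as follows. This is only an illustration of the "synthetic crawler" shape, not Nutch code: Page, SegmentSink, InMemorySink, and writeSegment are hypothetical names made up for this example. In a real program the sink would be replaced by calls to org.apache.nutch.segment.SegmentWriter, with the exact signatures taken from Fetcher.java.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class SyntheticCrawler {

    /** One already-acquired page: URL, body text, metadata, and outlinks. */
    public static final class Page {
        public final String url;
        public final String content;
        public final Map<String, String> metadata;
        public final List<String> outlinks;

        public Page(String url, String content,
                    Map<String, String> metadata, List<String> outlinks) {
            this.url = url;
            this.content = content;
            this.metadata = metadata;
            this.outlinks = outlinks;
        }
    }

    /** Hypothetical stand-in for a segment writer; NOT the Nutch API. */
    public interface SegmentSink {
        void append(Page page);
        void close();
    }

    /** Keeps records in memory so the sketch runs without Nutch installed. */
    public static final class InMemorySink implements SegmentSink {
        public final List<Page> written = new ArrayList<>();
        public boolean closed = false;

        @Override
        public void append(Page page) {
            if (closed) {
                throw new IllegalStateException("sink already closed");
            }
            written.add(page);
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Appends every page to the sink, always closes it, returns the count. */
    public static int writeSegment(Iterable<Page> pages, SegmentSink sink) {
        int count = 0;
        try {
            for (Page page : pages) {
                sink.append(page);
                count++;
            }
        } finally {
            sink.close();
        }
        return count;
    }

    public static void main(String[] args) {
        List<Page> pages = Arrays.asList(
            new Page("http://example.com/a", "First page body",
                     Collections.singletonMap("title", "A"),
                     Arrays.asList("http://example.com/b")),
            new Page("http://example.com/b", "Second page body",
                     Collections.singletonMap("title", "B"),
                     Collections.<String>emptyList()));

        InMemorySink sink = new InMemorySink();
        int n = writeSegment(pages, sink);
        System.out.println("wrote " + n + " pages");
    }
}
```

Keeping the writer behind a small interface like this means the iteration over your own content can be tested in isolation before it is wired to the real segment writer.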
--Matt
--
Matt Kangas / kangas@gmail.com
Re: API for injecting content into Nutch?
Posted by Jon Shoberg <jo...@shoberg.net>.
You may want to open the source of Fetcher.java and look at
handleFetch. You'll see content parsing and how it is written to a
segment. From there you can discern how to use the API and how it fits
your needs.
-j