Posted to user@nutch.apache.org by hank williams <ha...@gmail.com> on 2007/11/08 10:42:41 UTC

noob wants to know: joining with a relational database result, is it possible?

I want to do searches within a constrained set of URLs where the
constrained set of URLs is determined by a MySQL result.

For example, let's say we have a program that maintains a MySQL
database of URLs, each with a "name" field. I want to find all the
URLs that have "foo" in the name field in the database and also have
"bar" in the text of the web page.

Is there any way to tell Nutch, "Hey! I don't want *all* the results
that have the word 'bar', just the ones that are within this set of
URLs that I am giving you"?

In this case, perhaps you would feed Nutch a list of URLs to
constrain the search. Or perhaps there is some other way. I am not
necessarily suggesting the best way here; I don't have a strategy yet.
I am just wondering if there is a way to marry the worlds of
structured database search and Nutch/web search in a type of "cross
database-type" join.
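
To make the question concrete, here is a minimal sketch of the kind of
join described above, done as a post-filter. Everything in it is an
assumption for illustration: an in-memory SQLite table named "pages"
stands in for the MySQL database, and a hard-coded list of hits stands
in for whatever Nutch returns for 'bar'. It uses no Nutch API.

```python
import sqlite3


def urls_matching_name(conn, needle):
    """Return the set of URLs whose name field contains `needle`."""
    rows = conn.execute(
        "SELECT url FROM pages WHERE name LIKE ?", (f"%{needle}%",)
    )
    return {url for (url,) in rows}


def constrain_hits(hits, allowed_urls):
    """Keep only hits whose URL is in the allowed set, preserving rank order."""
    return [hit for hit in hits if hit["url"] in allowed_urls]


# Hypothetical table standing in for the MySQL database of URLs and names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?)",
    [
        ("http://example.com/a", "foo widgets"),
        ("http://example.com/b", "other"),
        ("http://example.com/c", "more foo"),
    ],
)

# Hypothetical full-text hits for 'bar', already ranked by the search engine.
hits_for_bar = [
    {"url": "http://example.com/b", "score": 0.9},
    {"url": "http://example.com/a", "score": 0.7},
    {"url": "http://example.com/d", "score": 0.4},
]

allowed = urls_matching_name(conn, "foo")
joined = constrain_hits(hits_for_bar, allowed)
print(joined)
```

Filtering after the search is the simplest form of the join, but it
means the search side still has to produce every hit for 'bar' first.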

Hank

Re: noob wants to know: joining with a relational database result, is it possible?

Posted by Sebastian Steinmetz <s....@mederi-research.de>.
Hey,

that's exactly the scenario that I'm looking for. I thought I
could solve this with a unique ID that I add to the CrawlDB and
then use to filter the results returned from Nutch. (I've
already implemented the ID part; I just need to add the filtering by ID.)
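
As a rough sketch of the filtering step that is still missing, assuming
each hit carries the unique ID in a field (called "id" here, a made-up
name) and that the allowed IDs come from the database query. The hit
list is invented sample data, not Nutch output.

```python
from itertools import islice


def filter_by_id(hits, allowed_ids, limit):
    """Return the first `limit` hits whose ID is in `allowed_ids`.

    The hits are scanned in rank order and scanning stops as soon as
    enough matching hits have been collected.
    """
    matching = (hit for hit in hits if hit["id"] in allowed_ids)
    return list(islice(matching, limit))


# Hypothetical IDs selected by the relational query.
allowed_ids = {101, 103, 104}

# Hypothetical ranked hits from the search side.
ranked_hits = [
    {"id": 100, "url": "http://example.com/p0"},
    {"id": 101, "url": "http://example.com/p1"},
    {"id": 102, "url": "http://example.com/p2"},
    {"id": 103, "url": "http://example.com/p3"},
    {"id": 104, "url": "http://example.com/p4"},
]

top = filter_by_id(ranked_hits, allowed_ids, limit=2)
print([hit["id"] for hit in top])
```

Keeping the allowed IDs in a set makes each membership check cheap, and
stopping at the limit avoids walking the whole result list when only
the first page is needed.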

Nonetheless, good advice and opinions on how to achieve such a
scenario are very welcome. Maybe there is some more performance to be
squeezed out of it.
But I think it's definitely possible somehow ;)

Thank you very much,
	Sebastian Steinmetz




> I want to do searches within a constrained set of URLs where the
> constrained set of URLs is determined by a MySQL result.
>
> For example, lets say we have a program that is maintaining a MySQL
> database of URLs that also have a "name" field. So I want to search in
> the database for all the URLs that have "foo" in the name field that
> have "bar" in the text of the web page.
>
> Is there any way to tell Nutch "hey! i don't want *all* the results
> that have the word 'bar', but just ones that are within this set of
> URLs that I am giving you."
>
> In this circumstance perhaps you are feeding Nutch a list of URLs to
> constrain it.  Or perhaps there is some other way. I am not
> necessarily suggesting the best way here, I don't have a strategy yet.
> I am just wondering if there is a way to marry the worlds of
> structured database search and Nutch/web search in a type of "cross
> database-type" join.
>
> Hank