Posted to dev@nutch.apache.org by HellSpawn <r....@gmail.com> on 2006/05/24 15:19:54 UTC

Extract infos from documents and query external sites

Hi all, I'm new :)

I have to extract some information from an address book on my site
(for example, names and surnames) and then use it to build queries on sites like
scholar.google.com, indexing the result pages with my crawler. Can I do it?
How?

Thank you

Rosario Salatiello
--
View this message in context: http://www.nabble.com/Extract+infos+from+documents+and+query+external+sites-t1675003.html#a4541042
Sent from the Nutch - Dev forum at Nabble.com.


Re: Extract infos from documents and query external sites

Posted by Stefan Groschupf <sg...@media-style.com>.
Think about using the Google API.

However, the way to go could be:

+ fetch your pages
+ do not parse the pages
+ write a map-reduce job that extracts your data
++ build an XHTML DOM from the HTML, e.g. using NekoHTML
++ use XPath queries to extract your data
++ also check out GATE as a named-entity extraction tool, to extract
names based on patterns and heuristics
++ write the names to a file

+ build your query URLs
+ inject the query URLs into an empty crawl db
+ create a segment, fetch it, and update a second
empty crawl database from it
+ remove the first segment and db
+ create a segment from your second db and fetch it.
Your second segment will then contain only the paper pages.
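The XPath extraction step could look roughly like the sketch below. It is plain
stdlib Java, so it assumes the HTML has already been cleaned into well-formed
XHTML (that is the job NekoHTML would do above), and it assumes a hypothetical
markup where names sit in <td class="name"> cells -- adjust the XPath to your
real address-book pages:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NameExtractor {

    // Pull the text of every <td class="name"> cell out of well-formed
    // XHTML. In the pipeline above, NekoHTML would first clean raw HTML
    // into XHTML (note: NekoHTML upper-cases element names by default,
    // so against a Neko DOM the XPath would be //TD instead of //td).
    public static List<String> extractNames(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList cells = (NodeList) xpath.evaluate(
                    "//td[@class='name']", doc, XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < cells.getLength(); i++) {
                names.add(cells.item(i).getTextContent().trim());
            }
            return names;
        } catch (Exception e) {
            throw new RuntimeException("name extraction failed", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><table><tr>"
                + "<td class='name'>Rosario Salatiello</td>"
                + "</tr></table></body></html>";
        System.out.println(extractNames(page)); // [Rosario Salatiello]
    }
}
```

In the map-reduce job, extractNames would run in the map step over each fetched
page, and the reduce step would write the collected names to the file.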

HTH
Stefan




On 30.05.2006, at 12:14, HellSpawn wrote:

>
> I'm working on a search engine for my university, and they want me
> to do that
> to create a repository of scientific articles on the web :D
>
> I read something about using XPath to extract exact parts from a
> document; once
> that is done, building the query is very easy, but my doubts are about
> how to
> insert all of this into the Nutch crawler...
>
> Thank you
> --
> View this message in context: http://www.nabble.com/Extract+infos+from+documents+and+query+external+sites-t1675003.html#a4624272
> Sent from the Nutch - Dev forum at Nabble.com.
>
>


Re: Extract infos from documents and query external sites

Posted by HellSpawn <r....@gmail.com>.
I'm working on a search engine for my university, and they want me to do that
to create a repository of scientific articles on the web :D

I read something about using XPath to extract exact parts from a document; once
that is done, building the query is very easy, but my doubts are about how to
insert all of this into the Nutch crawler...

Thank you
--
View this message in context: http://www.nabble.com/Extract+infos+from+documents+and+query+external+sites-t1675003.html#a4624272
Sent from the Nutch - Dev forum at Nabble.com.


Re: Extract infos from documents and query external sites

Posted by Stefan Neufeind <ap...@stefan-neufeind.de>.
HellSpawn wrote:
> Hi all, I'm new :)
> 
> I have to extract some information from an address book on my site
> (for example, names and surnames) and then use it to build queries on sites like
> scholar.google.com, indexing the result pages with my crawler. Can I do it?
> How?

Not "out of the box". You'd have to figure out how to build query strings (I
assume the sites use GET parameters) from your address book, and you could
then "index" those URLs.
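Building those query strings could look something like this sketch; the
"/scholar?q=" GET parameter is an assumption about how scholar.google.com
accepts queries, so verify the site's actual query format before injecting
the URLs into a crawl db:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class QueryUrlBuilder {

    // Build a GET query URL for one address-book entry. The "/scholar?q="
    // endpoint and parameter name are assumptions about scholar.google.com;
    // check the real query format first. The name is quoted as a phrase
    // and URL-encoded so spaces and quotes survive in the URL.
    public static String buildQueryUrl(String name, String surname) {
        String phrase = "\"" + name + " " + surname + "\"";
        return "http://scholar.google.com/scholar?q="
                + URLEncoder.encode(phrase, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(buildQueryUrl("Rosario", "Salatiello"));
        // http://scholar.google.com/scholar?q=%22Rosario+Salatiello%22
    }
}
```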

For me, though, the question remains why you'd want to do that - but you
could :-)

  Stefan