Posted to dev@nutch.apache.org by HellSpawn <r....@gmail.com> on 2006/05/24 15:19:54 UTC
Extract infos from documents and query external sites
Hi all, I'm new :)
I need to extract some information from an address book on my site
(for example, names and surnames) and then use it to build queries on sites like
scholar.google.com, indexing the result pages with my crawler. Can I do this?
How?
Thank you
Rosario Salatiello
--
View this message in context: http://www.nabble.com/Extract+infos+from+documents+and+query+external+sites-t1675003.html#a4541042
Sent from the Nutch - Dev forum at Nabble.com.
Re: Extract infos from documents and query external sites
Posted by Stefan Groschupf <sg...@media-style.com>.
Think about using the Google API.
However, one way to go could be:
+ fetch your pages
+ do not parse the pages
+ write a map-reduce job that extracts your data
++ build an XHTML DOM from the HTML, e.g. using NekoHTML
++ use XPath queries to extract your data
++ also check out GATE as a named-entity extraction tool, to extract
names based on patterns and heuristics
++ write the names to a file
+ build your query URLs
+ inject the query URLs into an empty crawl db
+ create a segment, fetch it, and update a second empty crawl
database against it
+ remove the first segment and db
+ create a segment from your second db and fetch it.
Your second segment will then contain only the paper pages.
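The XPath extraction step in the list above can be sketched as follows. This is a minimal example using only the JDK's built-in DOM and XPath support; it assumes the HTML has already been turned into well-formed XHTML (e.g. by NekoHTML, as suggested above). The markup and the class name "name" are hypothetical — a real address-book page will need its own XPath expression.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NameExtractor {

    // Collect the text of every element matched by the given XPath
    // expression in a well-formed (X)HTML document.
    public static List<String> extractNames(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        // Hypothetical pattern: names are wrapped in <span class="name">.
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//span[@class='name']", doc, XPathConstants.NODESET);
        List<String> names = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            names.add(nodes.item(i).getTextContent().trim());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<span class='name'>Rosario Salatiello</span>"
                + "<span class='name'>Stefan Groschupf</span>"
                + "</body></html>";
        // Prints: [Rosario Salatiello, Stefan Groschupf]
        System.out.println(extractNames(page));
    }
}
```

The resulting list of names would then be written to a file and turned into query URLs, as described in the remaining steps.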
HTH
Stefan
On 30.05.2006, at 12:14, HellSpawn wrote:
>
> I'm working on a search engine for my university, and they want me
> to do this to create a repository of scientific articles on the web :D
>
> I read something about XPath for extracting exact parts from a
> document; once that is done, building the query is very easy, but my
> doubts are about how to insert all of this into the Nutch crawler...
>
> Thank you
Re: Extract infos from documents and query external sites
Posted by HellSpawn <r....@gmail.com>.
I'm working on a search engine for my university, and they want me to do this
to create a repository of scientific articles on the web :D
I read something about XPath for extracting exact parts from a document; once
that is done, building the query is very easy, but my doubts are about how to
insert all of this into the Nutch crawler...
Thank you
Re: Extract infos from documents and query external sites
Posted by Stefan Neufeind <ap...@stefan-neufeind.de>.
HellSpawn wrote:
> Hi all, I'm new :)
>
> I need to extract some information from an address book on my site
> (for example, names and surnames) and then use it to build queries on sites like
> scholar.google.com, indexing the result pages with my crawler. Can I do this?
> How?
Not "out of the box". You'd have to figure out how to build query strings (I
assume they use GET parameters) from your address book, and you could
then "index" those URLs.
For me, the question remains why you'd want to do that, but you
could :-)
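Building such a query string is straightforward with the JDK's URL encoding. A minimal sketch, assuming Scholar-style GET parameters (the base URL and the `q` parameter name are assumptions to be checked against the target site):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class QueryUrlBuilder {

    // Turn a person's name into a quoted-phrase GET query URL.
    // The parameter name "q" is an assumption about the target site.
    public static String buildQueryUrl(String base, String name) {
        String phrase = "\"" + name + "\"";  // quote the full name as a phrase
        return base + "?q=" + URLEncoder.encode(phrase, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Prints: https://scholar.google.com/scholar?q=%22Rosario+Salatiello%22
        System.out.println(
            buildQueryUrl("https://scholar.google.com/scholar", "Rosario Salatiello"));
    }
}
```

The resulting URLs are exactly what would be injected into the empty crawl db for fetching.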
Stefan