
Posted to solr-user@lucene.apache.org by Luis Cappa Banda <lu...@gmail.com> on 2011/10/18 08:30:25 UTC

Solr scraping: Nutch and other alternatives.

Hello everyone.

I've been thinking about a way to retrieve information from a domain (for
example, http://www.ign.com) in order to process and index it. My idea is to
use Solr as the search engine. I'm familiar with Apache Nutch and I know that
the latest version has a gateway to Solr to retrieve and index information
with it. I tried it and it worked fine, but it's a little bit complex to
develop plugins that process the information and index it in a new custom
field. Perhaps one of you has tried another (and better) alternative for
mining web data. What would you recommend? Can you give me any scraping
suggestions?

Thank you very much.

Luis Cappa.

Re: Solr scraping: Nutch and other alternatives.

Posted by Markus Jelsma <ma...@openindex.io>.

I'm a bit biased, but I would certainly use Nutch, as it seems to be the right
tool for the job. Developing custom plugins is actually easier than you might
think.

Solr, with its extracting request handler, can only help in a very limited
way.

Re: Solr scraping: Nutch and other alternatives.

Posted by Igor MILOVANOVIC <pl...@gmail.com>.

Try this if you haven't used Python before:
http://gun.io/blog/python-for-the-web/

Keep in mind that scraping "some very well-known search engine" is usually
not in line with its ToS, so sooner or later they will block you, at the
very least.

Be gentle and polite, and you might even make it work... ;)
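
As an illustration of what "gentle and polite" can mean in practice, here is a
minimal Python sketch that honours robots.txt rules and a crawl delay using
the standard library's urllib.robotparser. The robots.txt content, the
"my-bot" user agent and the example.com URLs are all made up for the example;
a real crawler would download the site's own robots.txt and sleep for the
returned delay between requests.

```python
import urllib.robotparser

# Hypothetical robots.txt content; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_robot_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, user_agent, urls):
    """Keep only the URLs that the robots.txt rules allow for this user agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def crawl_delay(parser, user_agent, default=1.0):
    """Return the requested delay in seconds, or a default if none is declared."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


if __name__ == "__main__":
    rules = build_robot_rules(ROBOTS_TXT)
    candidates = [
        "http://www.example.com/news/1",
        "http://www.example.com/private/x",
    ]
    print(allowed_urls(rules, "my-bot", candidates))
    print(crawl_delay(rules, "my-bot"))
```

Note that this only covers robots.txt; a site's terms of service may still
forbid automated access.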

On Wed, Oct 19, 2011 at 2:08 PM, Luis Cappa Banda <lu...@gmail.com> wrote:

> Do you know any tutorial or book that can teach me the
> first steps?
>

-- 
Igor Milovanović
http://about.me/igor.milovanovic
http://umotvorine.com/

Re: Solr scraping: Nutch and other alternatives.

Posted by Luis Cappa Banda <lu...@gmail.com>.

Hello Marco, Markus and Óscar.

Thank you very much for your answers. What you suggest, Óscar, sounds very
interesting. I mean the option of mining data through a 'popular search
engine'. Do you know of any tutorial or book that can teach me the first
steps?

Bye!

Re: Solr scraping: Nutch and other alternatives.

Posted by Óscar Marín Miró <os...@gmail.com>.

Hi Luis, just an opinion (worked with Nutch intensively, 2005-2008).
Web crawling is a bitch, and Nutch won't make it any easier.

Some problems you'll find along the way:

   1. Spidering tunnels/traps
   2. Duplicate and near-duplicate content removal
   3. GET parameter explosion in dynamic pages
   4. Compromises between breadth and depth of crawl (you only have so
   much time, and every site has its unique link geometry)

Nutch has its own set of tools (urlfilters, depth control...) to cope with
each problem, but sometimes you solve, say, problem 3, and problem 4 comes
back again.
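
To make problems 2 and 3 more concrete, here is a rough Python sketch (not
how Nutch does it) that canonicalizes URLs to limit GET parameter explosion
and fingerprints page text to skip duplicates. The list of ignored parameters
is a hypothetical example, and the fingerprint only catches pages that are
identical after case and whitespace normalization; real near-duplicate
detection needs something stronger, such as shingling.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters assumed not to change the page content.
IGNORED_PARAMS = {"sessionid", "utm_source", "utm_medium", "sort"}


def canonical_url(url):
    """Lower-case scheme and host, drop ignored parameters, sort the rest."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/",
         urlencode(kept), "")
    )


def content_fingerprint(text):
    """Hash the text after collapsing whitespace and case differences."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class SeenFilter:
    """Remembers canonical URLs and content fingerprints already processed."""

    def __init__(self):
        self.urls = set()
        self.fingerprints = set()

    def is_new_url(self, url):
        key = canonical_url(url)
        if key in self.urls:
            return False
        self.urls.add(key)
        return True

    def is_new_content(self, text):
        key = content_fingerprint(text)
        if key in self.fingerprints:
            return False
        self.fingerprints.add(key)
        return True
```

A crawler loop would call is_new_url before fetching a link and
is_new_content before indexing the fetched page.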

My advice would be to use "some popular search engines" as a way to mine the
web (you can always ask for all the pages indexed in a domain). They have
already done this job, and done it nicely. In fact, due to their ranking
algorithms (based on link geometry), a 'popular' page will always be indexed,
and to me, that's a good circumstance (i.e. you can always claim that with
your own web crawler you've covered more URLs for a specific site, but what's
the value if the extra URLs are *not that important*?)

If I'm absolutely forced to crawl a site, I use plain old 'curl' or 'wget'.
Open source, tunable via a vast array of parameters and 'black boxes'. I do
not see any justification for deploying 'the nutch monster' just to crawl
some portion of the web already crawled by "popular search engines".

On the 'scraping' / xhtml mining front, the 'mechanize' library (python,
perl, ruby, whatever flavour) and 'Beautiful Soup' for python have always fed
my hunger for web scraping.
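
For a feel of what this kind of scraping looks like, here is a small Python
sketch. To stay dependency-free it uses the standard library's html.parser as
a stand-in for Beautiful Soup, so it is not the Beautiful Soup API. It pulls
the title and links out of a page and shapes them into a document for Solr.
The field names ("id", "title", "outlinks") are hypothetical and would have
to match your own schema; the JSON array of documents is the shape that
Solr's JSON update handler accepts in recent versions.

```python
import json
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collects the <title> text and every <a href="..."> value of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def to_solr_doc(url, html):
    """Build one document dict; the field names are assumptions, not a standard."""
    extractor = PageExtractor()
    extractor.feed(html)
    extractor.close()
    return {"id": url, "title": extractor.title.strip(), "outlinks": extractor.links}


def to_update_payload(docs):
    """Serialize documents as a JSON array, ready to send to an update handler."""
    return json.dumps(docs)
```

Each new value you want searchable (price, author, score...) becomes one more
key in the dict and one more field in the Solr schema.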

Good luck :D

-- 
Whether it's science, technology, personal experience, true love, astrology,
or gut feelings, each of us has confidence in something that we will never
fully comprehend.
 --Roy H. Williams

Re: Solr scraping: Nutch and other alternatives.

Posted by Marco Martinez <mm...@paradigmatecnologico.com>.

Hi Luis,

Have you tried the copyField function with custom analyzers and tokenizers?

bye,

Marco Martínez Bautista
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42

