Posted to java-user@lucene.apache.org by Dominique Bejean <do...@eolya.fr> on 2011/03/02 01:25:14 UTC
[ANNOUNCE] Web Crawler
Hi,
I would like to announce Crawl Anywhere. Crawl-Anywhere is a Java web
crawler. It includes:
* a crawler
* a document processing pipeline
* a solr indexer
The crawler has a web administration interface for managing the web sites
to be crawled. Each web site crawl is configured with many possible
parameters (not all mandatory):
* number of simultaneous items crawled by site
* recrawl period rules based on item type (html, PDF, …)
* item type inclusion / exclusion rules
* item path inclusion / exclusion / strategy rules
* max depth
* web site authentication
* language
* country
* tags
* collections
* ...
The pipeline includes various ready-to-use stages (text extraction,
language detection, a Solr-ready XML writer, ...).
Everything is highly configurable and extensible, either through scripting
or Java coding. With scripting, you can help the crawler handle JavaScript
links, or help the pipeline extract relevant titles and clean up HTML
pages (removing menus, headers, footers, ...). With Java coding, you can
develop your own pipeline stages.
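As an illustration of what a custom Java stage could look like, here is a minimal sketch. Note that the interface and method names below are invented for illustration only; the actual Crawl Anywhere stage API may differ, so consult the wiki for the real contract.

```java
// Hypothetical sketch of a custom document-processing pipeline stage.
// The interface and method names are invented for illustration; the
// real Crawl Anywhere API may differ.
import java.util.HashMap;
import java.util.Map;

interface PipelineStage {
    // Receives the document's mutable field map; returns false to drop the document.
    boolean process(Map<String, String> fields);
}

class TitleCleanupStage implements PipelineStage {
    @Override
    public boolean process(Map<String, String> fields) {
        String title = fields.get("title");
        if (title == null) {
            return true; // nothing to clean up
        }
        // Strip a trailing " - Site Name" suffix that many CMSs append.
        int sep = title.lastIndexOf(" - ");
        if (sep > 0) {
            fields.put("title", title.substring(0, sep).trim());
        }
        return true;
    }
}
```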
The Crawl Anywhere web site provides good explanations and screenshots,
and everything is documented in a wiki.
The current version is 1.1.4. You can download and try it out here:
www.crawl-anywhere.com
Regards
Dominique
Re: [ANNOUNCE] Web Crawler
Posted by Ramakrishna <ra...@dioxe.com>.
So there is no way to crawl a site if it has blocked crawling? I have one
idea, though it may be a bit foolish (it may not work, or I may have to
modify the whole architecture): what if I use an HTML parser (Jsoup)
instead of the fetcher? An HTML parser can easily extract all the contents
of a web page. Can I do this? I think I would have to rewrite the remaining
parts (segments, updater, indexer, parser) myself, and I suspect the HTML
parser would not work with the existing components if I simply replaced the
fetcher with it.
--
View this message in context: http://lucene.472066.n3.nabble.com/ANNOUNCE-Web-Crawler-tp2607833p4078228.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
RE: [ANNOUNCE] Web Crawler
Posted by ka...@here.com.
Usually, if a webmaster finds that your crawler has ignored their robots.txt, they will block your machine, or maybe even your entire IP block, from accessing their site.
Karl
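For reference, honoring robots.txt amounts to fetching /robots.txt from the site and refusing any URL matched by a Disallow rule. Below is a deliberately simplified sketch; a production crawler should use a complete parser (for example, the crawler-commons library), since real robots.txt files also contain Allow rules, wildcards, and per-agent groups.

```java
// Minimal sketch of honoring robots.txt: collect Disallow rules from the
// "User-agent: *" group and check each URL path against them as prefixes.
// Illustrative only -- a real crawler needs a full parser.
import java.util.ArrayList;
import java.util.List;

class SimpleRobots {
    static List<String> disallowedPrefixes(String robotsTxt) {
        List<String> rules = new ArrayList<>();
        boolean inStarGroup = false;
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.trim();
            String lower = line.toLowerCase();
            if (lower.startsWith("user-agent:")) {
                inStarGroup = line.substring(11).trim().equals("*");
            } else if (inStarGroup && lower.startsWith("disallow:")) {
                String path = line.substring(9).trim();
                if (!path.isEmpty()) {
                    rules.add(path);
                }
            }
        }
        return rules;
    }

    // Returns true if the path is not blocked by any Disallow rule.
    static boolean allowed(String robotsTxt, String path) {
        for (String prefix : disallowedPrefixes(robotsTxt)) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```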
-----Original Message-----
From: Jack Krupansky [mailto:jack@basetechnology.com]
Sent: Monday, July 15, 2013 9:30 AM
To: java-user@lucene.apache.org
Subject: Re: [ANNOUNCE] Web Crawler
Lucene does not provide any capabilities for crawling websites. You would have to contact the Nutch project, the ManifoldCF project, or other web crawling projects.
As far as bypassing robots.txt, that is a very unethical thing to do. It is rather offensive that you seem to be suggesting that anybody on this mailing list would engage in such an unethical or unprofessional activity.
-- Jack Krupansky
-----Original Message-----
From: Ramakrishna
Sent: Monday, July 15, 2013 9:13 AM
To: java-user@lucene.apache.org
Subject: Re: [ANNOUNCE] Web Crawler
Hi..
I'm trying Nutch to crawl some web sites. Unfortunately, they have restricted crawling via robots.txt. With Crawl-Anywhere, can I crawl web sites regardless of their robots.txt? If yes, please send me materials/links about Crawl-Anywhere; otherwise, please suggest crawlers that can crawl sites without honoring robots.txt. It's urgent; please reply as soon as possible.
Thanks in advance
--
View this message in context:
http://lucene.472066.n3.nabble.com/ANNOUNCE-Web-Crawler-tp2607833p4078039.html
Re: [ANNOUNCE] Web Crawler
Posted by Jack Krupansky <ja...@basetechnology.com>.
Lucene does not provide any capabilities for crawling websites. You would
have to contact the Nutch project, the ManifoldCF project, or other web
crawling projects.
As far as bypassing robots.txt, that is a very unethical thing to do. It is
rather offensive that you seem to be suggesting that anybody on this mailing
list would engage in such an unethical or unprofessional activity.
-- Jack Krupansky
-----Original Message-----
From: Ramakrishna
Sent: Monday, July 15, 2013 9:13 AM
To: java-user@lucene.apache.org
Subject: Re: [ANNOUNCE] Web Crawler
Hi..
I'm trying Nutch to crawl some web sites. Unfortunately, they have
restricted crawling via robots.txt. With Crawl-Anywhere, can I crawl web
sites regardless of their robots.txt? If yes, please send me
materials/links about Crawl-Anywhere; otherwise, please suggest crawlers
that can crawl sites without honoring robots.txt. It's urgent; please
reply as soon as possible.
Thanks in advance
--
View this message in context:
http://lucene.472066.n3.nabble.com/ANNOUNCE-Web-Crawler-tp2607833p4078039.html
Re: [ANNOUNCE] Web Crawler
Posted by Ramakrishna <ra...@dioxe.com>.
Hi..
I'm trying Nutch to crawl some web sites. Unfortunately, they have
restricted crawling via robots.txt. With Crawl-Anywhere, can I crawl web
sites regardless of their robots.txt? If yes, please send me
materials/links about Crawl-Anywhere; otherwise, please suggest crawlers
that can crawl sites without honoring robots.txt. It's urgent; please
reply as soon as possible.
Thanks in advance
--
View this message in context: http://lucene.472066.n3.nabble.com/ANNOUNCE-Web-Crawler-tp2607833p4078039.html
Re: [ANNOUNCE] Web Crawler
Posted by Dominique Bejean <do...@eolya.fr>.
Hi,
Sorry for the delay, but I haven't been checking the mailing list for a
long time.
Crawl-Anywhere includes three pieces of software: a crawler, a pipeline,
and a Solr indexer.
There is a default Solr schema used by Crawl-Anywhere, tested with Solr
1.4.1 and Solr 3.1.0.
However, you can configure the pipeline stage responsible for mapping
crawled data to Solr fields. In principle, you can use any schema with
any Solr version.
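As an illustration of what such a mapping stage ultimately produces, here is a sketch of the XML update message posted to Solr's /update handler. The field names ("id", "title", "text") are illustrative and not necessarily Crawl-Anywhere's default schema.

```java
// Sketch of building a Solr XML update document (<add><doc>...). The
// field names here are illustrative, not Crawl-Anywhere's actual schema.
class SolrXmlWriter {
    static String toSolrXml(String id, String title, String text) {
        return "<add><doc>"
             + field("id", id)
             + field("title", title)
             + field("text", text)
             + "</doc></add>";
    }

    static String field(String name, String value) {
        return "<field name=\"" + name + "\">" + escape(value) + "</field>";
    }

    // Escape characters that are special in XML element content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```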
Regards
Dominique
On 14/05/11 15:29, abhayd wrote:
> hi Dominique,
>
> I am looking for a crawler to feed a Solr index. After looking at various
> posts I have settled on two: Nutch and Crawl Anywhere.
>
> I don't see any activity on the Nutch wiki, so I'm wondering if it is no
> longer being developed. But most forums say Nutch is the standard for Solr.
>
> Crawl Anywhere looks solid. Is there any way for users like me to decide
> which one to go for, Nutch or Crawl Anywhere?
>
> My concern with Crawl Anywhere is that it supports a Solr 1.3 index, not
> the latest version.
>
> Any help on this is really appreciated.
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/ANNOUNCE-Web-Crawler-tp2607833p2937762.html
Re: [ANNOUNCE] Web Crawler
Posted by abhayd <aj...@hotmail.com>.
hi Julien,
I'm not sure what you mean by "SOLR is now used by default for indexing in
Nutch." Does that mean Solr has integrated Nutch for crawling web
resources? I checked the Solr wiki but didn't see anything like that.
Could you please provide some details?
--
View this message in context: http://lucene.472066.n3.nabble.com/ANNOUNCE-Web-Crawler-tp2607833p2947623.html
Re: [ANNOUNCE] Web Crawler
Posted by Julien Nioche <li...@gmail.com>.
> I dont see any activities on Nutch wiki so wondering if its not being
> developed anymore. But most forums say Nutch is standard for solr.
>
Looking at the mailing list archives is a good way to tell whether a
project is still alive. In the case of Nutch, the project is active, as
you can see in the list archives below:
http://www.mail-archive.com/user%40nutch.apache.org/
http://www.mail-archive.com/dev%40nutch.apache.org/
We're about to release a new version (1.3) and have a 2.0 in beta. SOLR is
now used by default for indexing in Nutch.
HTH
Julien Nioche
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
RE: [ANNOUNCE] Web Crawler
Posted by ka...@nokia.com.
You might want to look at ManifoldCF also.
Karl
-----Original Message-----
From: ext abhayd [mailto:ajdabholkar@hotmail.com]
Sent: Saturday, May 14, 2011 9:29 AM
To: java-user@lucene.apache.org
Subject: Re: [ANNOUNCE] Web Crawler
hi Dominique,
I am looking for a crawler to feed a Solr index. After looking at various
posts I have settled on two: Nutch and Crawl Anywhere.
I don't see any activity on the Nutch wiki, so I'm wondering if it is no
longer being developed. But most forums say Nutch is the standard for Solr.
Crawl Anywhere looks solid. Is there any way for users like me to decide
which one to go for, Nutch or Crawl Anywhere?
My concern with Crawl Anywhere is that it supports a Solr 1.3 index, not
the latest version.
Any help on this is really appreciated.
--
View this message in context: http://lucene.472066.n3.nabble.com/ANNOUNCE-Web-Crawler-tp2607833p2937762.html
Re: [ANNOUNCE] Web Crawler
Posted by abhayd <aj...@hotmail.com>.
hi Dominique,
I am looking for a crawler to feed a Solr index. After looking at various
posts I have settled on two: Nutch and Crawl Anywhere.
I don't see any activity on the Nutch wiki, so I'm wondering if it is no
longer being developed. But most forums say Nutch is the standard for Solr.
Crawl Anywhere looks solid. Is there any way for users like me to decide
which one to go for, Nutch or Crawl Anywhere?
My concern with Crawl Anywhere is that it supports a Solr 1.3 index, not
the latest version.
Any help on this is really appreciated.
--
View this message in context: http://lucene.472066.n3.nabble.com/ANNOUNCE-Web-Crawler-tp2607833p2937762.html