Posted to java-user@lucene.apache.org by Dominique Bejean <do...@eolya.fr> on 2011/03/02 01:25:14 UTC

[ANNOUNCE] Web Crawler

Hi,

I would like to announce Crawl-Anywhere, a Java web crawler. It includes:

    * a crawler
    * a document processing pipeline
    * a Solr indexer

The crawler has a web administration interface for managing the web sites to 
be crawled. Each web site crawl is configured with many possible 
parameters (not all mandatory):

    * number of simultaneous items crawled by site
    * recrawl period rules based on item type (html, PDF, …)
    * item type inclusion / exclusion rules
    * item path inclusion / exclusion / strategy rules
    * max depth
    * web site authentication
    * language
    * country
    * tags
    * collections
    * ...

The pipeline includes various ready-to-use stages (text extraction, 
language detection, a writer that produces Solr-ready XML, ...).

Everything is highly configurable and extensible, either by scripting or by Java coding.

With scripting, you can help the crawler handle JavaScript links, or help 
the pipeline extract a relevant title and clean up the HTML pages 
(removing menus, headers, footers, ...).

With Java coding, you can develop your own pipeline stages.
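For illustration, a custom Java stage might look like the sketch below. The interface name and method signature are assumptions made for the example, not Crawl-Anywhere's actual API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stage interface -- Crawl-Anywhere's real one may differ.
interface PipelineStage {
    // Each stage receives the crawled document's fields and returns
    // a transformed copy.
    Map<String, String> process(Map<String, String> doc);
}

// Example stage: derive a clean title by stripping a trailing
// " - Site Name" suffix, as often found in crawled HTML titles.
class TitleCleanupStage implements PipelineStage {
    public Map<String, String> process(Map<String, String> doc) {
        Map<String, String> out = new HashMap<>(doc);
        String title = out.getOrDefault("title", "");
        int sep = title.lastIndexOf(" - ");
        if (sep > 0) {
            out.put("title", title.substring(0, sep).trim());
        }
        return out;
    }
}

public class PipelineDemo {
    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        doc.put("title", "Product page - Example Corp");
        Map<String, String> cleaned = new TitleCleanupStage().process(doc);
        System.out.println(cleaned.get("title")); // Product page
    }
}
```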

The Crawl Anywhere web site provides detailed explanations and screenshots. 
Everything is documented in a wiki.

The current version is 1.1.4. You can download it and try it out 
here: www.crawl-anywhere.com


Regards

Dominique

Re: [ANNOUNCE] Web Crawler

Posted by Ramakrishna <ra...@dioxe.com>.
So there is no way to crawl sites that have blocked crawling? I have one
idea, though it may be foolish (it may not work, or I may have to modify the
whole architecture), but I'll ask anyway: what if I use an HTML parser
(Jsoup) instead of the fetcher? An HTML parser can easily take all the
contents of a web page. Can I do this? I think I would have to rewrite the
remaining parts (segments, updater, indexer, parser); the HTML parser
probably will not work with the existing components if I simply replace the
fetcher with it.



--
View this message in context: http://lucene.472066.n3.nabble.com/ANNOUNCE-Web-Crawler-tp2607833p4078228.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org



RE: [ANNOUNCE] Web Crawler

Posted by ka...@here.com.
Usually, if a webmaster finds that your crawler has ignored their robots.txt, they will block your machine, or maybe even your entire IP block, from accessing their site.

Karl
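A well-behaved crawler consults robots.txt before fetching. The following is a minimal, simplified sketch of such a check: it only honors "Disallow" prefix rules under "User-agent: *", and ignores Allow rules, wildcards, and crawl delays that real parsers handle.

```java
import java.util.ArrayList;
import java.util.List;

public class RobotsCheck {
    // Collect the Disallow path prefixes that apply to all user agents
    // (the "User-agent: *" group). Simplified: ignores Allow, wildcards,
    // and per-agent groups.
    static List<String> disallowedPrefixes(String robotsTxt) {
        List<String> rules = new ArrayList<>();
        boolean inStarGroup = false;
        for (String line : robotsTxt.split("\n")) {
            String l = line.trim();
            if (l.toLowerCase().startsWith("user-agent:")) {
                inStarGroup = l.substring("user-agent:".length()).trim().equals("*");
            } else if (inStarGroup && l.toLowerCase().startsWith("disallow:")) {
                String path = l.substring("disallow:".length()).trim();
                if (!path.isEmpty()) rules.add(path);
            }
        }
        return rules;
    }

    // A path is allowed unless it falls under a disallowed prefix.
    static boolean isAllowed(String robotsTxt, String path) {
        for (String prefix : disallowedPrefixes(robotsTxt)) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private/\nDisallow: /tmp/\n";
        System.out.println(isAllowed(robots, "/private/data.html")); // false
        System.out.println(isAllowed(robots, "/public/index.html")); // true
    }
}
```

A production crawler would instead use a full Robots Exclusion Protocol parser, but the check itself is this cheap: there is no excuse for skipping it.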

-----Original Message-----
From: Jack Krupansky [mailto:jack@basetechnology.com] 
Sent: Monday, July 15, 2013 9:30 AM
To: java-user@lucene.apache.org
Subject: Re: [ANNOUNCE] Web Crawler

Lucene does not provide any capabilities for crawling websites. You would have to contact the Nutch project, the ManifoldCF project, or other web crawling projects.

As far as bypassing robots.txt, that is a very unethical thing to do. It is rather offensive that you seem to be suggesting that anybody on this mailing list would engage in such an unethical or unprofessional activity.

-- Jack Krupansky

-----Original Message-----
From: Ramakrishna
Sent: Monday, July 15, 2013 9:13 AM
To: java-user@lucene.apache.org
Subject: Re: [ANNOUNCE] Web Crawler

Hi..

I'm trying Nutch to crawl some web sites. Unfortunately, they restrict crawling via robots.txt. With Crawl-Anywhere, can I crawl any web site regardless of its robots.txt? If yes, please send me materials/links for learning about Crawl-Anywhere; otherwise, please suggest crawlers that can crawl a web site without honoring its robots.txt. It's urgent; please reply as soon as possible.

Thanks in advance





Re: [ANNOUNCE] Web Crawler

Posted by Jack Krupansky <ja...@basetechnology.com>.
Lucene does not provide any capabilities for crawling websites. You would 
have to contact the Nutch project, the ManifoldCF project, or other web 
crawling projects.

As far as bypassing robots.txt, that is a very unethical thing to do. It is 
rather offensive that you seem to be suggesting that anybody on this mailing 
list would engage in such an unethical or unprofessional activity.

-- Jack Krupansky

-----Original Message----- 
From: Ramakrishna
Sent: Monday, July 15, 2013 9:13 AM
To: java-user@lucene.apache.org
Subject: Re: [ANNOUNCE] Web Crawler

Hi..

I'm trying Nutch to crawl some web sites. Unfortunately, they restrict
crawling via robots.txt. With Crawl-Anywhere, can I crawl any web site
regardless of its robots.txt? If yes, please send me materials/links for
learning about Crawl-Anywhere; otherwise, please suggest crawlers that can
crawl a web site without honoring its robots.txt. It's urgent; please reply
as soon as possible.

Thanks in advance





Re: [ANNOUNCE] Web Crawler

Posted by Ramakrishna <ra...@dioxe.com>.
Hi..

I'm trying Nutch to crawl some web sites. Unfortunately, they restrict
crawling via robots.txt. With Crawl-Anywhere, can I crawl any web site
regardless of its robots.txt? If yes, please send me materials/links for
learning about Crawl-Anywhere; otherwise, please suggest crawlers that can
crawl a web site without honoring its robots.txt. It's urgent; please reply
as soon as possible.

Thanks in advance





Re: [ANNOUNCE] Web Crawler

Posted by Dominique Bejean <do...@eolya.fr>.
Hi,

Sorry for the delay, but I haven't been checking the mailing list for a 
long time.

Crawl-Anywhere includes three pieces of software: a crawler, a pipeline, and 
a Solr indexer.

There is a default Solr schema used by Crawl-Anywhere, tested with Solr 
1.4.1 and Solr 3.1.0.

However, you can configure the pipeline stage responsible for mapping 
crawled data to Solr fields. In principle, you can use any schema with 
any Solr version.
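As an illustration of what such a mapping stage produces, here is a small sketch that turns crawled fields into a Solr add XML document. The writer class and field names are illustrative, not Crawl-Anywhere's actual code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SolrXmlWriter {
    // Escape the XML special characters in a field value.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Map crawled field name/value pairs onto Solr's <add><doc> format.
    static String toSolrAddXml(Map<String, String> doc) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> e : doc.entrySet()) {
            sb.append("<field name=\"").append(e.getKey()).append("\">")
              .append(escape(e.getValue()))
              .append("</field>");
        }
        sb.append("</doc></add>");
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", "http://example.com/page1");
        doc.put("title", "Example page");
        System.out.println(toSolrAddXml(doc));
    }
}
```

Because the mapping step is just renaming and serializing fields, swapping in a different schema means changing the field names it emits, which is why any schema and Solr version can work.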

Regards

Dominique



Le 14/05/11 15:29, abhayd a écrit :
> hi Dominique,
>
> I am looking for a crawler to feed a Solr index. After looking at various
> posts, I have settled on two: Nutch and Crawl Anywhere.
>
> I don't see any activity on the Nutch wiki, so I am wondering whether it is
> still being developed. But most forums say Nutch is the standard for Solr.
>
> Crawl Anywhere looks solid. Is there any way for users like me to decide
> between Nutch and Crawl Anywhere?
>
> My concern with Crawl Anywhere is that it supports a Solr 1.3 index, not
> the latest version.
>
> Any help on this is really appreciated.
>



Re: [ANNOUNCE] Web Crawler

Posted by abhayd <aj...@hotmail.com>.
hi Julien,

I'm not sure what you mean by "SOLR is now used by default for indexing in
Nutch." Does that mean Solr has integrated Nutch for crawling web resources?

I checked the Solr wiki but didn't see anything like that. Could you please
provide some details?




Re: [ANNOUNCE] Web Crawler

Posted by Julien Nioche <li...@gmail.com>.
> I dont see any activities on Nutch wiki so wondering if its not being
> developed anymore. But most forums say Nutch is standard for solr.
>

Looking at the mail archives is a good way to tell whether a project is still
alive. In the case of Nutch, the project is active, as you can see in the
list archives below:

http://www.mail-archive.com/user%40nutch.apache.org/
http://www.mail-archive.com/dev%40nutch.apache.org/

We're about to release a new version (1.3) and have a 2.0 in beta. SOLR is
now used by default for indexing in Nutch.

HTH

Julien Nioche

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

RE: [ANNOUNCE] Web Crawler

Posted by ka...@nokia.com.
You might want to look at ManifoldCF also.

Karl

-----Original Message-----
From: ext abhayd [mailto:ajdabholkar@hotmail.com] 
Sent: Saturday, May 14, 2011 9:29 AM
To: java-user@lucene.apache.org
Subject: Re: [ANNOUNCE] Web Crawler

hi Dominique,

I am looking for a crawler to feed a Solr index. After looking at various
posts, I have settled on two: Nutch and Crawl Anywhere.

I don't see any activity on the Nutch wiki, so I am wondering whether it is
still being developed. But most forums say Nutch is the standard for Solr.

Crawl Anywhere looks solid. Is there any way for users like me to decide
between Nutch and Crawl Anywhere?

My concern with Crawl Anywhere is that it supports a Solr 1.3 index, not the
latest version.

Any help on this is really appreciated.



Re: [ANNOUNCE] Web Crawler

Posted by abhayd <aj...@hotmail.com>.
hi Dominique,

I am looking for a crawler to feed a Solr index. After looking at various
posts, I have settled on two: Nutch and Crawl Anywhere.

I don't see any activity on the Nutch wiki, so I am wondering whether it is
still being developed. But most forums say Nutch is the standard for Solr.

Crawl Anywhere looks solid. Is there any way for users like me to decide
between Nutch and Crawl Anywhere?

My concern with Crawl Anywhere is that it supports a Solr 1.3 index, not the
latest version.

Any help on this is really appreciated.
