Posted to solr-user@lucene.apache.org by Dominique Bejean <do...@eolya.fr> on 2013/05/22 15:18:16 UTC
Re: [ANNOUNCE] Web Crawler
Hi,
Crawl-Anywhere is now open-source - https://github.com/bejean/crawl-anywhere
Best regards.
On 02/03/11 10:02, findbestopensource wrote:
> Hello Dominique Bejean,
>
> Good job.
>
> We identified almost 8 open source web crawlers:
> http://www.findbestopensource.com/tagged/webcrawler
> I don't know how yours differs from the rest.
>
> Your license states that it is not open source, but that it is free for
> personal use.
>
> Regards
> Aditya
> www.findbestopensource.com <http://www.findbestopensource.com>
>
>
> On Wed, Mar 2, 2011 at 5:55 AM, Dominique Bejean
> <dominique.bejean@eolya.fr> wrote:
>
> Hi,
>
> I would like to announce Crawl-Anywhere. Crawl-Anywhere is a Java
> web crawler. It includes:
>
> * a crawler
> * a document processing pipeline
> * a Solr indexer
>
> The crawler has a web administration interface for managing the web
> sites to be crawled. Each web site crawl can be configured with many
> parameters (not all mandatory):
>
> * number of items crawled simultaneously per site
> * recrawl period rules based on item type (HTML, PDF, ...)
> * item type inclusion / exclusion rules
> * item path inclusion / exclusion / strategy rules
> * max depth
> * web site authentication
> * language
> * country
> * tags
> * collections
> * ...
>
> The pipeline includes various ready-to-use stages (text
> extraction, language detection, a writer producing Solr-ready XML, ...).
>
> Everything is highly configurable and extensible, either by scripting
> or by Java coding.
>
> With scripting, you can help the crawler handle JavaScript links,
> or help the pipeline extract a relevant title and clean up the
> HTML pages (remove menus, headers, footers, ...).
>
> With Java coding, you can develop your own pipeline stages.
>
> The Crawl-Anywhere web site provides good explanations and
> screenshots. Everything is documented in a wiki.
>
> The current version is 1.1.4. You can download it and try it out
> here: www.crawl-anywhere.com <http://www.crawl-anywhere.com>
>
>
> Regards
>
> Dominique
>
>
--
Dominique Béjean
+33 6 08 46 12 43
skype: dbejean
www.eolya.fr
www.crawl-anywhere.com
www.mysolrserver.com
Re: [ANNOUNCE] Web Crawler
Posted by Dominique Bejean <do...@eolya.fr>.
Hi,
Release 3.0.3 was tested with:
* Oracle Java 6 (but it should work fine with version 7)
* Tomcat 5.5, 6 and 7
* PHP 5.2.x and 5.3.x
* Apache 2.2.x
* MongoDB 2.2, 64-bit (known issue with 2.4)
The new release, 4.0.0-alpha-2, is available on GitHub -
https://github.com/bejean/crawl-anywhere
The prerequisites are:
* Oracle Java 6 or later
* Tomcat 5.5 or later
* Apache 2.2 or later
* PHP 5.2.x, 5.3.x or 5.4.x
* MongoDB 2.2 or later, 64-bit
* Solr 3.x or later (configuration files are provided for Solr 4.3.0)
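For illustration only, the prerequisites above could be checked with a small script like the one below. This is a minimal sketch in Python and is not part of Crawl-Anywhere: the names MINIMUMS, PHP_SUPPORTED, parse_version, meets_minimum, php_supported and unmet are all hypothetical, and it assumes the Java version is given as its major number ("6", "7") rather than in the legacy "1.6" form. Only the version numbers themselves come from the list above.

```python
import re

# Minimum versions taken from the prerequisites list above.
MINIMUMS = {
    "java": "6",       # assumption: caller passes the major version, e.g. "7"
    "tomcat": "5.5",
    "apache": "2.2",
    "mongodb": "2.2",
    "solr": "3",
}

# PHP is listed as a closed set of release lines, not as a minimum.
PHP_SUPPORTED = {(5, 2), (5, 3), (5, 4)}


def parse_version(text):
    """Turn a string such as '2.2.4' into the tuple (2, 2, 4)."""
    return tuple(int(part) for part in re.findall(r"\d+", text))


def meets_minimum(component, installed):
    """Return True if `installed` is at or above the listed minimum."""
    required = parse_version(MINIMUMS[component])
    found = parse_version(installed)
    # Pad with zeros so that (3,) and (3, 0, 0) compare as equal.
    width = max(len(required), len(found))
    required += (0,) * (width - len(required))
    found += (0,) * (width - len(found))
    return found >= required


def php_supported(installed):
    """Return True if the PHP version belongs to a listed release line."""
    return parse_version(installed)[:2] in PHP_SUPPORTED


def unmet(installed):
    """List the components in `installed` that fail the prerequisites."""
    failures = []
    for name, version in installed.items():
        if name == "php":
            ok = php_supported(version)
        else:
            ok = meets_minimum(name, version)
        if not ok:
            failures.append(name)
    return sorted(failures)
```

For example, unmet({"java": "7", "php": "5.5.0", "mongodb": "2.0.9"}) reports mongodb and php, since neither version matches the list above.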
The up-to-date installation instructions are here:
http://www.crawl-anywhere.com/installation-v400/
Please read the GitHub project home page; all the information is provided there.
Regards.
Dominique
On 23/05/13 07:38, Rajesh Nikam wrote:
> Hi,
>
> Crawl-Anywhere seems to be using old versions of Java, Tomcat, etc.
>
> http://www.crawl-anywhere.com/installation-v300/
>
> Will it work with newer versions of the required software?
>
> Is there an updated installation guide available?
>
> Thanks
> Rajesh
--
Dominique Béjean
+33 6 08 46 12 43
skype: dbejean
www.eolya.fr
www.crawl-anywhere.com
Re: [ANNOUNCE] Web Crawler
Posted by Rajesh Nikam <ra...@gmail.com>.
Hi,
Crawl-Anywhere seems to be using old versions of Java, Tomcat, etc.
http://www.crawl-anywhere.com/installation-v300/
Will it work with newer versions of the required software?
Is there an updated installation guide available?
Thanks
Rajesh