Posted to solr-user@lucene.apache.org by Dominique Bejean <do...@eolya.fr> on 2013/05/22 15:18:16 UTC

Re: [ANNOUNCE] Web Crawler

Hi,

Crawl-Anywhere is now open-source - https://github.com/bejean/crawl-anywhere

Best regards.


On 02/03/11 10:02, findbestopensource wrote:
> Hello Dominique Bejean,
>
> Good job.
>
> We identified almost 8 open source web crawlers at
> http://www.findbestopensource.com/tagged/webcrawler
> I don't know how much yours differs from the rest.
>
> Your license states that it is not open source but is free for
> personal use.
>
> Regards
> Aditya
> www.findbestopensource.com
>
>
> On Wed, Mar 2, 2011 at 5:55 AM, Dominique Bejean
> <dominique.bejean@eolya.fr> wrote:
>
>     Hi,
>
>     I would like to announce Crawl-Anywhere, a Java web crawler. It
>     includes:
>
>       * a crawler
>       * a document processing pipeline
>       * a Solr indexer
>
>     The crawler has a web administration interface for managing the
>     web sites to be crawled. Each web site crawl can be configured
>     with many parameters (not all mandatory):
>
>       * number of simultaneous items crawled by site
>       * recrawl period rules based on item type (HTML, PDF, …)
>       * item type inclusion / exclusion rules
>       * item path inclusion / exclusion / strategy rules
>       * max depth
>       * web site authentication
>       * language
>       * country
>       * tags
>       * collections
>       * ...
>
>     The pipeline includes various ready-to-use stages (text
>     extraction, language detection, a writer that produces
>     Solr-ready XML, ...).
>
>     Everything is highly configurable and extensible, either by
>     scripting or by Java coding.
>
>     With scripting, you can help the crawler handle JavaScript
>     links, or help the pipeline extract a relevant title and clean up
>     the HTML pages (remove menus, headers, footers, ...).
>
>     With Java coding, you can develop your own pipeline stages.
>
>     The Crawl-Anywhere web site provides good explanations and
>     screenshots. Everything is documented in a wiki.
>
>     The current version is 1.1.4. You can download and try it out
>     here: www.crawl-anywhere.com
>
>
>     Regards
>
>     Dominique
>
>

-- 
Dominique Béjean
+33 6 08 46 12 43
skype: dbejean
www.eolya.fr
www.crawl-anywhere.com
www.mysolrserver.com

Re: [ANNOUNCE] Web Crawler

Posted by Dominique Bejean <do...@eolya.fr>.

Hi,

Release 3.0.3 was tested with:

* Oracle Java 6 (it should also work with version 7)
* Tomcat 5.5, 6, and 7
* PHP 5.2.x and 5.3.x
* Apache 2.2.x
* MongoDB 2.2, 64-bit (known issue with 2.4)

The new release, 4.0.0-alpha-2, is available on GitHub:
https://github.com/bejean/crawl-anywhere

The prerequisites are:

* Oracle Java 6 or later
* Tomcat 5.5 or later
* Apache 2.2 or later
* PHP 5.2.x, 5.3.x, or 5.4.x
* MongoDB 2.2 or later, 64-bit
* Solr 3.x or later (configuration files are provided for Solr 4.3.0)
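
As a rough illustration only, the list above could be turned into a small
check like the one below. This is a sketch, not part of Crawl-Anywhere: the
names MINIMUMS, PHP_SERIES, parse_version and check are made up for this
example, and it assumes you type in the installed version numbers yourself
(for Java, the major number such as "6" or "7", not the "1.6.0_45" form).

```python
# Hypothetical helper, not shipped with Crawl-Anywhere.
# Minimum versions copied from the prerequisites list in this message.
MINIMUMS = {
    "java": (6,),
    "tomcat": (5, 5),
    "apache": (2, 2),
    "mongodb": (2, 2),
    "solr": (3,),
}

# PHP is listed as specific supported series rather than "or later".
PHP_SERIES = {(5, 2), (5, 3), (5, 4)}


def parse_version(text):
    """Turn '2.2.4' into (2, 2, 4); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def check(installed):
    """Map each component name to True if its version meets the list."""
    results = {}
    for name, version_text in installed.items():
        version = parse_version(version_text)
        if name == "php":
            results[name] = version[:2] in PHP_SERIES
        else:
            # Tuples compare element by element, so (5, 5, 36) >= (5, 5).
            results[name] = version >= MINIMUMS[name]
    return results


if __name__ == "__main__":
    # Example values only; replace them with what is installed locally.
    report = check({"java": "7", "tomcat": "6.0.37", "php": "5.3.10",
                    "apache": "2.2.22", "mongodb": "2.2.4", "solr": "4.3.0"})
    for component, ok in sorted(report.items()):
        print(component, "OK" if ok else "does not match the list")
```

The check only compares numbers; it does not verify that MongoDB is a 64-bit
build, which still has to be confirmed by hand.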

The up-to-date installation instructions are here:
http://www.crawl-anywhere.com/installation-v400/

Please read the GitHub project home page; all the information is provided there.

Regards.

Dominique




On 23/05/13 07:38, Rajesh Nikam wrote:
> Hi,
>
> Crawl-Anywhere seems to be using old versions of Java, Tomcat, etc.
>
> http://www.crawl-anywhere.com/installation-v300/
>
> Will it work with newer versions of the required software?
>
> Is there an updated installation guide available?
>
> Thanks
> Rajesh


Re: [ANNOUNCE] Web Crawler

Posted by Rajesh Nikam <ra...@gmail.com>.

Hi,

Crawl-Anywhere seems to be using old versions of Java, Tomcat, etc.

http://www.crawl-anywhere.com/installation-v300/

Will it work with newer versions of the required software?

Is there an updated installation guide available?

Thanks
Rajesh




