Posted to solr-user@lucene.apache.org by Markus Jelsma <ma...@openindex.io> on 2013/10/30 18:54:31 UTC

RE: Replacing Google Mini Search Appliance with Solr?

Hi Eric,

We have also helped a government institution replace its expensive GSA with open source software. In our case we use Apache Nutch 1.7 to crawl the websites and index them into Apache Solr. It is very effective and robust, and it scales easily with Hadoop if you need it to. Nutch may not be the easiest tool for the job, but it is very stable and feature-rich, and it has an active community here at Apache.

Cheers,
 
-----Original message-----
> From:Palmer, Eric <ep...@richmond.edu>
> Sent: Wednesday 30th October 2013 18:48
> To: solr-user@lucene.apache.org
> Subject: Replacing Google Mini Search Appliance with Solr?
> 
> Hello all,
> 
> Been lurking on the list for a while.
> 
> Our two Google Mini search appliances, used to index our public web sites, are at end of life and need replacing. Google is no longer selling the Mini appliances, and buying the big appliance is not cost-effective.
> 
> http://search.richmond.edu/
> 
> We would run a Solr replacement on Linux (CentOS, Red Hat, or similar) with OpenJDK or Oracle Java.
> 
> Background
> ==========
> ~130 sites
> only ~12,000 pages (at a depth of 3)
> probably ~40,000 pages if we go to a depth of 4
> 
> We use key matches a lot. In Solr terms these are elevated documents (elevations).
> 
> We would code a search query form in PHP and wrap it into our design (http://www.richmond.edu).
> 
> I have played with and love LucidWorks and know that their paid solution works for our use cases, but the cost model is not attractive for such a small collection.
> 
> So with Solr, what are my open source options, and what are people's experiences crawling and indexing web sites with Solr plus a crawler? I understand that Solr does not include a crawler, so getting one working would be the first step.
> 
> We can code in Java, PHP, Python etc. if we have to, but we don't want to write a crawler if we can avoid it.
> 
> Thanks in advance for any information.
> 
> --
> Eric Palmer
> Web Services
> U of Richmond
> 
> 
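
The key matches mentioned in the original post map onto Solr's QueryElevationComponent, which reads an elevate.xml file listing the documents to pin for a given query text. A minimal sketch of generating that file from a mapping follows. The mapping, the example.edu URLs, and the helper name `build_elevate_xml` are hypothetical, and it assumes the document id in the index is the page URL, which should be checked against the actual schema.

```python
import xml.etree.ElementTree as ET


def build_elevate_xml(key_matches):
    """Render a {query text: [document ids]} mapping as elevate.xml content."""
    root = ET.Element("elevate")
    for query_text, doc_ids in key_matches.items():
        query = ET.SubElement(root, "query", {"text": query_text})
        for doc_id in doc_ids:
            # One <doc> element per document to elevate for this query.
            ET.SubElement(query, "doc", {"id": doc_id})
    return ET.tostring(root, encoding="unicode")


# Hypothetical key matches; the ids assume the page URL is the uniqueKey.
KEY_MATCHES = {
    "admissions": ["http://www.example.edu/admissions/"],
    "library hours": ["http://www.example.edu/library/hours.html"],
}

if __name__ == "__main__":
    print(build_elevate_xml(KEY_MATCHES))
```

Keeping the key matches in a plain mapping like this would let the existing GSA list be exported once and regenerated into elevate.xml whenever it changes.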

Re: Replacing Google Mini Search Appliance with Solr?

Posted by Furkan KAMACI <fu...@gmail.com>.
We also use Nutch in our environment. Nutch crawls the data and sends it to
Solr for indexing. I have implemented a custom search API that talks to my
Solr indexes because I don't want to expose them directly to the outside.
You can easily configure and build what you want with this kind of
combination.
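
A search API of this kind can be quite thin. The sketch below is not the implementation described above, only an illustration of the idea: the front end passes a query string and paging values, and the API builds the Solr request itself so callers never choose the handler or arbitrary parameters. The Solr URL, the row cap, and the function name are assumptions.

```python
from urllib.parse import urlencode

# Assumed Solr location; adjust host, port, and core path to the deployment.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"
MAX_ROWS = 50  # hypothetical cap so callers cannot request huge pages


def build_solr_url(user_query, page=0, rows=10):
    """Build a Solr select URL from a small, fixed set of inputs."""
    rows = max(1, min(int(rows), MAX_ROWS))
    page = max(0, int(page))
    params = {
        # Fall back to match-all when the user submits an empty query.
        "q": user_query.strip() or "*:*",
        "defType": "edismax",
        "start": page * rows,
        "rows": rows,
        "wt": "json",
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_solr_url("library hours", page=1))
```

Only the query text and paging reach Solr, so the index itself can stay on an internal address.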


Re: Replacing Google Mini Search Appliance with Solr?

Posted by "Palmer, Eric" <ep...@richmond.edu>.
Thanks for the link

Sent from my iPhone


Re: Replacing Google Mini Search Appliance with Solr?

Posted by Rajani Maski <ra...@gmail.com>.
Hi Eric,

  I have also developed mini-applications replacing GSA for some of our
clients, using Apache Nutch + Solr to crawl multilingual sites and enable
multilingual search. Nutch + Solr is very stable, and the Nutch mailing list
provides good support.

Reference link to start:
https://sites.google.com/site/profilerajanimaski/webcrawlers/apache-nutch

Thanks
Rajani





Re: Replacing Google Mini Search Appliance with Solr?

Posted by "Palmer, Eric" <ep...@richmond.edu>.
Markus and Jason

Thanks for the info.

I will start to research Nutch. I agree that writing a crawler is a rabbit
hole.


-- 
Eric Palmer

Web Services
U of Richmond

To report technical issues, obtain technical support or make requests for
enhancements please visit
http://web.richmond.edu/contact/technical-support.html







Re: Replacing Google Mini Search Appliance with Solr?

Posted by Jason Hellman <jh...@innoventsolutions.com>.
Nutch is an excellent option.  It should feel very comfortable for people migrating away from the Google appliances.

Apache Droids is another possible approach, and I’ve found people using Heritrix or ManifoldCF for various use cases (usually in combination with other use cases where the extra overhead was worth the trouble).

I think the simplest approach will be Nutch… it’s absolutely worth taking a shot at it.

DO NOT write a crawler!  That is a rabbit hole you do not want to peer down into :)
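
For anyone taking that shot at Nutch with a site list like the ~130 sites described in the original post, the two inputs to prepare are a seed list (one URL per line) and rules for regex-urlfilter.txt, where each line is `+` or `-` followed by a regular expression and the first matching rule wins. A minimal sketch that derives both from a list of sites follows. The example.edu hosts and the function names are hypothetical, and the rules assume the crawl should stay strictly on the listed hosts.

```python
import re
from urllib.parse import urlparse


def seed_lines(site_urls):
    """Return seed-list lines: one URL per line, de-duplicated, order kept."""
    seen = []
    for url in site_urls:
        url = url.strip()
        if url and url not in seen:
            seen.append(url)
    return seen


def url_filter_lines(site_urls):
    """Return regex-urlfilter.txt rules that keep the crawl on listed hosts."""
    hosts = []
    for url in seed_lines(site_urls):
        host = urlparse(url).hostname
        if host and host not in hosts:
            hosts.append(host)
    # Accept any URL on a listed host, over http or https.
    rules = ["+^https?://" + re.escape(host) + "/" for host in hosts]
    # Reject everything that did not match a host rule above.
    rules.append("-.")
    return rules


if __name__ == "__main__":
    sites = ["http://www.example.edu/", "http://law.example.edu/"]
    print("\n".join(seed_lines(sites)))
    print("\n".join(url_filter_lines(sites)))
```

Generating both files from a single list keeps the seeds and the filter rules from drifting apart as sites are added or retired.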


