Posted to java-user@lucene.apache.org by jy...@yahoo.com on 2010/01/08 08:08:40 UTC

a complete solution for building a website search with lucene

Hi,

I am new to Lucene.

To build a web search function, I need a back-end indexing function. But before that, should I run a crawler? Lucene builds its index from HTML documents, while a crawler can turn the website's pages into HTML documents. Am I right?

If so, could anyone suggest a crawler, such as Nutch?
Thanks
Zhou





Re: a complete solution for building a website search with lucene

Posted by Simon Willnauer <si...@googlemail.com>.
You should really look at Nutch.
From the website, http://lucene.apache.org/nutch: "Nutch is open source
web-search software. It builds on Lucene Java (http://lucene.apache.org/java/),
adding web-specifics, such as a crawler, a link-graph database, parsers for
HTML and other document formats, etc."
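
To make the crawler and parser roles a bit more concrete, here is a rough sketch (not Nutch's actual code) of the step that sits between fetching a page and indexing it: pulling the outgoing links and the plain text out of the HTML. Lucene itself indexes fields of text, so something has to do this extraction first. The class name PageExtractor and its two methods are made up for illustration, and the regex-based handling is an assumption chosen only to keep the example JDK-only; a real parser copes with malformed markup, entities and character encodings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: crude HTML-to-text and link extraction for a crawler. */
final class PageExtractor {
    // Matches href="..." inside an <a> tag (double-quoted values only).
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    // Matches whole <script>...</script> and <style>...</style> elements.
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
            "<(script|style)\\b[^>]*>.*?</\\1>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Links the crawler could follow next, in document order. */
    static List<String> links(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    /** Visible text to hand to the indexer; entities are NOT decoded here. */
    static String text(String html) {
        String noScripts = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        String noTags = noScripts.replaceAll("<[^>]+>", " ");
        return noTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello <a href=\"/about\">about us</a></p></body></html>";
        System.out.println(links(html));
        System.out.println(text(html));
    }
}
```

A crawler would loop over this: fetch a URL, call links() to queue further pages, and pass text() to whatever does the indexing.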

Sounds like a good place to start, doesn't it? :)

simon

On Mon, Jan 11, 2010 at 2:47 AM, <jy...@yahoo.com> wrote:

> Hi,
>
> Have you implemented this kind of web search in your own web application
> development? Please be as detailed as possible, for example:
> 1) index: ?
> 2) search: Lucene
>
> Please advise.
>
> Thanks.

Re: a complete solution for building a website search with lucene

Posted by jy...@yahoo.com.
Thanks.

--- On Sat, 9/1/10, Simon Willnauer <si...@googlemail.com> wrote:

From: Simon Willnauer <si...@googlemail.com>
Subject: Re: a complete solution for building a website search with lucene
To: java-user@lucene.apache.org
Date: Saturday, 9 January, 2010, 6:16 PM

I don't know that much about Nutch, but Hadoop shouldn't really be run
under Windows in production. If you only use Windows for development, this
should not be a big issue.
Otis is right: you should use Cygwin together with Hadoop. Look at
http://wiki.apache.org/hadoop/FAQ for initial info.

simon


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org




      

Re: a complete solution for building a website search with lucene

Posted by jy...@yahoo.com.
Hi,

Have you implemented this kind of web search in your own web application development? Please be as detailed as possible, for example:
1) index: ?
2) search: Lucene

Please advise.

Thanks.



Re: a complete solution for building a website search with lucene

Posted by Simon Willnauer <si...@googlemail.com>.
I don't know that much about Nutch, but Hadoop shouldn't really be run
under Windows in production. If you only use Windows for development, this
should not be a big issue.
Otis is right: you should use Cygwin together with Hadoop. Look at
http://wiki.apache.org/hadoop/FAQ for initial info.

simon

On Sat, Jan 9, 2010 at 5:20 AM, Otis Gospodnetic
<ot...@yahoo.com> wrote:
> Nutch is written in Java, so Nutch itself *should* work on other non-Linux OSs that the JVM supports.
> But it does contain some shell scripts, as does Hadoop that Nutch uses.  Oh, I guess Windows people run it under Cygwin?
>  Otis
> --
> Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch

Re: a complete solution for building a website search with lucene

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Nutch is written in Java, so Nutch itself *should* work on other non-Linux OSs that the JVM supports.
But it does contain some shell scripts, as does Hadoop that Nutch uses.  Oh, I guess Windows people run it under Cygwin?
 Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



----- Original Message ----
> From: "jyzhou817@yahoo.com" <jy...@yahoo.com>
> To: java-user@lucene.apache.org
> Sent: Fri, January 8, 2010 5:03:41 AM
> Subject: Re: a complete solution for building a website search with lucene
> 
> Hi Paul,
> 
> Thanks. 
> So I would use Nutch to do the crawling, and integrate Lucene into the web
> application so that users can search online.
> 
> BTW, Nutch seems to have only a Linux version, but my development is on Windows.
> Am I right?
> 
> Zhou

Re: a complete solution for building a website search with lucene

Posted by jy...@yahoo.com.
Hi Paul,

Thanks. 
So I would use Nutch to do the crawling, and integrate Lucene into the web application so that users can search online.

BTW, Nutch seems to have only a Linux version, but my development is on Windows. Am I right?

Zhou

--- On Fri, 8/1/10, Paul Libbrecht <pa...@activemath.org> wrote:

From: Paul Libbrecht <pa...@activemath.org>
Subject: Re: a complete solution for building a website search with lucene
To: java-user@lucene.apache.org
Date: Friday, 8 January, 2010, 4:27 PM

Zhou,

Lucene is a back-end library. It is very useful for developers, but it is not a complete site search engine.
Nutch is a Lucene-based site search engine, and it does crawl.
Solr also provides functions close to these, with a lot of thought given to flexible integration; rather than crawling, it acquires content through feeds or other acquisition methods (see DIH, the DataImportHandler, for example).

paul





Re: a complete solution for building a website search with lucene

Posted by Paul Libbrecht <pa...@activemath.org>.
Zhou,

Lucene is a back-end library. It is very useful for developers, but it is
not a complete site search engine.
Nutch is a Lucene-based site search engine, and it does crawl.
Solr also provides functions close to these, with a lot of thought given to
flexible integration; rather than crawling, it acquires content through
feeds or other acquisition methods (see DIH, the DataImportHandler, for example).
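
As a minimal sketch of what "back-end library" means here, the toy class below shows the two jobs such a library takes on: building an index from text, and answering queries against it. This is NOT Lucene's API; TinyIndex, add() and search() are hypothetical names, and the example assumes the page text has already been fetched and extracted by something else (a crawler, a feed, a database import). It only illustrates why a crawler or other acquisition step is still needed around the library.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index: maps each term to the URLs of the pages containing it. */
final class TinyIndex {
    private static final String NON_WORD = "[^\\p{L}\\p{Nd}]+";
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Indexing step: split already-extracted page text into lower-case terms. */
    void add(String url, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split(NON_WORD)) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(url);
            }
        }
    }

    /** Search step: URLs containing every query term (boolean AND), sorted. */
    List<String> search(String query) {
        Set<String> hits = null;
        for (String term : query.toLowerCase(Locale.ROOT).split(NON_WORD)) {
            if (term.isEmpty()) {
                continue;
            }
            Set<String> docs = postings.getOrDefault(term, new TreeSet<>());
            if (hits == null) {
                hits = new TreeSet<>(docs);   // first term seeds the result set
            } else {
                hits.retainAll(docs);         // later terms narrow it down
            }
        }
        return hits == null ? new ArrayList<>() : new ArrayList<>(hits);
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        // In a real site search these (url, text) pairs come from the crawler.
        index.add("http://example.com/a", "Lucene is a search library");
        index.add("http://example.com/b", "Nutch adds a crawler on top of Lucene");
        System.out.println(index.search("lucene crawler"));
    }
}
```

A real library adds analysis, ranking, on-disk storage and much more on top of this idea, but the division of labour is the same: something else supplies the text, the library indexes and searches it.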

paul




On 8 Jan 2010, at 08:08, <jy...@yahoo.com> wrote:

> Hi,
>
> I am new to Lucene.
>
> To build a web search function, I need a back-end indexing function.
> But before that, should I run a crawler? Lucene builds its index from
> HTML documents, while a crawler can turn the website's pages into
> HTML documents. Am I right?
>
> If so, could anyone suggest a crawler, such as Nutch?
> Thanks
> Zhou