Posted to dev@nutch.apache.org by Klaus <kl...@vommond.de> on 2005/11/09 14:48:35 UTC

Lucene or Nutch

Hello,

 

my name is Klaus and I'm a new member of this mailing list. I'm currently
working on my master's thesis. One of my tasks is to add full-text search
to an existing information system. Browsing the web, I found Lucene and
Nutch. Unfortunately, I'm not sure which of these tools fits my project
best. Let me briefly outline the requirements:

 

1)       Integration into an existing information system

2)       Full-text search on all objects of the information system

3)       Full-text search on PDF and Word documents attached to the objects
of the information system.

 


From my point of view, I would prefer Lucene, because I don't need the UI
etc. On the other hand, I would like to use the Word and PDF parsers and the
LanguageIdentifier. Do you see any problems using these classes with
Lucene?


Thanks


Klaus


 


Re: Lucene or Nutch

Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> I would be disappointed by this move - language identifier is an 
> important component in Nutch. Now the mere fact that it's bundled with 
> Nutch encourages its proper maintenance. If there is enough drive in 
> terms of willingness and long-term commitment it would make sense to 
> move it to a separate project on its own (or maybe as a part of Jakarta 
> Commons), but moving it into a catch-all purely optional category like 
> Lucene contrib would increase risks that it slides into oblivion...

In 1.9 and beyond the plan is to build and distribute the contrib with 
Lucene.  So 'ant test' in Lucene should test contrib too, etc.  The 
point is to make sure that this stuff is maintained, but not to merge it 
into the core.  So stuff in contrib should not slide into oblivion.

Doug

Re: Lucene or Nutch

Posted by Andrzej Bialecki <ab...@getopt.org>.
Jérôme Charron wrote:

> > I would be disappointed by this move - language identifier is an
> > important component in Nutch. Now the mere fact that it's bundled
> > with Nutch encourages its proper maintenance. If there is enough
> > drive in terms of willingness and long-term commitment it would
> > make sense to move it to a separate project on its own (or maybe as
> > a part of Jakarta Commons), but moving it into a catch-all purely
> > optional category like Lucene contrib would increase risks that it
> > slides into oblivion...
>
>
>  OK, Andrzej, I really understand your point. But more and more
>  people are contacting me directly in order to use the
>  language-identifier, not as a Nutch plugin but simply as a
>  standalone library. They get confused when I explain to them that
>  they need the Nutch jar in order to use the language-identifier.
>  That's why I would like to make it a standalone jar. A short-term
>  solution could be to move the core classes (which have no
>  dependencies on Nutch) to a new lib-plugin (lib-lang, for instance)
>  and to add a dependency on this plugin in the language-identifier,
>  so that this code could be used as a standalone lib.
>
>  Are you OK with such changes?
>

Yes, certainly, it's a good intermediate step before moving it to a 
separate project.

There are some other things that Doug mentioned that he would like to 
separate from Nutch, like the IO and mapred frameworks. A similar 
approach could be taken with these parts - this would encourage good 
separation in design, and also prepare these parts to be separated into 
their own projects.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Lucene or Nutch

Posted by Sami Siren <s....@sonera.inet.fi>.
Jérôme Charron wrote:
> jar. A short-term solution could be to move the core classes (which have
> no dependencies on Nutch) to a new lib-plugin (lib-lang, for instance)
> and to add a dependency on this plugin in the language-identifier, so
> that this code could be used as a standalone lib.
>
> Are you OK with such changes?

Perhaps you could isolate the n-gram-specific stuff into its own plugin and
the lang-id into another.

Or the other option would be (what I implemented some time ago)
something like this (as an n-gram categorizer can also be used for other
interesting stuff):

new package in core nutch containing classes like:

NGramProfile - pretty much as is
Categorizer - generic configurable ngram categorizer, configure 
profiles, ngram sizes etc.
CategorizerFactory - to get hold of different categorizers

In the LangId plugin you just get a correctly configured categorizer (set up
to use the language n-gram profiles and predefined settings for n-gram sizes
etc.) from the factory and tell it to do its job when needed.
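
The three classes above could be sketched roughly as follows. This is only a
minimal illustration, not the Nutch code: the class bodies, the method names
(addProfile, categorize, languageCategorizer, sampleCategorizer), the settings
(n-gram sizes 1 to 3, at most 400 n-grams per profile) and the tiny training
sentences are all assumptions, and the ranking uses a simple "out-of-place"
distance between ranked n-gram profiles.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class NGramCategorizerSketch {

    /** Ranked character n-gram profile of a text (simplified; not the Nutch class). */
    static final class NGramProfile {
        final String name;
        final Map<String, Integer> ranks = new HashMap<>();

        NGramProfile(String name, String text, int minN, int maxN, int maxSize) {
            this.name = name;
            // Lower-case and turn every run of non-letters into a word-boundary marker.
            String s = "_" + text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", "_") + "_";
            Map<String, Integer> counts = new HashMap<>();
            for (int n = minN; n <= maxN; n++) {
                for (int i = 0; i + n <= s.length(); i++) {
                    counts.merge(s.substring(i, i + n), 1, Integer::sum);
                }
            }
            // Most frequent first; ties broken alphabetically so ranks are deterministic.
            List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
            sorted.sort((a, b) -> {
                int byCount = Integer.compare(b.getValue(), a.getValue());
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            });
            for (int i = 0; i < sorted.size() && i < maxSize; i++) {
                ranks.put(sorted.get(i).getKey(), i);
            }
        }
    }

    /** Generic categorizer: returns the profile with the smallest out-of-place distance. */
    static final class Categorizer {
        private final List<NGramProfile> profiles = new ArrayList<>();
        private final int minN, maxN, maxSize;

        Categorizer(int minN, int maxN, int maxSize) {
            this.minN = minN;
            this.maxN = maxN;
            this.maxSize = maxSize;
        }

        void addProfile(String name, String trainingText) {
            profiles.add(new NGramProfile(name, trainingText, minN, maxN, maxSize));
        }

        String categorize(String text) {
            NGramProfile doc = new NGramProfile("doc", text, minN, maxN, maxSize);
            String best = null;
            long bestDistance = Long.MAX_VALUE;
            for (NGramProfile p : profiles) {
                long d = 0;
                for (Map.Entry<String, Integer> e : doc.ranks.entrySet()) {
                    Integer r = p.ranks.get(e.getKey());
                    // An n-gram missing from the category profile gets the maximum penalty.
                    d += (r == null) ? maxSize : Math.abs(r - e.getValue());
                }
                if (d < bestDistance) {
                    bestDistance = d;
                    best = p.name;
                }
            }
            return best;
        }
    }

    /** Hands out categorizers preconfigured for one purpose (here: language id). */
    static final class CategorizerFactory {
        static Categorizer languageCategorizer(Map<String, String> trainingTexts) {
            Categorizer c = new Categorizer(1, 3, 400); // assumed settings
            trainingTexts.forEach(c::addProfile);
            return c;
        }
    }

    /** Toy training data, far too small for real use. */
    static Categorizer sampleCategorizer() {
        Map<String, String> training = new LinkedHashMap<>();
        training.put("en", "the quick brown fox jumps over the lazy dog and the cat sat on the mat");
        training.put("de", "der schnelle braune fuchs springt ueber den faulen hund und die katze sitzt auf der matte");
        return CategorizerFactory.languageCategorizer(training);
    }

    public static void main(String[] args) {
        Categorizer c = sampleCategorizer();
        System.out.println(c.categorize("dog and the cat"));
        System.out.println(c.categorize("hund und die katze"));
    }
}
```

With this split, a language-identifier plugin would only need the factory
call, and the same Categorizer could be reused with other profile sets.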

--
  Sami Siren

Re: Lucene or Nutch

Posted by Jérôme Charron <je...@gmail.com>.
> I would be disappointed by this move - language identifier is an
> important component in Nutch. Now the mere fact that it's bundled with
> Nutch encourages its proper maintenance. If there is enough drive in
> terms of willingness and long-term commitment it would make sense to
> move it to a separate project on its own (or maybe as a part of Jakarta
> Commons), but moving it into a catch-all purely optional category like
> Lucene contrib would increase risks that it slides into oblivion...

OK, Andrzej, I really understand your point.
But more and more people are contacting me directly in order to use the
language-identifier, not as a Nutch plugin but simply as a standalone
library. They get confused when I explain to them that they need the Nutch
jar in order to use the language-identifier. That's why I would like to make
it a standalone jar. A short-term solution could be to move the core classes
(which have no dependencies on Nutch) to a new lib-plugin (lib-lang, for
instance) and to add a dependency on this plugin in the language-identifier,
so that this code could be used as a standalone lib.

Are you OK with such changes?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: Lucene or Nutch

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:

> Jérôme Charron wrote:
>
>> In fact, I think it could be a good idea to move the nutch language
>> identifier core code to a standalone library or to lucene code.
>> Does it make sense? What do you think about it? What is the best
>> solution (standalone vs lucene)?
>
>
> One could put it in the lucene contrib directory.


I would be disappointed by this move - language identifier is an 
important component in Nutch. Now the mere fact that it's bundled with 
Nutch encourages its proper maintenance. If there is enough drive in 
terms of willingness and long-term commitment it would make sense to 
move it to a separate project on its own (or maybe as a part of Jakarta 
Commons), but moving it into a catch-all purely optional category like 
Lucene contrib would increase risks that it slides into oblivion...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Lucene or Nutch

Posted by Doug Cutting <cu...@nutch.org>.
Jérôme Charron wrote:
> In fact, I think it could be a good idea to move the nutch language
> identifier core code to a standalone library or to lucene code.
> Does it make sense? What do you think about it? What is the best
> solution (standalone vs lucene)?

One could put it in the lucene contrib directory.

Doug

Re: Lucene or Nutch

Posted by Jérôme Charron <je...@gmail.com>.
> Yes, Lucene is the best fit for what you're after. Nutch is built on
> Lucene, and adds web crawling on top. You don't need a web crawler,
> so using Lucene directly is the best fit - of course you'll have to
> write code to integrate Lucene.

Erik,

I have been thinking about this for a while, but never took the time to
raise it. This mail is a good opportunity...
In fact, I think it could be a good idea to move the Nutch language
identifier core code to a standalone library or into the Lucene code.
Does it make sense? What do you think about it? What is the best solution
(standalone vs. Lucene)?
Doug?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: Lucene or Nutch

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Yes, Lucene is the best fit for what you're after.  Nutch is built on  
Lucene, and adds web crawling on top.  You don't need a web crawler,  
so using Lucene directly is the best fit - of course you'll have to  
write code to integrate Lucene.
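
To give a rough idea of what that integration code involves, here is a
minimal sketch. It does not use the Lucene API: ToyIndex is a toy in-memory
inverted index standing in for a real Lucene index, and InfoObject, its
fields and the indexObject method are hypothetical names. The attachment
text is assumed to have been extracted from the PDF or Word file already by
some parser.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class IntegrationSketch {

    /** Hypothetical object of the information system. */
    static final class InfoObject {
        final String id;
        final String title;
        final List<String> attachmentTexts; // text already extracted from PDF/Word files

        InfoObject(String id, String title, List<String> attachmentTexts) {
            this.id = id;
            this.title = title;
            this.attachmentTexts = attachmentTexts;
        }
    }

    /** Toy inverted index (term -> object ids); a stand-in for a real Lucene index. */
    static final class ToyIndex {
        private final Map<String, Set<String>> postings = new HashMap<>();

        void add(String id, String text) {
            for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
                }
            }
        }

        Set<String> search(String term) {
            return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
        }
    }

    /** The glue code: flatten one object and its attachments into searchable text. */
    static void indexObject(ToyIndex index, InfoObject o) {
        index.add(o.id, o.title);
        for (String text : o.attachmentTexts) {
            index.add(o.id, text);
        }
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        indexObject(index, new InfoObject("obj-1", "Thesis proposal",
                Arrays.asList("Full text search with an inverted index")));
        indexObject(index, new InfoObject("obj-2", "Meeting notes",
                Collections.<String>emptyList()));
        System.out.println(index.search("inverted"));
    }
}
```

In a real integration the add and search calls would be replaced by the
corresponding Lucene indexing and query classes; the mapping from objects
and attachments to indexed text is the part you write yourself.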

     Erik


On 9 Nov 2005, at 08:48, Klaus wrote:

> Hello,
>
> my name is Klaus and I'm a new member of this mailing list. I'm currently
> working on my master's thesis. One of my tasks is to add full-text search
> to an existing information system. Browsing the web, I found Lucene and
> Nutch. Unfortunately, I'm not sure which of these tools fits my project
> best. Let me briefly outline the requirements:
>
> 1)       Integration into an existing information system
>
> 2)       Full-text search on all objects of the information system
>
> 3)       Full-text search on PDF and Word documents attached to the objects
> of the information system.
>
> From my point of view, I would prefer Lucene, because I don't need the UI
> etc. On the other hand, I would like to use the Word and PDF parsers and the
> LanguageIdentifier. Do you see any problems using these classes with
> Lucene?
>
> Thanks
>
> Klaus