
Posted to dev@lucene.apache.org by Patrick Debois <Pa...@sos.be> on 2003/11/06 14:14:11 UTC

Java TextCat 0.1

A Java interface to libtextcat. Might be of interest to you (according
to the mailing lists).

I've used it for choosing the correct analyzer in Lucene Snowball.
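
A minimal sketch of that selection step, assuming the language guesser
returns ISO 639-1 codes such as "en" or "de" (the actual output format of
JTextCat is not described in this thread). The class name, method name and
the default are hypothetical; the returned strings are Snowball stemmer
names, which a Snowball-based analyzer would typically take as its argument.

```java
import java.util.Locale;
import java.util.Map;

class AnalyzerChooser {
    // Hypothetical mapping from a guessed language code to a Snowball stemmer name.
    private static final Map<String, String> STEMMERS = Map.of(
        "en", "English",
        "de", "German",
        "fr", "French",
        "nl", "Dutch");

    // Assumption: fall back to English when the guess is missing or unknown.
    static final String DEFAULT_STEMMER = "English";

    static String stemmerNameFor(String languageCode) {
        if (languageCode == null) {
            return DEFAULT_STEMMER;
        }
        return STEMMERS.getOrDefault(languageCode.toLowerCase(Locale.ROOT), DEFAULT_STEMMER);
    }

    public static void main(String[] args) {
        System.out.println(stemmerNameFor("de")); // German
        System.out.println(stemmerNameFor("xx")); // English (fallback)
    }
}
```

The same lookup would have to be used at index time and at query time, so
that a document and the queries against it are stemmed the same way.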

I will provide it on my website http://www.jedi.be/JTextCat/index.html

Hope it does not violate any copyrights.

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

Re: Java TextCat 0.1

Posted by Pete Lewis <pe...@uptima.co.uk>.

Hi Maurits

With the language guesser it doesn't matter whether they are in one index or
in language-specific indexes; it is more a question of how you want to
organise your data.  Even if you have separate language dictionaries, I think
that it would be best to have a language field holding the guessed language
of the document.
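
A minimal sketch of the language-field idea, using plain maps instead of
Lucene's document classes so that it stays self-contained. The field names
"contents" and "language" and the helper methods are hypothetical; in a real
index the language would be stored as an ordinary keyword-style field and
used to restrict the search.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LanguageFieldDemo {
    static final String LANGUAGE_FIELD = "language"; // hypothetical field name

    // Build a "document" that carries the guessed language next to its text.
    static Map<String, String> makeDocument(String contents, String guessedLanguage) {
        Map<String, String> doc = new HashMap<>();
        doc.put("contents", contents);
        doc.put(LANGUAGE_FIELD, guessedLanguage);
        return doc;
    }

    // Keep only the documents whose language field matches the requested language.
    static List<Map<String, String>> filterByLanguage(List<Map<String, String>> docs,
                                                      String language) {
        List<Map<String, String>> hits = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            if (language.equals(doc.get(LANGUAGE_FIELD))) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = new ArrayList<>();
        docs.add(makeDocument("Hello world", "en"));
        docs.add(makeDocument("Hallo Welt", "de"));
        System.out.println(filterByLanguage(docs, "de").size());
    }
}
```

With such a field the choice between one combined index and several
language-specific indexes becomes a matter of organisation only.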

An alternative would be language tagging, where you embed language tags into
the document and in this way can correctly handle documents that comprise
more than one language. Unfortunately, I don't think that there are any
open-source language taggers.

Cheers

Pete

----- Original Message ----- 
From: "maurits van wijland" <m....@quicknet.nl>
To: "Lucene Developers List" <lu...@jakarta.apache.org>
Sent: Saturday, November 08, 2003 7:30 AM
Subject: Re: Java TextCat 0.1


> Pete,
>
> It's because I think of a search engine as a guided search engine. It
> should offer the 'end-user' help when trying to find information. So a
> drop-down should not be included in the search interface.
>
> Of course a drop-down is a good method to choose a query language. Are the
> different languages in different indexes or are they all combined into one?
>
> chrs,
>
> Maurits



Re: Java TextCat 0.1

Posted by maurits van wijland <m....@quicknet.nl>.

Pete,

It's because I think of a search engine as a guided search engine. It
should offer the 'end-user' help when trying to find information. So a
drop-down should not be included in the search interface.

Of course a drop-down is a good method to choose a query language. Are the
different languages in different indexes or are they all combined into one?

chrs,

Maurits

----- Original Message ----- 
From: "Pete Lewis" <pe...@uptima.co.uk>
To: "Lucene Developers List" <lu...@jakarta.apache.org>
Sent: Friday, November 07, 2003 8:58 PM
Subject: Re: Java TextCat 0.1


> Hi Maurits
>
> Language guessing is OK for documents where you have a fair amount of text
> to play with; search clues, however, are much shorter - often just a word or
> two.  So why not have a default query language and then just have a
> drop-down box to let the user select the query language if it differs
> from the default?
>
> Cheers
>
> Pete



Re: Java TextCat 0.1

Posted by Pete Lewis <pe...@uptima.co.uk>.

Hi Maurits

Language guessing is OK for documents where you have a fair amount of text
to play with; search clues, however, are much shorter - often just a word or
two.  So why not have a default query language and then just have a
drop-down box to let the user select the query language if it differs
from the default?
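
A minimal sketch of that suggestion, assuming the drop-down submits a
language code or nothing at all. The class name, the method name and the
default "en" are hypothetical.

```java
import java.util.Locale;

class QueryLanguage {
    // Assumption: the site-wide default query language.
    static final String DEFAULT_LANGUAGE = "en";

    // Use the drop-down value when the user picked one, otherwise the default.
    static String resolve(String selectedInDropDown) {
        if (selectedInDropDown == null || selectedInDropDown.trim().isEmpty()) {
            return DEFAULT_LANGUAGE;
        }
        return selectedInDropDown.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(resolve(null)); // en
        System.out.println(resolve("NL")); // nl
    }
}
```

No guessing is involved at query time; the resolved code only decides which
analyzer is applied to the query.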

Cheers

Pete

----- Original Message ----- 
From: "maurits van wijland" <m....@quicknet.nl>
To: "Lucene Developers List" <lu...@jakarta.apache.org>
Sent: Friday, November 07, 2003 7:12 PM
Subject: Re: Java TextCat 0.1


> Hi all,
>
> Incze, do you choose the analyzer when indexing and searching? How?
> Can you send some example code?
>
> I have tried this with a naive Bayes language guesser, but the problem I
> found is that when searching, the query words are too 'small' to accurately
> predict a language...
>
> So, how do you manage?
>
> kind regards,
>
> Maurits van Wijland



Re: Java TextCat 0.1

Posted by maurits van wijland <m....@quicknet.nl>.

Hi all,

Incze, do you choose the analyzer when indexing and searching? How?
Can you send some example code?

I have tried this with a naive Bayes language guesser, but the problem I
found is that when searching, the query words are too 'small' to accurately
predict a language...
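
To illustrate why short queries give a guesser so little to work with, here
is a small sketch that counts distinct character trigrams. Trigrams are used
only as a stand-in feature for the illustration (the guesser described above
is naive Bayes, and its features are not given in this thread); the class
and method names are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

class TrigramDemo {
    // Collect the distinct character trigrams of a lower-cased, whitespace-normalised text.
    static Set<String> trigrams(String text) {
        String normalized = text.toLowerCase(Locale.ROOT).trim().replaceAll("\\s+", " ");
        Set<String> result = new LinkedHashSet<>();
        for (int i = 0; i + 3 <= normalized.length(); i++) {
            result.add(normalized.substring(i, i + 3));
        }
        return result;
    }

    public static void main(String[] args) {
        // A one-word query yields only a handful of features.
        System.out.println(trigrams("hotel"));
        // A full sentence yields many more.
        System.out.println(trigrams("Language guessing is OK for longer documents").size());
    }
}
```

A query such as "hotel" produces just three trigrams, and the word itself is
spelled the same in several languages, so any statistical guess on it is
weak.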

So, how do you manage?

kind regards,

Maurits van Wijland


----- Original Message ----- 
From: "Incze Lajos" <in...@mail.matav.hu>
To: "Lucene Developers List" <lu...@jakarta.apache.org>
Sent: Friday, November 07, 2003 2:31 AM
Subject: Re: Java TextCat 0.1


> On Thu, Nov 06, 2003 at 02:14:11PM +0100, Patrick Debois wrote:
> > Java interfacing with libtextcat. Might be of interest for you (according
> > to the mailing lists)
> >
> > I've used it for choosing the correct analyzer in Lucene Snowball
> >
> > I will provide it on my website http://www.jedi.be/JTextCat/index.html
> >
> > Hope it does not violate any copyrights.
> >
> > ---------------------------------------------------------------------
>
> Have you seen this project?
>
> http://ngramj.sourceforge.net/
>
> (Pure java N-Gram lib, with a sample servlet.)
>
> incze

Re: Java TextCat 0.1

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

Incze and Patrick - I moderated both of your messages in on this
subject.  In the future please subscribe to the list you're posting to,
or risk having them discarded.

	Erik


On Thursday, November 6, 2003, at 08:31  PM, Incze Lajos wrote:

> On Thu, Nov 06, 2003 at 02:14:11PM +0100, Patrick Debois wrote:
>> Java interfacing with libtextcat. Might be of interest for you (according
>> to the mailing lists)
>>
>> I've used it for choosing the correct analyzer in Lucene Snowball
>>
>> I will provide it on my website http://www.jedi.be/JTextCat/index.html
>>
>> Hope it does not violate any copyrights.
>>
>> ---------------------------------------------------------------------
>
> Have you seen this project?
>
> http://ngramj.sourceforge.net/
>
> (Pure java N-Gram lib, with a sample servlet.)
>
> incze

Re: Java TextCat 0.1

Posted by Incze Lajos <in...@mail.matav.hu>.

On Thu, Nov 06, 2003 at 02:14:11PM +0100, Patrick Debois wrote:
> Java interfacing with libtextcat. Might be of interest for you (according
> to the mailing lists)
> 
> I've used it for choosing the correct analyzer in Lucene Snowball
> 
> I will provide it on my website http://www.jedi.be/JTextCat/index.html
> 
> Hope it does not violate any copyrights.
> 
> ---------------------------------------------------------------------

Have you seen this project?

http://ngramj.sourceforge.net/

(Pure java N-Gram lib, with a sample servlet.)

incze


RE: Lucene sandbox

Posted by Gregor Heinrich <he...@igd.fhg.de>.

Good to hear that. Thanks for the info,

Gregor

-----Original Message-----
From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
Sent: Thursday, November 06, 2003 3:28 PM
To: Lucene Developers List; heinrich@igd.fhg.de
Subject: Re: Lucene sandbox

Excellent!  I read Mark Rosen's paper and was going to see if it made
any sense to make similar modifications to Lucene... when I have some
more free time, ha ha ha.

Regarding Sandbox contributions, it's simple.
When you have everything (including a short README and
build.xml/project.xml) zip it up and stick it in Bugzilla.

The sources should contain ASL at the top, as seen in Lucene's sources.

I think that is all.

Thanks,
Otis
P.S.
Another person recently expressed interest in adding Term Vector
support.  I wonder if he has been doing that and I wonder if this
resulted in duplicated effort.



Re: Lucene sandbox

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Excellent!  I read Mark Rosen's paper and was going to see if it made
any sense to make similar modifications to Lucene... when I have some
more free time, ha ha ha.

Regarding Sandbox contributions, it's simple.
When you have everything (including a short README and
build.xml/project.xml) zip it up and stick it in Bugzilla.

The sources should contain ASL at the top, as seen in Lucene's sources.

I think that is all.

Thanks,
Otis
P.S.
Another person recently expressed interest in adding Term Vector
support.  I wonder if he has been doing that and I wonder if this
resulted in duplicated effort.

--- Gregor Heinrich <he...@igd.fhg.de> wrote:
> Hi,
> 
> what is the workflow to upload code to the Lucene sandbox CVS repository? I
> am working on a document vector storage that is based on 1.3-RC2 and Mark
> Rosen's Haystack forward indexer (see past mailing list communication) and
> would like to make this work available once it is in a uploadable state.
> 
> Best regards,
> 
> Gregor

Lucene sandbox

Posted by Gregor Heinrich <he...@igd.fhg.de>.

Hi,

What is the workflow to upload code to the Lucene sandbox CVS repository? I
am working on a document vector storage that is based on 1.3-RC2 and Mark
Rosen's Haystack forward indexer (see past mailing list communication) and
would like to make this work available once it is in an uploadable state.

Best regards,

Gregor


