Posted to java-user@lucene.apache.org by Robert Muir <rc...@gmail.com> on 2009/06/03 18:12:51 UTC

Re: How to support stemming and case folding for english content mixed with non-english content?

KK, is all of your Latin-script text actually English? Is there stuff like
German or French mixed in?

And for your non-English content (your examples have been Indian writing
systems), is it generally true that if you have Devanagari, you can assume
it's Hindi? Or is there stuff like Marathi mixed in?

The reason I ask is that to invoke the right stemmers you really need some
language detection, but perhaps in your case you can cheat and detect the
language based on scripts...

Thanks,
Robert


On Wed, Jun 3, 2009 at 10:15 AM, KK <di...@gmail.com> wrote:

> Hi All,
> I'm indexing some non-English content, but the pages also contain English
> content. As of now I'm using WhitespaceAnalyzer for all content, and I'm
> storing the full webpage content under a single field. Now we need to
> support case folding and stemming for the English content intermingled
> with the non-English content. I must mention that we don't have stemming
> and case folding for the non-English content. I'm stuck with this. Could
> someone let me know how to proceed to fix this issue?
>
> Thanks,
> KK.
>



-- 
Robert Muir
rcmuir@gmail.com

Re: How to support stemming and case folding for english content mixed with non-english content?

Posted by KK <di...@gmail.com>.
Robert, I tried to use the WordDelimiterFilterFactory as well, but I faced
the same problem [I saw Solr using the factory instead of the filter]. I
think I should try the first option [copying the code to a local directory]
and use it with the options you mentioned. I'll try it out and will post
the results here.
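
For anyone following along, here is a minimal sketch of getting the filter
from the factory. It assumes the Solr 1.3-era factory API, where a filter
factory is configured via init(Map) and then wraps a TokenStream via
create(); the class name below is made up, and the option values mirror the
ones Robert listed:

import java.io.Reader;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.solr.analysis.WordDelimiterFilterFactory;

public class WordDelimiterFactoryDemo {
  public static TokenStream build(Reader reader) {
    // Same options Solr's default 'text' field type uses at index time.
    Map<String, String> args = new HashMap<String, String>();
    args.put("generateWordParts", "1");
    args.put("generateNumberParts", "1");
    args.put("catenateWords", "1");
    args.put("catenateNumbers", "1");
    args.put("catenateAll", "0");
    args.put("splitOnCaseChange", "1");

    WordDelimiterFilterFactory factory = new WordDelimiterFilterFactory();
    factory.init(args);

    // create() returns the package-private WordDelimiterFilter; assigning
    // it to TokenStream avoids having to name the inaccessible type.
    return factory.create(new WhitespaceTokenizer(reader));
  }
}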

Thanks,
KK.
On Fri, Jun 5, 2009 at 7:37 PM, Robert Muir <rc...@gmail.com> wrote:

> kk an easier solution to your first problem is to use
> worddelimiterfilterfactory if possible... you can get an instance of
> worddelimiter filter from that.
>
> thanks,
> robert
>
> On Fri, Jun 5, 2009 at 10:06 AM, Robert Muir<rc...@gmail.com> wrote:
> > kk as for your first issue, that WordDelimiterFilter is package
> > protected, one option is to make a copy of the code and change the
> > class declaration to public.
> > the other option is to put your entire analyzer in
> > 'org.apache.solr.analysis' package so that you can access it...
> >
> > for the 2nd issue, yes you need to supply some options to it. the
> > default options solr applies to type 'text' seemed to work well for me
> > with indic:
> >
> > {splitOnCaseChange=1, generateNumberParts=1, catenateWords=1,
> > generateWordParts=1, catenateAll=0, catenateNumbers=1}
> >
> > On Fri, Jun 5, 2009 at 9:12 AM, KK <di...@gmail.com> wrote:
> >>
> >> Thanks Robert. There is one problem though, I'm able to plugin the word
> >> delimiter filter from solr-nightly jar file. When I tried to do
> something
> >> like,
> >>  TokenStream ts = new WhitespaceTokenizer(reader);
> >>   ts = new WordDelimiterFilter(ts);
> >>   ts = new PorterStemmerFilter(ts);
> >>   ...rest as in the last mail...
> >>
> >> It gave me an error saying that
> >>
> >> org.apache.solr.analysis.WordDelimiterFilter is not public in
> >> org.apache.solr.analysis; cannot be accessed from outside package
> >> import org.apache.solr.analysis.WordDelimiterFilter;
> >>                               ^
> >> solrSearch/IndicAnalyzer.java:38: cannot find symbol
> >> symbol  : class WordDelimiterFilter
> >> location: class solrSearch.IndicAnalyzer
> >>    ts = new WordDelimiterFilter(ts);
> >>             ^
> >> 2 errors
> >>
> >> Then i tried to see the code for worddelimitefiter from solrnightly src
> and
> >> found that there are many deprecated constructors though they require a
> lot
> >> of parameters alongwith tokenstream. I went through the solr wiki for
> >> worddelimiterfilterfactory here,
> >>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-1c9b83870ca7890cd73b193cefed83c283339089
> >> and say that there also its specified that we've to mention the
> parameters
> >> and both are different for indexing and querying.
> >> I'm kind of stuck here, how do I make use of worddelimiterfilter in my
> >> custom analyzer, I've to use it anyway.
> >> In my code I've to make use of worddelimiterfilter and not
> >> worddelimiterfilterfactory, right? I don't know whats the use of the
> other
> >> one. Anyway can you guide me getting rid of the above error. And yes
> I'll
> >> change the order of applying the filters as you said.
> >>
> >> Thanks,
> >> KK.
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Fri, Jun 5, 2009 at 5:48 PM, Robert Muir <rc...@gmail.com> wrote:
> >>
> >> > KK, you got the right idea.
> >> >
> >> > though I think you might want to change the order, move the stopfilter
> >> > before the porter stem filter... otherwise it might not work
> correctly.
> >> >
> >> > On Fri, Jun 5, 2009 at 8:05 AM, KK <di...@gmail.com>
> wrote:
> >> >
> >> > > Thanks Robert. This is exactly what I did and  its working but
> delimiter
> >> > is
> >> > > missing I'm going to add that from solr-nightly.jar
> >> > >
> >> > > /**
> >> > >  * Analyzer for Indian language.
> >> > >  */
> >> > > public class IndicAnalyzer extends Analyzer {
> >> > >  public TokenStream tokenStream(String fieldName, Reader reader) {
> >> > >     TokenStream ts = new WhitespaceTokenizer(reader);
> >> > >    ts = new PorterStemFilter(ts);
> >> > >    ts = new LowerCaseFilter(ts);
> >> > >    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
> >> > >    return ts;
> >> > >  }
> >> > > }
> >> > >
> >> > > Its able to do stemming/case-folding and supports search for both
> english
> >> > > and indic texts. let me try out the delimiter. Will update you on
> that.
> >> > >
> >> > > Thanks a lot.
> >> > > KK
> >> > >
> >> > > On Fri, Jun 5, 2009 at 5:30 PM, Robert Muir <rc...@gmail.com>
> wrote:
> >> > >
> >> > > > i think you are on the right track... once you build your
> analyzer, put
> >> > > it
> >> > > > in your classpath and play around with it in luke and see if it
> does
> >> > what
> >> > > > you want.
> >> > > >
> >> > > > On Fri, Jun 5, 2009 at 3:19 AM, KK <di...@gmail.com>
> wrote:
> >> > > >
> >> > > > > Hi Robert,
> >> > > > > This is what I copied from ThaiAnalyzer @ lucene contrib
> >> > > > >
> >> > > > > public class ThaiAnalyzer extends Analyzer {
> >> > > > >  public TokenStream tokenStream(String fieldName, Reader reader)
> {
> >> > > > >      TokenStream ts = new StandardTokenizer(reader);
> >> > > > >    ts = new StandardFilter(ts);
> >> > > > >    ts = new ThaiWordFilter(ts);
> >> > > > >    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
> >> > > > >    return ts;
> >> > > > >  }
> >> > > > > }
> >> > > > >
> >> > > > > Now as you said, I've to use whitespacetokenizer
> >> > > > > withworddelimitefilter[solr
> >> > > > > nightly.jar] stop wordremoval, porter stemmer etc , so it is
> >> > something
> >> > > > like
> >> > > > > this,
> >> > > > > public class IndicAnalyzer extends Analyzer {
> >> > > > >  public TokenStream tokenStream(String fieldName, Reader reader)
> {
> >> > > > >   TokenStream ts = new WhiteSpaceTokenizer(reader);
> >> > > > >   ts = new WordDelimiterFilter(ts);
> >> > > > >   ts = new LowerCaseFilter(ts);
> >> > > > >   ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS)   //
> >> > english
> >> > > > > stop filter, is this the default one?
> >> > > > >   ts = new PorterFilter(ts);
> >> > > > >   return ts;
> >> > > > >  }
> >> > > > > }
> >> > > > >
> >> > > > > Does this sound OK? I think it will do the job...let me try it
> out..
> >> > > > > I dont need custom filter as per my requirement, at least not
> for
> >> > these
> >> > > > > basic things I'm doing? I think so...
> >> > > > >
> >> > > > > Thanks,
> >> > > > > KK.
> >> > > > >
> >> > > > >
> >> > > > > On Thu, Jun 4, 2009 at 6:36 PM, Robert Muir <rc...@gmail.com>
> >> > wrote:
> >> > > > >
> >> > > > > > KK well you can always get some good examples from the lucene
> >> > contrib
> >> > > > > > codebase.
> >> > > > > > For example, look at the DutchAnalyzer, especially:
> >> > > > > >
> >> > > > > > TokenStream tokenStream(String fieldName, Reader reader)
> >> > > > > >
> >> > > > > > See how it combines a specified tokenizer with various
> filters?
> >> > this
> >> > > is
> >> > > > > > what
> >> > > > > > you want to do, except of course you want to use different
> >> > tokenizer
> >> > > > and
> >> > > > > > filters.
> >> > > > > >
> >> > > > > > On Thu, Jun 4, 2009 at 8:53 AM, KK <
> dioxide.software@gmail.com>
> >> > > wrote:
> >> > > > > >
> >> > > > > > > Thanks Muir.
> >> > > > > > > Thanks for letting me know that I dont need language
> identifiers.
> >> > > > > > >  I'll have a look and will try to write the analyzer. For my
> case
> >> > I
> >> > > > > think
> >> > > > > > > it
> >> > > > > > > wont be that difficult.
> >> > > > > > > BTW, can you point me to some sample codes/tutorials writing
> >> > custom
> >> > > > > > > analyzers. I could not find something in LIA2ndEdn. Is
> something
> >> > > > htere?
> >> > > > > > do
> >> > > > > > > let me know.
> >> > > > > > >
> >> > > > > > > Thanks,
> >> > > > > > > KK.
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > On Thu, Jun 4, 2009 at 6:19 PM, Robert Muir <
> rcmuir@gmail.com>
> >> > > > wrote:
> >> > > > > > >
> >> > > > > > > > KK, for your case, you don't really need to go to the
> effort of
> >> > > > > > detecting
> >> > > > > > > > whether fragments are english or not.
> >> > > > > > > > Because the English stemmers in lucene will not modify
> your
> >> > Indic
> >> > > > > text,
> >> > > > > > > and
> >> > > > > > > > neither will the LowerCaseFilter.
> >> > > > > > > >
> >> > > > > > > > what you want to do is create a custom analyzer that works
> like
> >> > > > this
> >> > > > > > > >
> >> > > > > > > > -WhitespaceTokenizer with WordDelimiterFilter [from Solr
> >> > nightly
> >> > > > > jar],
> >> > > > > > > > LowerCaseFilter, StopFilter, and PorterStemFilter-
> >> > > > > > > >
> >> > > > > > > > Thanks,
> >> > > > > > > > Robert
> >> > > > > > > >
> >> > > > > > > > On Thu, Jun 4, 2009 at 8:28 AM, KK <
> dioxide.software@gmail.com
> >> > >
> >> > > > > wrote:
> >> > > > > > > >
> >> > > > > > > > > Thank you all.
> >> > > > > > > > > To be frank I was using Solr in the begining half a
> month
> >> > ago.
> >> > > > The
> >> > > > > > > > > problem[rather bug] with solr was creation of new index
> on
> >> > the
> >> > > > fly.
> >> > > > > > > > Though
> >> > > > > > > > > they have a restful method for teh same, but it was not
> >> > > working.
> >> > > > If
> >> > > > > I
> >> > > > > > > > > remember properly one of Solr commiter "Noble Paul"[I
> dont
> >> > know
> >> > > > his
> >> > > > > > > real
> >> > > > > > > > > name] was trying to help me. I tried many nightly builds
> and
> >> > > > > spending
> >> > > > > > a
> >> > > > > > > > > couple of days stuck at that made me think of lucene and
> I
> >> > > > switched
> >> > > > > > to
> >> > > > > > > > it.
> >> > > > > > > > > Now after working with lucene which gives you full
> control of
> >> > > > > > > everything
> >> > > > > > > > I
> >> > > > > > > > > don't want to switch to Solr.[LOL, to me Solr:Lucene is
> >> > similar
> >> > > > to
> >> > > > > > > > > Window$:Linux, its my view only, though]. Coming back to
> the
> >> > > > point
> >> > > > > as
> >> > > > > > > Uwe
> >> > > > > > > > > mentioned that we can do the same thing in lucene as
> well,
> >> > what
> >> > > > is
> >> > > > > > > > > available
> >> > > > > > > > > in Solr, Solr is based on Lucene only, right?
> >> > > > > > > > > I request Uwe to give me some more ideas on using the
> >> > analyzers
> >> > > > > from
> >> > > > > > > solr
> >> > > > > > > > > that will do the job for me, handling a mix of both
> english
> >> > and
> >> > > > > > > > non-english
> >> > > > > > > > > content.
> >> > > > > > > > > Muir, can you give me a bit detail description of how to
> use
> >> > > the
> >> > > > > > > > > WordDelimiteFilter to do my job.
> >> > > > > > > > > On a side note, I was thingking of writing a simple
> analyzer
> >> > > that
> >> > > > > > will
> >> > > > > > > do
> >> > > > > > > > > the following,
> >> > > > > > > > > #. If the webpage fragment is non-english[for me its
> some
> >> > > indian
> >> > > > > > > > language]
> >> > > > > > > > > then index them as such, no stemming/ stop word removal
> to
> >> > > begin
> >> > > > > > with.
> >> > > > > > > As
> >> > > > > > > > I
> >> > > > > > > > > know its in UCN unicode something like
> >> > > > > \u0021\u0012\u34ae\u0031[just
> >> > > > > > a
> >> > > > > > > > > sample]
> >> > > > > > > > > # If the fragment is english then apply standard
> anlyzing
> >> > > process
> >> > > > > for
> >> > > > > > > > > english content. I've not thought of quering in the same
> way
> >> > as
> >> > > > of
> >> > > > > > now
> >> > > > > > > > i.e
> >> > > > > > > > > mix of non-english and engish words.
> >> > > > > > > > > Now to get all this,
> >> > > > > > > > >  #1. I need some sort of way which will let me know if
> the
> >> > > > content
> >> > > > > is
> >> > > > > > > > > english or not. If not english just add the tokens to
> the
> >> > > > document.
> >> > > > > > Do
> >> > > > > > > we
> >> > > > > > > > > really need language identifiers, as i dont have any
> other
> >> > > > content
> >> > > > > > that
> >> > > > > > > > > uses
> >> > > > > > > > > the same script as english other than those \u1234
> things for
> >> > > my
> >> > > > > > indian
> >> > > > > > > > > language content. Any smart hack/trick for the same?
> >> > > > > > > > >  #2. If the its english apply all normal process and add
> the
> >> > > > > stemmed
> >> > > > > > > > token
> >> > > > > > > > > to document.
> >> > > > > > > > > For all this I was thinking of iterating earch word of
> the
> >> > web
> >> > > > page
> >> > > > > > and
> >> > > > > > > > > apply the above procedure. And finallyadd  the newly
> created
> >> > > > > document
> >> > > > > > > to
> >> > > > > > > > > the
> >> > > > > > > > > index.
> >> > > > > > > > >
> >> > > > > > > > > I would like some one to guide me in this direction. I'm
> >> > pretty
> >> > > > > > people
> >> > > > > > > > must
> >> > > > > > > > > have done similar/same thing earlier, I request them to
> guide
> >> > > me/
> >> > > > > > point
> >> > > > > > > > me
> >> > > > > > > > > to some tutorials for the same.
> >> > > > > > > > > Else help me out writing a custom analyzer only if thats
> not
> >> > > > going
> >> > > > > to
> >> > > > > > > be
> >> > > > > > > > > too
> >> > > > > > > > > complex. LOL, I'm a new user to lucene and know basics
> of
> >> > Java
> >> > > > > > coding.
> >> > > > > > > > > Thank you very much.
> >> > > > > > > > >
> >> > > > > > > > > --KK.
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > On Thu, Jun 4, 2009 at 5:30 PM, Robert Muir <
> >> > rcmuir@gmail.com>
> >> > > > > > wrote:
> >> > > > > > > > >
> >> > > > > > > > > > yes this is true. for starters KK, might be good to
> startup
> >> > > > solr
> >> > > > > > and
> >> > > > > > > > look
> >> > > > > > > > > > at
> >> > > > > > > > > >
> http://localhost:8983/solr/admin/analysis.jsp?highlight=on
> >> > > > > > > > > >
> >> > > > > > > > > > if you want to stick with lucene, the
> WordDelimiterFilter
> >> > is
> >> > > > the
> >> > > > > > > piece
> >> > > > > > > > > you
> >> > > > > > > > > > will want for your text, mainly for punctuation but
> also
> >> > for
> >> > > > > format
> >> > > > > > > > > > characters such as ZWJ/ZWNJ.
> >> > > > > > > > > >
> >> > > > > > > > > > On Thu, Jun 4, 2009 at 7:51 AM, Uwe Schindler <
> >> > > uwe@thetaphi.de
> >> > > > >
> >> > > > > > > wrote:
> >> > > > > > > > > >
> >> > > > > > > > > > > You can also re-use the solr analyzers, as far as I
> found
> >> > > > out.
> >> > > > > > > There
> >> > > > > > > > is
> >> > > > > > > > > > an
> >> > > > > > > > > > > issue in jIRA/discussion on java-dev to merge them.
> >> > > > > > > > > > >
> >> > > > > > > > > > > -----
> >> > > > > > > > > > > Uwe Schindler
> >> > > > > > > > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> >> > > > > > > > > > > http://www.thetaphi.de
> >> > > > > > > > > > > eMail: uwe@thetaphi.de
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > > -----Original Message-----
> >> > > > > > > > > > > > From: Robert Muir [mailto:rcmuir@gmail.com]
> >> > > > > > > > > > > > Sent: Thursday, June 04, 2009 1:18 PM
> >> > > > > > > > > > > > To: java-user@lucene.apache.org
> >> > > > > > > > > > > > Subject: Re: How to support stemming and case
> folding
> >> > for
> >> > > > > > english
> >> > > > > > > > > > content
> >> > > > > > > > > > > > mixed with non-english content?
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > KK, ok, so you only really want to stem the
> english.
> >> > This
> >> > > > is
> >> > > > > > > good.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > Is it possible for you to consider using solr?
> solr's
> >> > > > default
> >> > > > > > > > > analyzer
> >> > > > > > > > > > > for
> >> > > > > > > > > > > > type 'text' will be good for your case. it will do
> the
> >> > > > > > following
> >> > > > > > > > > > > > 1. tokenize on whitespace
> >> > > > > > > > > > > > 2. handle both indian language and english
> punctuation
> >> > > > > > > > > > > > 3. lowercase the english.
> >> > > > > > > > > > > > 4. stem the english.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > try a nightly build,
> >> > > > > > > > > > >
> http://people.apache.org/builds/lucene/solr/nightly/
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > On Thu, Jun 4, 2009 at 1:12 AM, KK <
> >> > > > > dioxide.software@gmail.com
> >> > > > > > >
> >> > > > > > > > > wrote:
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > > Muir, thanks for your response.
> >> > > > > > > > > > > > > I'm indexing indian language web pages which has
> got
> >> > > > > descent
> >> > > > > > > > amount
> >> > > > > > > > > > of
> >> > > > > > > > > > > > > english content mixed with therein. For the time
> >> > being
> >> > > > I'm
> >> > > > > > not
> >> > > > > > > > > going
> >> > > > > > > > > > to
> >> > > > > > > > > > > > use
> >> > > > > > > > > > > > > any stemmers as we don't have standard stemmers
> for
> >> > > > indian
> >> > > > > > > > > languages
> >> > > > > > > > > > .
> >> > > > > > > > > > > > So
> >> > > > > > > > > > > > > what I want to do is like this,
> >> > > > > > > > > > > > > Say I've a web page having hindi content with 5%
> >> > > english
> >> > > > > > > content.
> >> > > > > > > > > > Then
> >> > > > > > > > > > > > for
> >> > > > > > > > > > > > > hindi I want to use the basic white space
> analyzer as
> >> > > we
> >> > > > > dont
> >> > > > > > > > have
> >> > > > > > > > > > > > stemmers
> >> > > > > > > > > > > > > for this as I mentioned earlier and whereever
> english
> >> > > > > appears
> >> > > > > > I
> >> > > > > > > > > want
> >> > > > > > > > > > > > them
> >> > > > > > > > > > > > > to
> >> > > > > > > > > > > > > be stemmed tokenized etc[the standard process
> used
> >> > for
> >> > > > > > english
> >> > > > > > > > > > > content].
> >> > > > > > > > > > > > As
> >> > > > > > > > > > > > > of now I'm using whitespace analyzer for the
> full
> >> > > content
> >> > > > > > which
> >> > > > > > > > > > doesnot
> >> > > > > > > > > > > > > support case folding, stemming etc for teh
> content.
> >> > So
> >> > > if
> >> > > > > > there
> >> > > > > > > > is
> >> > > > > > > > > an
> >> > > > > > > > > > > > > english word say "Detection" indexed as such
> then
> >> > > > searching
> >> > > > > > for
> >> > > > > > > > > > > > detection
> >> > > > > > > > > > > > > or
> >> > > > > > > > > > > > > detect is not giving any results, which is the
> >> > expected
> >> > > > > > > behavior,
> >> > > > > > > > > but
> >> > > > > > > > > > I
> >> > > > > > > > > > > > > want
> >> > > > > > > > > > > > > this kind of queries to give results.
> >> > > > > > > > > > > > > I hope I made it clear. Let me know any ideas on
> >> > doing
> >> > > > the
> >> > > > > > > same.
> >> > > > > > > > > And
> >> > > > > > > > > > > one
> >> > > > > > > > > > > > > more thing, I'm storing the full webpage content
> >> > under
> >> > > a
> >> > > > > > single
> >> > > > > > > > > > field,
> >> > > > > > > > > > > I
> >> > > > > > > > > > > > > hope this will not make any difference, right?
> >> > > > > > > > > > > > > It seems I've to use language identifiers, but
> do we
> >> > > > really
> >> > > > > > > need
> >> > > > > > > > > > that?
> >> > > > > > > > > > > > > Because we've only non-english content mixed
> with
> >> > > > > english[and
> >> > > > > > > not
> >> > > > > > > > > > > french
> >> > > > > > > > > > > > or
> >> > > > > > > > > > > > > russian etc].
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > What is the best way of approaching the problem?
> Any
> >> > > > > > thoughts!
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > Thanks,
> >> > > > > > > > > > > > > KK.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > On Wed, Jun 3, 2009 at 9:42 PM, Robert Muir <
> >> > > > > > rcmuir@gmail.com>
> >> > > > > > > > > > wrote:
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > > KK, is all of your latin script text actually
> >> > > english?
> >> > > > Is
> >> > > > > > > there
> >> > > > > > > > > > stuff
> >> > > > > > > > > > > > > like
> >> > > > > > > > > > > > > > german or french mixed in?
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > And for your non-english content (your
> examples
> >> > have
> >> > > > been
> >> > > > > > > > indian
> >> > > > > > > > > > > > writing
> >> > > > > > > > > > > > > > systems), is it generally true that if you had
> >> > > > > devanagari,
> >> > > > > > > you
> >> > > > > > > > > can
> >> > > > > > > > > > > > assume
> >> > > > > > > > > > > > > > its hindi? or is there stuff like marathi
> mixed in?
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Reason I say this is to invoke the right
> stemmers,
> >> > > you
> >> > > > > > really
> >> > > > > > > > > need
> >> > > > > > > > > > > > some
> >> > > > > > > > > > > > > > language detection, but perhaps in your case
> you
> >> > can
> >> > > > > cheat
> >> > > > > > > and
> >> > > > > > > > > > detect
> >> > > > > > > > > > > > > this
> >> > > > > > > > > > > > > > based on scripts...
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Thanks,
> >> > > > > > > > > > > > > > Robert
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > On Wed, Jun 3, 2009 at 10:15 AM, KK <
> >> > > > > > > > dioxide.software@gmail.com>
> >> > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Hi All,
> >> > > > > > > > > > > > > > > I'm indexing some non-english content. But
> the
> >> > page
> >> > > > > also
> >> > > > > > > > > contains
> >> > > > > > > > > > > > > english
> >> > > > > > > > > > > > > > > content. As of now I'm using
> WhitespaceAnalyzer
> >> > for
> >> > > > all
> >> > > > > > > > content
> >> > > > > > > > > > and
> >> > > > > > > > > > > > I'm
> >> > > > > > > > > > > > > > > storing the full webpage content under a
> single
> >> > > > filed.
> >> > > > > > Now
> >> > > > > > > we
> >> > > > > > > > > > > > require
> >> > > > > > > > > > > > > to
> >> > > > > > > > > > > > > > > support case folding and stemmming for the
> >> > english
> >> > > > > > content
> >> > > > > > > > > > > > intermingled
> >> > > > > > > > > > > > > > > with
> >> > > > > > > > > > > > > > > non-english content. I must metion that we
> dont
> >> > > have
> >> > > > > > > stemming
> >> > > > > > > > > and
> >> > > > > > > > > > > > case
> >> > > > > > > > > > > > > > > folding for these non-english content. I'm
> stuck
> >> > > with
> >> > > > > > this.
> >> > > > > > > > > Some
> >> > > > > > > > > > > one
> >> > > > > > > > > > > > do
> >> > > > > > > > > > > > > > let
> >> > > > > > > > > > > > > > > me know how to proceed for fixing this
> issue.
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Thanks,
> >> > > > > > > > > > > > > > > KK.
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > --
> >> > > > > > > > > > > > > > Robert Muir
> >> > > > > > > > > > > > > > rcmuir@gmail.com
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > --
> >> > > > > > > > > > > > Robert Muir
> >> > > > > > > > > > > > rcmuir@gmail.com
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > >
> >> > >
> ---------------------------------------------------------------------
> >> > > > > > > > > > > To unsubscribe, e-mail:
> >> > > > > java-user-unsubscribe@lucene.apache.org
> >> > > > > > > > > > > For additional commands, e-mail:
> >> > > > > > java-user-help@lucene.apache.org
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > > > --
> >> > > > > > > > > > Robert Muir
> >> > > > > > > > > > rcmuir@gmail.com
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > --
> >> > > > > > > > Robert Muir
> >> > > > > > > > rcmuir@gmail.com
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > > --
> >> > > > > > Robert Muir
> >> > > > > > rcmuir@gmail.com
> >> > > > > >
> >> > > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > > --
> >> > > > Robert Muir
> >> > > > rcmuir@gmail.com
> >> > > >
> >> > >
> >> >
> >> >
> >> >
> >> > --
> >> > Robert Muir
> >> > rcmuir@gmail.com
> >> >
> >
> >
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com
> >
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: How to support stemming and case folding for english content mixed with non-english content?

Posted by KK <di...@gmail.com>.
Note: I request Solr users to go through this mail and let me know their ideas.

Thanks Yonik, you rightly pointed it out. That clearly says that the way I'm
trying to mimic the default behaviour of Solr indexing/searching in Lucene
is wrong, right?
I downloaded the latest version of the Solr nightly on May 20 [at that time
I was using Solr; I've since switched to Lucene]. I hope the issue has been
fixed since that version. Anyway, I'm going to download the latest nightly
build today and try it out. I hope using the nightly build instead of
getting the src from the latest trunk is more or less the same [provided I
download the latest nightly build, right?], as I don't know much about
getting/compiling the src from the Solr trunk. Do let me know if I have to
use the trunk anyway; in that case I'm ready to spend time to get that done.
BTW, Yonik, as per the basic Solr schema.xml file, the analyzers/filters
used by default are the ones below; correct me if I'm wrong. This is the
snippet that configures the filters used for indexing in Solr:
------------------------------------

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory"
synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <!-- Case insensitive stop word removal.
             enablePositionIncrements=true ensures that a 'gap' is left to
             allow for accurate phrase queries.
        -->
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
----------------------------------------
and this is the part used for Solr querying,

<analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

To summarize, the chains are like this:
Indexing:
1. solr.WhitespaceTokenizerFactory  -- the tokenizer; the entries that
follow are filters, as is clear from the names themselves
2. solr.SynonymFilterFactory
3. solr.StopFilterFactory
4. solr.WordDelimiterFilterFactory  (with the options:
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1")

5. solr.LowerCaseFilterFactory
6. solr.EnglishPorterFilterFactory
7. solr.RemoveDuplicatesTokenFilterFactory

Querying:
1. solr.WhitespaceTokenizerFactory
2. solr.SynonymFilterFactory
3. solr.StopFilterFactory
4. solr.WordDelimiterFilterFactory  (options: generateWordParts="1"
generateNumberParts="1" catenateWords="0" catenateNumbers="0"
catenateAll="0" splitOnCaseChange="1")

5. solr.LowerCaseFilterFactory
6. solr.EnglishPorterFilterFactory
7. solr.RemoveDuplicatesTokenFilterFactory

Now the filters/analyzers I used to try to mimic the above behavior of
Solr [in Lucene] are shown below.
I pulled the WordDelimiterFilter out of Solr, and my custom analyzer for
indexing is like this:

import java.io.Reader;
import org.apache.lucene.analysis.*;
import org.apache.solr.analysis.WordDelimiterFilter; // copied from the Solr nightly

/**
 * Analyzer for Indian language (index-time chain).
 */
public class IndicAnalyzerIndex extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = new WhitespaceTokenizer(reader);
    // I tried the 7-arg form (ts, 1, 1, 1, 1, 0, 1) to pass
    // splitOnCaseChange explicitly, but found no such constructor, so I
    // use the deprecated 6-arg one, which already behaves as if
    // splitOnCaseChange is set to 1, so we get the same thing this way.
    ts = new WordDelimiterFilter(ts, 1, 1, 1, 1, 0);
    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
    ts = new LowerCaseFilter(ts);
    ts = new PorterStemFilter(ts);
    return ts;
  }
}

and for querying, this is the code [same imports as above]:
/**
 * Analyzer for Indian language (query-time chain).
 */
public class IndicAnalyzerQuery extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = new WhitespaceTokenizer(reader);
    // Same chain, but with catenateWords/catenateNumbers turned off.
    ts = new WordDelimiterFilter(ts, 1, 1, 0, 0, 0);
    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
    ts = new LowerCaseFilter(ts);
    ts = new PorterStemFilter(ts);
    return ts;
  }
}

The only difference between the two is the WordDelimiterFilter with
different options... Comparing the analyzers/filters used by Solr with the
above custom analyzers, we can see that I'm not using the SynonymFilter and
the RemoveDuplicatesTokenFilter. I assume these make sense for English
content only, and using/skipping them will not make any difference to my
non-English content.
Can someone with knowledge of the Solr/Lucene source code point out what
exactly is going wrong in my case when I'm trying to do the same thing in
Lucene? It seems I'm missing some minor yet important thing... hence my
custom IndicAnalyzer is not behaving the way Solr's default analyzer works,
while Yonik clearly showed that Solr is smart enough to detect Unicode word
endings and behaves as expected.
Any idea on this issue is welcome. Help me fix the issue. BTW, Lucene
people, when is that basic WordDelimiterFilter going to be added to Lucene
as well? Any idea?
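
In case it helps anyone reproduce this, a minimal sketch of how the two
analyzers get wired in (a Lucene 2.4-era API is assumed here; the field
name "content" and the sample query are made up):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class IndicSearchDemo {
  public static void main(String[] args) throws Exception {
    Directory dir = new RAMDirectory();

    // Index with the index-time analyzer (catenate options on).
    Analyzer indexAnalyzer = new IndicAnalyzerIndex();
    IndexWriter writer = new IndexWriter(dir, indexAnalyzer, true,
        IndexWriter.MaxFieldLength.UNLIMITED);
    // ... writer.addDocument(...) with a "content" field goes here ...
    writer.close();

    // Query with the query-time analyzer (catenate options off).
    QueryParser parser = new QueryParser("content", new IndicAnalyzerQuery());
    Query q = parser.parse("detection");
    System.out.println(q);
  }
}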

Thanks,
KK.

On Tue, Jun 9, 2009 at 7:01 PM, Yonik Seeley <yo...@lucidimagination.com> wrote:

> I just cut'n'pasted your word into Solr... it worked fine (it didn't
> split the word).
> Make sure you're using the latest from the trunk version of Solr...
> this was fixed since 1.3
>
> http://localhost:8983/solr/select?q=साल&debugQuery=true
> [...]
> <lst name="debug">
>  <str name="rawquerystring">साल</str>
>  <str name="querystring">साल</str>
>  <str name="parsedquery">text:साल</str>
>  <str name="parsedquery_toString">text:साल</str>
>
> -Yonik
>
>
> On Tue, Jun 9, 2009 at 7:48 AM, KK <di...@gmail.com> wrote:
> > Hi Robert, I tried a sample code to check whats the reason. The
> > worddelimiterfilter uses isLetter() method to tokenize, and for hindi
> words
> > some parts of word are not actually letters but just part of the word[but
> > that doesnot mean they can be used as word delimiters], since they are
> not
> > letters isLetter() returns false and the word is getting breaked around
> > that. This is some sample code with a hindi word pronounced saal[meaning
> > year in english],
> >
> > import java.lang.String;
> >
> > public class HindiUnicodeTest {
> >    public static void main(String args[]) {
> >        String hindiStr = "साल";
> >        int length = hindiStr.length();
> >        System.out.println("str length " + length);
> >        for (int i=0; i<length; i++) {
> >            System.out.println(hindiStr.charAt(i) + " is " +
> > Character.isLetter(hindiStr.charAt(i)));
> >        }
> >
> >    }
> > }
> >
> > Running this gives this output,
> > str length 3
> > स is true
> > ा is false
> > ल is true
> >
> > As you can see the second one is false, which says that it is not a
> letter
> > but this makes worddelimiterfilter break/tokenize around the word. I even
> > tried to use my custom parser[which I mentioned earlier] and tried to
> print
> > the string that is the output after the query getting parsed, and what I
> > found is that if I send the above hindi word then the query string after
> > being parsed is something like this,
> > Parsed Query string: स ल
> > it essentialy removes the non-letter character[the second one], and it
> seems
> > it treats them as separate and whenever thse two characters appear
> adjacent,
> > they are in th top of result set, also whereever these two letters appers
> in
> > the doc, it says they are part of the result set [and hence highlights
> > them].
> >
> > I hope I made it clear. Do let me if some more information is required.
> >
> > Thanks,
> > KK.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: How to support stemming and case folding for english content mixed with non-english content?

Posted by KK <di...@gmail.com>.
Thank you very much, Yonik. I downloaded the latest Solr build, pulled out
the WordDelimiterFilter, and used it with the same options as the Solr
defaults, and it worked like a charm.  Thanks to Robert also.

Thanks,
KK

On Tue, Jun 9, 2009 at 7:01 PM, Yonik Seeley <yo...@lucidimagination.com> wrote:

> I just cut'n'pasted your word into Solr... it worked fine (it didn't
> split the word).
> Make sure you're using the latest from the trunk version of Solr...
> this was fixed since 1.3
>
> http://localhost:8983/solr/select?q=साल&debugQuery=true
> [...]
> <lst name="debug">
>  <str name="rawquerystring">साल</str>
>  <str name="querystring">साल</str>
>  <str name="parsedquery">text:साल</str>
>  <str name="parsedquery_toString">text:साल</str>
>
> -Yonik
>
>
> On Tue, Jun 9, 2009 at 7:48 AM, KK <di...@gmail.com> wrote:
> > Hi Robert, I tried a sample code to check whats the reason. The
> > worddelimiterfilter uses isLetter() method to tokenize, and for hindi
> words
> > some parts of word are not actually letters but just part of the word[but
> > that doesnot mean they can be used as word delimiters], since they are
> not
> > letters isLetter() returns false and the word is getting breaked around
> > that. This is some sample code with a hindi word pronounced saal[meaning
> > year in english],
> >
> > import java.lang.String;
> >
> > public class HindiUnicodeTest {
> >    public static void main(String args[]) {
> >        String hindiStr = "साल";
> >        int length = hindiStr.length();
> >        System.out.println("str length " + length);
> >        for (int i=0; i<length; i++) {
> >            System.out.println(hindiStr.charAt(i) + " is " +
> > Character.isLetter(hindiStr.charAt(i)));
> >        }
> >
> >    }
> > }
> >
> > Running this gives this output,
> > str length 3
> > स is true
> > ा is false
> > ल is true
> >
> > As you can see the second one is false, which says that it is not a
> letter
> > but this makes worddelimiterfilter break/tokenize around the word. I even
> > tried to use my custom parser[which I mentioned earlier] and tried to
> print
> > the string that is the output after the query getting parsed, and what I
> > found is that if I send the above hindi word then the query string after
> > being parsed is something like this,
> > Parsed Query string: स ल
> > it essentialy removes the non-letter character[the second one], and it
> seems
> > it treats them as separate and whenever thse two characters appear
> adjacent,
> > they are in th top of result set, also whereever these two letters appers
> in
> > the doc, it says they are part of the result set [and hence highlights
> > them].
> >
> > I hope I made it clear. Do let me if some more information is required.
> >
> > Thanks,
> > KK.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: How to support stemming and case folding for english content mixed with non-english content?

Posted by Yonik Seeley <yo...@lucidimagination.com>.
I just cut'n'pasted your word into Solr... it worked fine (it didn't
split the word).
Make sure you're using the latest from the trunk version of Solr...
this was fixed since 1.3

http://localhost:8983/solr/select?q=साल&debugQuery=true
[...]
<lst name="debug">
  <str name="rawquerystring">साल</str>
  <str name="querystring">साल</str>
  <str name="parsedquery">text:साल</str>
  <str name="parsedquery_toString">text:साल</str>

-Yonik


On Tue, Jun 9, 2009 at 7:48 AM, KK <di...@gmail.com> wrote:
> Hi Robert, I tried a sample code to check whats the reason. The
> worddelimiterfilter uses isLetter() method to tokenize, and for hindi words
> some parts of word are not actually letters but just part of the word[but
> that doesnot mean they can be used as word delimiters], since they are not
> letters isLetter() returns false and the word is getting breaked around
> that. This is some sample code with a hindi word pronounced saal[meaning
> year in english],
>
> import java.lang.String;
>
> public class HindiUnicodeTest {
>    public static void main(String args[]) {
>        String hindiStr = "साल";
>        int length = hindiStr.length();
>        System.out.println("str length " + length);
>        for (int i=0; i<length; i++) {
>            System.out.println(hindiStr.charAt(i) + " is " +
> Character.isLetter(hindiStr.charAt(i)));
>        }
>
>    }
> }
>
> Running this gives this output,
> str length 3
> स is true
> ा is false
> ल is true
>
> As you can see the second one is false, which says that it is not a letter
> but this makes worddelimiterfilter break/tokenize around the word. I even
> tried to use my custom parser[which I mentioned earlier] and tried to print
> the string that is the output after the query getting parsed, and what I
> found is that if I send the above hindi word then the query string after
> being parsed is something like this,
> Parsed Query string: स ल
> it essentialy removes the non-letter character[the second one], and it seems
> it treats them as separate and whenever thse two characters appear adjacent,
> they are in th top of result set, also whereever these two letters appers in
> the doc, it says they are part of the result set [and hence highlights
> them].
>
> I hope I made it clear. Do let me if some more information is required.
>
> Thanks,
> KK.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: How to support stemming and case folding for english content mixed with non-english content?

Posted by KK <di...@gmail.com>.
Hi Robert, I tried some sample code to check what the reason is. The
WordDelimiterFilter uses the isLetter() method to tokenize, and in Hindi
words some characters are not actually letters but are still part of the
word [which does not mean they can be used as word delimiters]. Since they
are not letters, isLetter() returns false and the word gets broken around
them. Here is some sample code with a Hindi word pronounced saal [meaning
year in English]:

public class HindiUnicodeTest {
    public static void main(String[] args) {
        String hindiStr = "साल";
        int length = hindiStr.length();
        System.out.println("str length " + length);
        // Print whether each UTF-16 code unit is classified as a letter.
        for (int i = 0; i < length; i++) {
            System.out.println(hindiStr.charAt(i) + " is " +
                Character.isLetter(hindiStr.charAt(i)));
        }
    }
}

Running this gives this output,
str length 3
स is true
ा is false
ल is true

As you can see, the second one is false, which says that it is not a
letter, but this is what makes the WordDelimiterFilter break/tokenize
within the word. I even tried to use my custom parser [which I mentioned
earlier] and to print the string that comes out after the query is parsed,
and what I found is that if I send the above Hindi word, the parsed query
string looks something like this:
Parsed Query string: स ल
It essentially removes the non-letter character [the second one] and seems
to treat the two remaining characters as separate: whenever these two
characters appear adjacent, they are at the top of the result set, and
wherever these two letters appear in the doc, it says they are part of the
result set [and hence highlights them].
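
As a side note, a minimal follow-up check (my own sketch) suggests why: the
vowel sign is a combining mark in Unicode terms (general category Mn or Mc)
rather than a letter, so a tokenizer that also accepted combining marks as
word characters would keep the word intact:

public class DevanagariMarkCheck {
    public static void main(String[] args) {
        String hindiStr = "साल";
        for (int i = 0; i < hindiStr.length(); i++) {
            char c = hindiStr.charAt(i);
            int type = Character.getType(c);
            // Devanagari dependent vowel signs are combining marks
            // (NON_SPACING_MARK or COMBINING_SPACING_MARK), not letters,
            // so a plain isLetter() test splits the word around them.
            boolean wordChar = Character.isLetter(c)
                || type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK;
            System.out.println(c + " type=" + type
                + " wordChar=" + wordChar);
        }
    }
}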

I hope I made it clear. Do let me know if more information is required.

Thanks,
KK.

On Mon, Jun 8, 2009 at 3:34 PM, Robert Muir <rc...@gmail.com> wrote:

> KK can you give me an example of some indian text for which it is doing
> this?
>
> Thanks!
>
> On Mon, Jun 8, 2009 at 1:03 AM, KK<di...@gmail.com> wrote:
> > Hi Robert,
> > The problem is that worddelimiterfilter is doing its job for english
> content
> > but for non-english indian content which are unicoded it highlights the
> > searched word but alongwith that it also highlights the characters of
> that
> > word which was not hapenning without worddelimfilter, thats my concern.
> Say
> > for example I searched for a hindi word say "xyz ab" [assume these are in
> > hindi]  then in the search results it highlights these words but it also
> > highlights x/y/z/a/b whereever these letters appear which is obiviously
> > sounds bad. it should only highlight words not the letters therein. I
> hope I
> > made it clear. What could be the reason for this? Any idea on fixing the
> > same.
> >
> > Thanks,
> > KK
> >
> > On Sat, Jun 6, 2009 at 9:45 PM, Robert Muir <rc...@gmail.com> wrote:
> >
> >> kk, i haven't had that experience with worddelimiterfilter on indian
> >> languages, is it possible you could provide me an example of how its
> >> creating nuisance?
> >>
> >> [...]
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>
>
>

Re: How to support stemming and case folding for english content mixed with non-english content?

Posted by Robert Muir <rc...@gmail.com>.
KK, can you give me an example of some Indian text for which it is doing this?
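
In the meantime, a quick token dump along the lines of the sketch below
would show exactly which terms and offsets your analyzer emits; the offsets
are what the highlighter keys off, so single-character tokens in the output
would explain single-character highlighting. This is only a rough sketch
against the Lucene 2.4-era Token API, and the sample string, field name,
and class name are placeholders rather than anything from your setup:

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class TokenDump {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new IndicAnalyzer(); // the analyzer from this thread
    // Placeholder input: English, Devanagari, and some punctuation.
    String text = "Detection \u0915\u093F\u0924\u093E\u092C test-123";
    TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
    Token reusable = new Token();
    // Print each emitted term with its character offsets.
    for (Token t = ts.next(reusable); t != null; t = ts.next(reusable)) {
      System.out.println(t.term() + " [" + t.startOffset() + "-" + t.endOffset() + "]");
    }
  }
}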

Thanks!

On Mon, Jun 8, 2009 at 1:03 AM, KK<di...@gmail.com> wrote:
> Hi Robert,
> The problem is that WordDelimiterFilter is doing its job for English
> content, but for the non-English Indian content, which is Unicode, it
> highlights the searched word and along with it also highlights the
> individual characters of that word, which was not happening without
> WordDelimiterFilter; that's my concern. Say, for example, I searched for a
> Hindi word "xyz ab" [assume these are in Hindi]; then in the search results
> it highlights these words, but it also highlights x/y/z/a/b wherever these
> letters appear, which obviously looks bad. It should only highlight whole
> words, not the letters within them. I hope I made it clear. What could be
> the reason for this? Any ideas on fixing it?
>
> Thanks,
> KK
>
> On Sat, Jun 6, 2009 at 9:45 PM, Robert Muir <rc...@gmail.com> wrote:
>
>> kk, I haven't had that experience with worddelimiterfilter on Indian
>> languages. Is it possible you could provide me an example of how it's
>> creating a nuisance?
>>
>> On Sat, Jun 6, 2009 at 9:42 AM, KK<di...@gmail.com> wrote:
>> > Robert, I tried to use WordDelimiterFilter from solr-nightly by putting
>> > it in my working directory for this project, and I set the parameters
>> > as you told me. I must accept that it's splitting words around those
>> > chars [like . @ etc.], but along with that it's messing with other
>> > non-English/Unicode content, and that's creating a nuisance. I don't
>> > want WordDelimiterFilter to fiddle around with my non-English content.
>> > This is what I'm doing:
>> >
>> > /**
>> >  * Analyzer for Indian language.
>> >  */
>> > public class IndicAnalyzer extends Analyzer {
>> >   public TokenStream tokenStream(String fieldName, Reader reader) {
>> >     TokenStream ts = new WhitespaceTokenizer(reader);
>> >     ts = new WordDelimiterFilter(ts, 1, 1, 1, 1, 0);
>> >     ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
>> >     ts = new LowerCaseFilter(ts);
>> >     ts = new PorterStemFilter(ts);
>> >     return ts;
>> >   }
>> > }
>> >
>> > I've to use the deprecated API for setting the 5 values, that's fine,
>> > but somehow it's messing with Unicode content. How do I get rid of
>> > that? Any thoughts? It seems setting those values in some proper way
>> > might fix the problem; I'm not sure, though.
>> >
>> > Thanks,
>> > KK.
>> >
>> >
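A side note on that snippet for anyone copying it: the five bare ints in
the WordDelimiterFilter constructor are easy to misread, so below is the
same analyzer sketched with named flags, folding in the defaults suggested
earlier in the thread (splitOnCaseChange=1, generateNumberParts=1,
catenateWords=1, generateWordParts=1, catenateAll=0, catenateNumbers=1).
Treat it strictly as a sketch: the argument order is an assumption about
the solr-nightly constructor of this vintage, so verify it against the
WordDelimiterFilter source, and the class is package-private, which is why
this version sits in org.apache.solr.analysis, as discussed earlier.
Putting LowerCaseFilter ahead of StopFilter also ensures capitalized stop
words like "The" are actually dropped.

package org.apache.solr.analysis; // WordDelimiterFilter is package-private

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class IndicAnalyzer extends Analyzer {
  // Flag values taken from the defaults quoted earlier in this thread.
  private static final int GENERATE_WORD_PARTS   = 1;
  private static final int GENERATE_NUMBER_PARTS = 1;
  private static final int CATENATE_WORDS        = 1;
  private static final int CATENATE_NUMBERS      = 1;
  private static final int CATENATE_ALL          = 0;
  private static final int SPLIT_ON_CASE_CHANGE  = 1;

  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = new WhitespaceTokenizer(reader);
    // Assumed argument order; check the constructor before relying on it.
    ts = new WordDelimiterFilter(ts, GENERATE_WORD_PARTS, GENERATE_NUMBER_PARTS,
        CATENATE_WORDS, CATENATE_NUMBERS, CATENATE_ALL, SPLIT_ON_CASE_CHANGE);
    ts = new LowerCaseFilter(ts);  // lowercase before stop-word removal
    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
    ts = new PorterStemFilter(ts);
    return ts;
  }
}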
>> > [...]



-- 
Robert Muir
rcmuir@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: How to support stemming and case folding for english content mixed with non-english content?

Posted by KK <di...@gmail.com>.
Hi Robert,
The problem is that WordDelimiterFilter is doing its job for English
content, but for the non-English Indian content, which is Unicode, it
highlights the searched word and along with it also highlights the
individual characters of that word, which was not happening without
WordDelimiterFilter; that's my concern. Say, for example, I search for a
Hindi word "xyz ab" [assume these are in Hindi]; then in the search
results it highlights these words, but it also highlights x/y/z/a/b
wherever those letters appear, which obviously looks bad. It should only
highlight whole words, not the letters within them. I hope I made it
clear. What could be the reason for this? Any ideas on fixing this?
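
To show the nuisance concretely, this is roughly how I am dumping the
tokens: an untested sketch against the Lucene 2.4-era TokenStream API,
where IndicAnalyzer is the class from my earlier mail and the \uXXXX
escapes stand for arbitrary Hindi sample text. If WordDelimiterFilter is
splitting the Hindi words, the output should show single-character tokens
next to the full words, which would explain why the highlighter marks
single letters:

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class DumpTokens {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new IndicAnalyzer();
    TokenStream ts = analyzer.tokenStream("content",
        new StringReader("Detection \u0915\u093E\u0930"));
    Token token = new Token();
    // print every token the analyzer emits, in order
    while ((token = ts.next(token)) != null) {
      System.out.println("token: [" + token.term() + "]");
    }
  }
}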

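And the only workaround I can think of so far is a guard filter that
applies the delimiting only to plain-ASCII tokens and passes my Unicode
tokens through untouched. This is an untested sketch against the old
next() API; LatinOnlyDelimiterFilter is a made-up name, the split regex
is only a crude stand-in for what WordDelimiterFilter really does, and
the offsets are kept coarse for brevity:

import java.io.IOException;
import java.util.LinkedList;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public final class LatinOnlyDelimiterFilter extends TokenFilter {
  private final LinkedList<Token> pending = new LinkedList<Token>();

  public LatinOnlyDelimiterFilter(TokenStream input) {
    super(input);
  }

  public Token next() throws IOException {
    if (!pending.isEmpty())
      return pending.removeFirst();
    Token token = input.next();
    if (token == null)
      return null;
    String text = token.term();
    if (!isAscii(text))
      return token; // leave Indic/other Unicode tokens completely alone
    // crude stand-in for word delimiting: split the Latin token on
    // anything that is not a letter or digit (e.g. '.', '@', '-')
    String[] parts = text.split("[^A-Za-z0-9]+");
    for (int i = 0; i < parts.length; i++) {
      if (parts[i].length() > 0) {
        // every part reuses the whole token's offsets, for brevity
        pending.add(new Token(parts[i], token.startOffset(),
            token.endOffset()));
      }
    }
    return pending.isEmpty() ? next() : pending.removeFirst();
  }

  private static boolean isAscii(String s) {
    for (int i = 0; i < s.length(); i++) {
      if (s.charAt(i) > 0x007F)
        return false;
    }
    return true;
  }
}

It would just replace the WordDelimiterFilter line in the analyzer, i.e.
ts = new LatinOnlyDelimiterFilter(ts); I am not sure this is better than
finding the right flag settings, though.
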
Thanks,
KK

On Sat, Jun 6, 2009 at 9:45 PM, Robert Muir <rc...@gmail.com> wrote:

> kk, I haven't had that experience with WordDelimiterFilter on Indian
> languages. Is it possible you could provide me an example of how it's
> creating a nuisance?
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: How to support stemming and case folding for english content mixed with non-english content?

Posted by Robert Muir <rc...@gmail.com>.
kk, I haven't had that experience with WordDelimiterFilter on Indian
languages. Is it possible you could provide me an example of how it's
creating a nuisance?


-- 
Robert Muir
rcmuir@gmail.com



Re: How to support stemming and case folding for english content mixed with non-english content?

Posted by KK <di...@gmail.com>.
Robert, I tried to use WordDelimiterFilter from solr-nightly by putting it
in my working directory for this project, and I set the parameters as you
told me. I must admit that it is splitting words around those chars [like .
@ etc.], but along with that it is messing with the other
non-english/unicode content, and that is creating a nuisance. I don't want
WordDelimiterFilter to fiddle around with my non-english content.
This is what I'm doing,
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.solr.analysis.WordDelimiterFilter;

/**
 * Analyzer for Indian language.
 */
public class IndicAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = new WhitespaceTokenizer(reader);
    // deprecated 5-int constructor: generateWordParts, generateNumberParts,
    // catenateWords, catenateNumbers, catenateAll
    ts = new WordDelimiterFilter(ts, 1, 1, 1, 1, 0);
    // lowercase before stop removal so capitalized stop words are caught too
    ts = new LowerCaseFilter(ts);
    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
    ts = new PorterStemFilter(ts);
    return ts;
  }
}

I have to use the deprecated API for setting the 5 values; that's fine, but
somehow it's messing with the unicode content. How do I get rid of that? Any
thoughts? It seems that setting those values in some proper way might fix
the problem, though I'm not sure.
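
For anyone who wants to reproduce this, a tiny token-dump class along these
lines should show what the chain emits (an untested sketch: it assumes the
old Lucene 2.x TokenStream API with next()/termText(), and the mixed-script
sample string is made up):

import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class TokenDump {
  public static void main(String[] args) throws Exception {
    // made-up sample: english words plus two Devanagari letters
    String text = "Detection e-mail \u0915\u0916";
    TokenStream ts = new IndicAnalyzer().tokenStream("content",
        new StringReader(text));
    // old-style iteration; newer lucene versions use incrementToken()
    Token t;
    while ((t = ts.next()) != null) {
      System.out.println(t.termText());
    }
  }
}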

Thanks,
KK.


On Fri, Jun 5, 2009 at 7:37 PM, Robert Muir <rc...@gmail.com> wrote:

> kk an easier solution to your first problem is to use
> worddelimiterfilterfactory if possible... you can get an instance of
> worddelimiter filter from that.
>
> thanks,
> robert

Re: How to support stemming and case folding for english content mixed with non-english content?

Posted by Robert Muir <rc...@gmail.com>.
kk an easier solution to your first problem is to use
worddelimiterfilterfactory if possible... you can get an instance of
worddelimiter filter from that.
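
something like this should work (a rough, untested sketch; it assumes the
solr 1.x TokenFilterFactory API with init(Map) and create(TokenStream)):

import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.solr.analysis.WordDelimiterFilterFactory;

public class WdfFromFactory {
  // hypothetical helper; applies the same worddelimiter options solr
  // uses for type 'text'
  public static TokenStream build(Reader reader) {
    WordDelimiterFilterFactory wdff = new WordDelimiterFilterFactory();
    Map<String, String> args = new HashMap<String, String>();
    args.put("generateWordParts", "1");
    args.put("generateNumberParts", "1");
    args.put("catenateWords", "1");
    args.put("catenateNumbers", "1");
    args.put("catenateAll", "0");
    args.put("splitOnCaseChange", "1");
    wdff.init(args);
    // the factory is public even though the filter itself is package-private
    return wdff.create(new WhitespaceTokenizer(reader));
  }
}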

thanks,
robert

On Fri, Jun 5, 2009 at 10:06 AM, Robert Muir<rc...@gmail.com> wrote:
> kk as for your first issue, that WordDelimiterFilter is package
> protected, one option is to make a copy of the code and change the
> class declaration to public.
> the other option is to put your entire analyzer in
> 'org.apache.solr.analysis' package so that you can access it...
>
> for the 2nd issue, yes you need to supply some options to it. the
> default options solr applies to type 'text' seemed to work well for me
> with indic:
>
> {splitOnCaseChange=1, generateNumberParts=1, catenateWords=1,
> generateWordParts=1, catenateAll=0, catenateNumbers=1}



-- 
Robert Muir
rcmuir@gmail.com



Re: How to support stemming and case folding for english content mixed with non-english content?

Posted by Robert Muir <rc...@gmail.com>.
kk as for your first issue, that WordDelimiterFilter is package
protected, one option is to make a copy of the code and change the
class declaration to public.
the other option is to put your entire analyzer in
'org.apache.solr.analysis' package so that you can access it...

for the 2nd issue, yes you need to supply some options to it. the
default options solr applies to type 'text' seemed to work well for me
with indic:

{splitOnCaseChange=1, generateNumberParts=1, catenateWords=1,
generateWordParts=1, catenateAll=0, catenateNumbers=1}
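
a rough sketch of the second option, with those settings hardcoded through
the deprecated 5-int constructor (untested, and the constructor argument
order is from memory, so double-check it against the solr source):

package org.apache.solr.analysis; // same package, so the package-private
                                  // WordDelimiterFilter is visible

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class IndicAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = new WhitespaceTokenizer(reader);
    // generateWordParts, generateNumberParts, catenateWords,
    // catenateNumbers, catenateAll
    ts = new WordDelimiterFilter(ts, 1, 1, 1, 1, 0);
    return ts; // add lowercasing, stop removal, stemming after this as needed
  }
}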

On Fri, Jun 5, 2009 at 9:12 AM, KK <di...@gmail.com> wrote:
>
> Thanks Robert. There is one problem though, I'm able to plugin the word
> delimiter filter from solr-nightly jar file. When I tried to do something
> like,
>  TokenStream ts = new WhitespaceTokenizer(reader);
>   ts = new WordDelimiterFilter(ts);
>   ts = new PorterStemmerFilter(ts);
>   ...rest as in the last mail...
>
> It gave me an error saying that
>
> org.apache.solr.analysis.WordDelimiterFilter is not public in
> org.apache.solr.analysis; cannot be accessed from outside package
> import org.apache.solr.analysis.WordDelimiterFilter;
>                               ^
> solrSearch/IndicAnalyzer.java:38: cannot find symbol
> symbol  : class WordDelimiterFilter
> location: class solrSearch.IndicAnalyzer
>    ts = new WordDelimiterFilter(ts);
>             ^
> 2 errors
>
> Then i tried to see the code for worddelimitefiter from solrnightly src and
> found that there are many deprecated constructors though they require a lot
> of parameters alongwith tokenstream. I went through the solr wiki for
> worddelimiterfilterfactory here,
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-1c9b83870ca7890cd73b193cefed83c283339089
> and say that there also its specified that we've to mention the parameters
> and both are different for indexing and querying.
> I'm kind of stuck here, how do I make use of worddelimiterfilter in my
> custom analyzer, I've to use it anyway.
> In my code I've to make use of worddelimiterfilter and not
> worddelimiterfilterfactory, right? I don't know whats the use of the other
> one. Anyway can you guide me getting rid of the above error. And yes I'll
> change the order of applying the filters as you said.
>
> Thanks,
> KK.
> > as
> > > > of
> > > > > > now
> > > > > > > > i.e
> > > > > > > > > mix of non-english and engish words.
> > > > > > > > > Now to get all this,
> > > > > > > > >  #1. I need some sort of way which will let me know if the
> > > > content
> > > > > is
> > > > > > > > > english or not. If not english just add the tokens to the
> > > > document.
> > > > > > Do
> > > > > > > we
> > > > > > > > > really need language identifiers, as i dont have any other
> > > > content
> > > > > > that
> > > > > > > > > uses
> > > > > > > > > the same script as english other than those \u1234 things for
> > > my
> > > > > > indian
> > > > > > > > > language content. Any smart hack/trick for the same?
> > > > > > > > >  #2. If the its english apply all normal process and add the
> > > > > stemmed
> > > > > > > > token
> > > > > > > > > to document.
> > > > > > > > > For all this I was thinking of iterating earch word of the
> > web
> > > > page
> > > > > > and
> > > > > > > > > apply the above procedure. And finallyadd  the newly created
> > > > > document
> > > > > > > to
> > > > > > > > > the
> > > > > > > > > index.
> > > > > > > > >
> > > > > > > > > I would like some one to guide me in this direction. I'm
> > pretty
> > > > > > people
> > > > > > > > must
> > > > > > > > > have done similar/same thing earlier, I request them to guide
> > > me/
> > > > > > point
> > > > > > > > me
> > > > > > > > > to some tutorials for the same.
> > > > > > > > > Else help me out writing a custom analyzer only if thats not
> > > > going
> > > > > to
> > > > > > > be
> > > > > > > > > too
> > > > > > > > > complex. LOL, I'm a new user to lucene and know basics of
> > Java
> > > > > > coding.
> > > > > > > > > Thank you very much.
> > > > > > > > >
> > > > > > > > > --KK.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Thu, Jun 4, 2009 at 5:30 PM, Robert Muir <
> > rcmuir@gmail.com>
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > yes this is true. for starters KK, might be good to startup
> > > > solr
> > > > > > and
> > > > > > > > look
> > > > > > > > > > at
> > > > > > > > > > http://localhost:8983/solr/admin/analysis.jsp?highlight=on
> > > > > > > > > >
> > > > > > > > > > if you want to stick with lucene, the WordDelimiterFilter
> > is
> > > > the
> > > > > > > piece
> > > > > > > > > you
> > > > > > > > > > will want for your text, mainly for punctuation but also
> > for
> > > > > format
> > > > > > > > > > characters such as ZWJ/ZWNJ.
> > > > > > > > > >
> > > > > > > > > > On Thu, Jun 4, 2009 at 7:51 AM, Uwe Schindler <
> > > uwe@thetaphi.de
> > > > >
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > You can also re-use the solr analyzers, as far as I found
> > > > out.
> > > > > > > There
> > > > > > > > is
> > > > > > > > > > an
> > > > > > > > > > > issue in jIRA/discussion on java-dev to merge them.
> > > > > > > > > > >
> > > > > > > > > > > -----
> > > > > > > > > > > Uwe Schindler
> > > > > > > > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > > > > > > > > > http://www.thetaphi.de
> > > > > > > > > > > eMail: uwe@thetaphi.de
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > From: Robert Muir [mailto:rcmuir@gmail.com]
> > > > > > > > > > > > Sent: Thursday, June 04, 2009 1:18 PM
> > > > > > > > > > > > To: java-user@lucene.apache.org
> > > > > > > > > > > > Subject: Re: How to support stemming and case folding
> > for
> > > > > > english
> > > > > > > > > > content
> > > > > > > > > > > > mixed with non-english content?
> > > > > > > > > > > >
> > > > > > > > > > > > KK, ok, so you only really want to stem the english.
> > This
> > > > is
> > > > > > > good.
> > > > > > > > > > > >
> > > > > > > > > > > > Is it possible for you to consider using solr? solr's
> > > > default
> > > > > > > > > analyzer
> > > > > > > > > > > for
> > > > > > > > > > > > type 'text' will be good for your case. it will do the
> > > > > > following
> > > > > > > > > > > > 1. tokenize on whitespace
> > > > > > > > > > > > 2. handle both indian language and english punctuation
> > > > > > > > > > > > 3. lowercase the english.
> > > > > > > > > > > > 4. stem the english.
> > > > > > > > > > > >
> > > > > > > > > > > > try a nightly build,
> > > > > > > > > > > http://people.apache.org/builds/lucene/solr/nightly/
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Jun 4, 2009 at 1:12 AM, KK <
> > > > > dioxide.software@gmail.com
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Muir, thanks for your response.
> > > > > > > > > > > > > I'm indexing indian language web pages which has got
> > > > > descent
> > > > > > > > amount
> > > > > > > > > > of
> > > > > > > > > > > > > english content mixed with therein. For the time
> > being
> > > > I'm
> > > > > > not
> > > > > > > > > going
> > > > > > > > > > to
> > > > > > > > > > > > use
> > > > > > > > > > > > > any stemmers as we don't have standard stemmers for
> > > > indian
> > > > > > > > > languages
> > > > > > > > > > .
> > > > > > > > > > > > So
> > > > > > > > > > > > > what I want to do is like this,
> > > > > > > > > > > > > Say I've a web page having hindi content with 5%
> > > english
> > > > > > > content.
> > > > > > > > > > Then
> > > > > > > > > > > > for
> > > > > > > > > > > > > hindi I want to use the basic white space analyzer as
> > > we
> > > > > dont
> > > > > > > > have
> > > > > > > > > > > > stemmers
> > > > > > > > > > > > > for this as I mentioned earlier and whereever english
> > > > > appears
> > > > > > I
> > > > > > > > > want
> > > > > > > > > > > > them
> > > > > > > > > > > > > to
> > > > > > > > > > > > > be stemmed tokenized etc[the standard process used
> > for
> > > > > > english
> > > > > > > > > > > content].
> > > > > > > > > > > > As
> > > > > > > > > > > > > of now I'm using whitespace analyzer for the full
> > > content
> > > > > > which
> > > > > > > > > > doesnot
> > > > > > > > > > > > > support case folding, stemming etc for teh content.
> > So
> > > if
> > > > > > there
> > > > > > > > is
> > > > > > > > > an
> > > > > > > > > > > > > english word say "Detection" indexed as such then
> > > > searching
> > > > > > for
> > > > > > > > > > > > detection
> > > > > > > > > > > > > or
> > > > > > > > > > > > > detect is not giving any results, which is the
> > expected
> > > > > > > behavior,
> > > > > > > > > but
> > > > > > > > > > I
> > > > > > > > > > > > > want
> > > > > > > > > > > > > this kind of queries to give results.
> > > > > > > > > > > > > I hope I made it clear. Let me know any ideas on
> > doing
> > > > the
> > > > > > > same.
> > > > > > > > > And
> > > > > > > > > > > one
> > > > > > > > > > > > > more thing, I'm storing the full webpage content
> > under
> > > a
> > > > > > single
> > > > > > > > > > field,
> > > > > > > > > > > I
> > > > > > > > > > > > > hope this will not make any difference, right?
> > > > > > > > > > > > > It seems I've to use language identifiers, but do we
> > > > really
> > > > > > > need
> > > > > > > > > > that?
> > > > > > > > > > > > > Because we've only non-english content mixed with
> > > > > english[and
> > > > > > > not
> > > > > > > > > > > french
> > > > > > > > > > > > or
> > > > > > > > > > > > > russian etc].
> > > > > > > > > > > > >
> > > > > > > > > > > > > What is the best way of approaching the problem? Any
> > > > > > thoughts!
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > KK.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Jun 3, 2009 at 9:42 PM, Robert Muir <
> > > > > > rcmuir@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > KK, is all of your latin script text actually
> > > english?
> > > > Is
> > > > > > > there
> > > > > > > > > > stuff
> > > > > > > > > > > > > like
> > > > > > > > > > > > > > german or french mixed in?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > And for your non-english content (your examples
> > have
> > > > been
> > > > > > > > indian
> > > > > > > > > > > > writing
> > > > > > > > > > > > > > systems), is it generally true that if you had
> > > > > devanagari,
> > > > > > > you
> > > > > > > > > can
> > > > > > > > > > > > assume
> > > > > > > > > > > > > > its hindi? or is there stuff like marathi mixed in?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Reason I say this is to invoke the right stemmers,
> > > you
> > > > > > really
> > > > > > > > > need
> > > > > > > > > > > > some
> > > > > > > > > > > > > > language detection, but perhaps in your case you
> > can
> > > > > cheat
> > > > > > > and
> > > > > > > > > > detect
> > > > > > > > > > > > > this
> > > > > > > > > > > > > > based on scripts...
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > Robert
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Jun 3, 2009 at 10:15 AM, KK <
> > > > > > > > dioxide.software@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi All,
> > > > > > > > > > > > > > > I'm indexing some non-english content. But the
> > page
> > > > > also
> > > > > > > > > contains
> > > > > > > > > > > > > english
> > > > > > > > > > > > > > > content. As of now I'm using WhitespaceAnalyzer
> > for
> > > > all
> > > > > > > > content
> > > > > > > > > > and
> > > > > > > > > > > > I'm
> > > > > > > > > > > > > > > storing the full webpage content under a single
> > > > filed.
> > > > > > Now
> > > > > > > we
> > > > > > > > > > > > require
> > > > > > > > > > > > > to
> > > > > > > > > > > > > > > support case folding and stemmming for the
> > english
> > > > > > content
> > > > > > > > > > > > intermingled
> > > > > > > > > > > > > > > with
> > > > > > > > > > > > > > > non-english content. I must metion that we dont
> > > have
> > > > > > > stemming
> > > > > > > > > and
> > > > > > > > > > > > case
> > > > > > > > > > > > > > > folding for these non-english content. I'm stuck
> > > with
> > > > > > this.
> > > > > > > > > Some
> > > > > > > > > > > one
> > > > > > > > > > > > do
> > > > > > > > > > > > > > let
> > > > > > > > > > > > > > > me know how to proceed for fixing this issue.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > KK.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > Robert Muir
> > > > > > > > > > > > > > rcmuir@gmail.com
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Robert Muir
> > > > > > > > > > > > rcmuir@gmail.com
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Robert Muir
> > > > > > > > > > rcmuir@gmail.com
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Robert Muir
> > > > > > > > rcmuir@gmail.com
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Robert Muir
> > > > > > rcmuir@gmail.com
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Robert Muir
> > > > rcmuir@gmail.com
> > > >
> > >
> >
> >
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com
> >



--
Robert Muir
rcmuir@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: How to support stemming and case folding for english content mixed with non-english content?

Posted by KK <di...@gmail.com>.
Thanks Robert. There is one problem though: I tried to plug in the word
delimiter filter from the solr-nightly jar file. When I do something
like,
 TokenStream ts = new WhitespaceTokenizer(reader);
   ts = new WordDelimiterFilter(ts);
   ts = new PorterStemFilter(ts);
   ...rest as in the last mail...

It gave me an error saying that

org.apache.solr.analysis.WordDelimiterFilter is not public in
org.apache.solr.analysis; cannot be accessed from outside package
import org.apache.solr.analysis.WordDelimiterFilter;
                               ^
solrSearch/IndicAnalyzer.java:38: cannot find symbol
symbol  : class WordDelimiterFilter
location: class solrSearch.IndicAnalyzer
    ts = new WordDelimiterFilter(ts);
             ^
2 errors

Then I looked at the code for WordDelimiterFilter in the solr-nightly
source and found that there are many deprecated constructors, and they
require a lot of parameters along with the TokenStream. I also went
through the solr wiki for WordDelimiterFilterFactory here,
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-1c9b83870ca7890cd73b193cefed83c283339089
and saw that there, too, the parameters have to be specified, and they
are different for indexing and querying.
I'm kind of stuck here: how do I make use of WordDelimiterFilter in my
custom analyzer? I have to use it anyway.
In my code I have to use WordDelimiterFilter and not
WordDelimiterFilterFactory, right? I don't know what the other one is
for. Anyway, can you guide me in getting rid of the above error? And
yes, I'll change the order of applying the filters as you said.
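
If the factory is actually the way to go, is something like this what
you have in mind? A blind sketch based on the wiki page above -- the
init(Map)/create(TokenStream) calls and the parameter names are my
assumptions from that page, so please correct me:

import java.io.Reader;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.*;
import org.apache.solr.analysis.WordDelimiterFilterFactory;

public class IndicAnalyzer extends Analyzer {
  private final WordDelimiterFilterFactory factory;

  public IndicAnalyzer() {
    factory = new WordDelimiterFilterFactory();
    Map<String, String> args = new HashMap<String, String>();
    // parameter names copied from the wiki page, unverified
    args.put("generateWordParts", "1");
    args.put("generateNumberParts", "1");
    args.put("catenateWords", "1");
    args.put("catenateNumbers", "1");
    args.put("catenateAll", "0");
    args.put("splitOnCaseChange", "1");
    factory.init(args);
  }

  public TokenStream tokenStream(String fieldName, Reader reader) {
    // the factory hands back the package-private filter for us
    TokenStream ts = factory.create(new WhitespaceTokenizer(reader));
    ts = new LowerCaseFilter(ts);
    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
    ts = new PorterStemFilter(ts);
    return ts;
  }
}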

Thanks,
KK.







On Fri, Jun 5, 2009 at 5:48 PM, Robert Muir <rc...@gmail.com> wrote:

> KK, you got the right idea.
>
> though I think you might want to change the order, move the stopfilter
> before the porter stem filter... otherwise it might not work correctly.
>
> On Fri, Jun 5, 2009 at 8:05 AM, KK <di...@gmail.com> wrote:
>
> > Thanks Robert. This is exactly what I did and  its working but delimiter
> is
> > missing I'm going to add that from solr-nightly.jar
> >
> > /**
> >  * Analyzer for Indian language.
> >  */
> > public class IndicAnalyzer extends Analyzer {
> >  public TokenStream tokenStream(String fieldName, Reader reader) {
> >     TokenStream ts = new WhitespaceTokenizer(reader);
> >    ts = new PorterStemFilter(ts);
> >    ts = new LowerCaseFilter(ts);
> >    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
> >    return ts;
> >  }
> > }
> >
> > Its able to do stemming/case-folding and supports search for both english
> > and indic texts. let me try out the delimiter. Will update you on that.
> >
> > Thanks a lot.
> > KK
> >
> > On Fri, Jun 5, 2009 at 5:30 PM, Robert Muir <rc...@gmail.com> wrote:
> >
> > > i think you are on the right track... once you build your analyzer, put
> > it
> > > in your classpath and play around with it in luke and see if it does
> what
> > > you want.
> > >
> > > On Fri, Jun 5, 2009 at 3:19 AM, KK <di...@gmail.com> wrote:
> > >
> > > > Hi Robert,
> > > > This is what I copied from ThaiAnalyzer @ lucene contrib
> > > >
> > > > public class ThaiAnalyzer extends Analyzer {
> > > >  public TokenStream tokenStream(String fieldName, Reader reader) {
> > > >      TokenStream ts = new StandardTokenizer(reader);
> > > >    ts = new StandardFilter(ts);
> > > >    ts = new ThaiWordFilter(ts);
> > > >    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
> > > >    return ts;
> > > >  }
> > > > }
> > > >
> > > > Now as you said, I've to use whitespacetokenizer
> > > > withworddelimitefilter[solr
> > > > nightly.jar] stop wordremoval, porter stemmer etc , so it is
> something
> > > like
> > > > this,
> > > > public class IndicAnalyzer extends Analyzer {
> > > >  public TokenStream tokenStream(String fieldName, Reader reader) {
> > > >   TokenStream ts = new WhiteSpaceTokenizer(reader);
> > > >   ts = new WordDelimiterFilter(ts);
> > > >   ts = new LowerCaseFilter(ts);
> > > >   ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS)   //
> english
> > > > stop filter, is this the default one?
> > > >   ts = new PorterFilter(ts);
> > > >   return ts;
> > > >  }
> > > > }
> > > >
> > > > Does this sound OK? I think it will do the job...let me try it out..
> > > > I dont need custom filter as per my requirement, at least not for
> these
> > > > basic things I'm doing? I think so...
> > > >
> > > > Thanks,
> > > > KK.
> > > >
> > > >
> > > > On Thu, Jun 4, 2009 at 6:36 PM, Robert Muir <rc...@gmail.com>
> wrote:
> > > >
> > > > > KK well you can always get some good examples from the lucene
> contrib
> > > > > codebase.
> > > > > For example, look at the DutchAnalyzer, especially:
> > > > >
> > > > > TokenStream tokenStream(String fieldName, Reader reader)
> > > > >
> > > > > See how it combines a specified tokenizer with various filters?
> this
> > is
> > > > > what
> > > > > you want to do, except of course you want to use different
> tokenizer
> > > and
> > > > > filters.
> > > > >
> > > > > On Thu, Jun 4, 2009 at 8:53 AM, KK <di...@gmail.com>
> > wrote:
> > > > >
> > > > > > Thanks Muir.
> > > > > > Thanks for letting me know that I dont need language identifiers.
> > > > > >  I'll have a look and will try to write the analyzer. For my case
> I
> > > > think
> > > > > > it
> > > > > > wont be that difficult.
> > > > > > BTW, can you point me to some sample codes/tutorials writing
> custom
> > > > > > analyzers. I could not find something in LIA2ndEdn. Is something
> > > htere?
> > > > > do
> > > > > > let me know.
> > > > > >
> > > > > > Thanks,
> > > > > > KK.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Thu, Jun 4, 2009 at 6:19 PM, Robert Muir <rc...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > > KK, for your case, you don't really need to go to the effort of
> > > > > detecting
> > > > > > > whether fragments are english or not.
> > > > > > > Because the English stemmers in lucene will not modify your
> Indic
> > > > text,
> > > > > > and
> > > > > > > neither will the LowerCaseFilter.
> > > > > > >
> > > > > > > what you want to do is create a custom analyzer that works like
> > > this
> > > > > > >
> > > > > > > -WhitespaceTokenizer with WordDelimiterFilter [from Solr
> nightly
> > > > jar],
> > > > > > > LowerCaseFilter, StopFilter, and PorterStemFilter-
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Robert
> > > > > > >
> > > > > > > On Thu, Jun 4, 2009 at 8:28 AM, KK <dioxide.software@gmail.com
> >
> > > > wrote:
> > > > > > >
> > > > > > > > Thank you all.
> > > > > > > > To be frank I was using Solr in the begining half a month
> ago.
> > > The
> > > > > > > > problem[rather bug] with solr was creation of new index on
> the
> > > fly.
> > > > > > > Though
> > > > > > > > they have a restful method for teh same, but it was not
> > working.
> > > If
> > > > I
> > > > > > > > remember properly one of Solr commiter "Noble Paul"[I dont
> know
> > > his
> > > > > > real
> > > > > > > > name] was trying to help me. I tried many nightly builds and
> > > > spending
> > > > > a
> > > > > > > > couple of days stuck at that made me think of lucene and I
> > > switched
> > > > > to
> > > > > > > it.
> > > > > > > > Now after working with lucene which gives you full control of
> > > > > > everything
> > > > > > > I
> > > > > > > > don't want to switch to Solr.[LOL, to me Solr:Lucene is
> similar
> > > to
> > > > > > > > Window$:Linux, its my view only, though]. Coming back to the
> > > point
> > > > as
> > > > > > Uwe
> > > > > > > > mentioned that we can do the same thing in lucene as well,
> what
> > > is
> > > > > > > > available
> > > > > > > > in Solr, Solr is based on Lucene only, right?
> > > > > > > > I request Uwe to give me some more ideas on using the
> analyzers
> > > > from
> > > > > > solr
> > > > > > > > that will do the job for me, handling a mix of both english
> and
> > > > > > > non-english
> > > > > > > > content.
> > > > > > > > Muir, can you give me a bit detail description of how to use
> > the
> > > > > > > > WordDelimiteFilter to do my job.
> > > > > > > > On a side note, I was thingking of writing a simple analyzer
> > that
> > > > > will
> > > > > > do
> > > > > > > > the following,
> > > > > > > > #. If the webpage fragment is non-english[for me its some
> > indian
> > > > > > > language]
> > > > > > > > then index them as such, no stemming/ stop word removal to
> > begin
> > > > > with.
> > > > > > As
> > > > > > > I
> > > > > > > > know its in UCN unicode something like
> > > > \u0021\u0012\u34ae\u0031[just
> > > > > a
> > > > > > > > sample]
> > > > > > > > # If the fragment is english then apply standard anlyzing
> > process
> > > > for
> > > > > > > > english content. I've not thought of quering in the same way
> as
> > > of
> > > > > now
> > > > > > > i.e
> > > > > > > > mix of non-english and engish words.
> > > > > > > > Now to get all this,
> > > > > > > >  #1. I need some sort of way which will let me know if the
> > > content
> > > > is
> > > > > > > > english or not. If not english just add the tokens to the
> > > document.
> > > > > Do
> > > > > > we
> > > > > > > > really need language identifiers, as i dont have any other
> > > content
> > > > > that
> > > > > > > > uses
> > > > > > > > the same script as english other than those \u1234 things for
> > my
> > > > > indian
> > > > > > > > language content. Any smart hack/trick for the same?
> > > > > > > >  #2. If the its english apply all normal process and add the
> > > > stemmed
> > > > > > > token
> > > > > > > > to document.
> > > > > > > > For all this I was thinking of iterating earch word of the
> web
> > > page
> > > > > and
> > > > > > > > apply the above procedure. And finallyadd  the newly created
> > > > document
> > > > > > to
> > > > > > > > the
> > > > > > > > index.
> > > > > > > >
> > > > > > > > I would like some one to guide me in this direction. I'm
> pretty
> > > > > people
> > > > > > > must
> > > > > > > > have done similar/same thing earlier, I request them to guide
> > me/
> > > > > point
> > > > > > > me
> > > > > > > > to some tutorials for the same.
> > > > > > > > Else help me out writing a custom analyzer only if thats not
> > > going
> > > > to
> > > > > > be
> > > > > > > > too
> > > > > > > > complex. LOL, I'm a new user to lucene and know basics of
> Java
> > > > > coding.
> > > > > > > > Thank you very much.
> > > > > > > >
> > > > > > > > --KK.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Jun 4, 2009 at 5:30 PM, Robert Muir <
> rcmuir@gmail.com>
> > > > > wrote:
> > > > > > > >
> > > > > > > > > yes this is true. for starters KK, might be good to startup
> > > solr
> > > > > and
> > > > > > > look
> > > > > > > > > at
> > > > > > > > > http://localhost:8983/solr/admin/analysis.jsp?highlight=on
> > > > > > > > >
> > > > > > > > > if you want to stick with lucene, the WordDelimiterFilter
> is
> > > the
> > > > > > piece
> > > > > > > > you
> > > > > > > > > will want for your text, mainly for punctuation but also
> for
> > > > format
> > > > > > > > > characters such as ZWJ/ZWNJ.
> > > > > > > > >
> > > > > > > > > On Thu, Jun 4, 2009 at 7:51 AM, Uwe Schindler <
> > uwe@thetaphi.de
> > > >
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > You can also re-use the solr analyzers, as far as I found
> > > out.
> > > > > > There
> > > > > > > is
> > > > > > > > > an
> > > > > > > > > > issue in jIRA/discussion on java-dev to merge them.
> > > > > > > > > >
> > > > > > > > > > -----
> > > > > > > > > > Uwe Schindler
> > > > > > > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > > > > > > > > http://www.thetaphi.de
> > > > > > > > > > eMail: uwe@thetaphi.de
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: Robert Muir [mailto:rcmuir@gmail.com]
> > > > > > > > > > > Sent: Thursday, June 04, 2009 1:18 PM
> > > > > > > > > > > To: java-user@lucene.apache.org
> > > > > > > > > > > Subject: Re: How to support stemming and case folding
> for
> > > > > english
> > > > > > > > > content
> > > > > > > > > > > mixed with non-english content?
> > > > > > > > > > >
> > > > > > > > > > > KK, ok, so you only really want to stem the english.
> This
> > > is
> > > > > > good.
> > > > > > > > > > >
> > > > > > > > > > > Is it possible for you to consider using solr? solr's
> > > default
> > > > > > > > analyzer
> > > > > > > > > > for
> > > > > > > > > > > type 'text' will be good for your case. it will do the
> > > > > following
> > > > > > > > > > > 1. tokenize on whitespace
> > > > > > > > > > > 2. handle both indian language and english punctuation
> > > > > > > > > > > 3. lowercase the english.
> > > > > > > > > > > 4. stem the english.
> > > > > > > > > > >
> > > > > > > > > > > try a nightly build,
> > > > > > > > > > http://people.apache.org/builds/lucene/solr/nightly/
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Jun 4, 2009 at 1:12 AM, KK <
> > > > dioxide.software@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Muir, thanks for your response.
> > > > > > > > > > > > I'm indexing indian language web pages which has got
> > > > descent
> > > > > > > amount
> > > > > > > > > of
> > > > > > > > > > > > english content mixed with therein. For the time
> being
> > > I'm
> > > > > not
> > > > > > > > going
> > > > > > > > > to
> > > > > > > > > > > use
> > > > > > > > > > > > any stemmers as we don't have standard stemmers for
> > > indian
> > > > > > > > languages
> > > > > > > > > .
> > > > > > > > > > > So
> > > > > > > > > > > > what I want to do is like this,
> > > > > > > > > > > > Say I've a web page having hindi content with 5%
> > english
> > > > > > content.
> > > > > > > > > Then
> > > > > > > > > > > for
> > > > > > > > > > > > hindi I want to use the basic white space analyzer as
> > we
> > > > dont
> > > > > > > have
> > > > > > > > > > > stemmers
> > > > > > > > > > > > for this as I mentioned earlier and whereever english
> > > > appears
> > > > > I
> > > > > > > > want
> > > > > > > > > > > them
> > > > > > > > > > > > to
> > > > > > > > > > > > be stemmed tokenized etc[the standard process used
> for
> > > > > english
> > > > > > > > > > content].
> > > > > > > > > > > As
> > > > > > > > > > > > of now I'm using whitespace analyzer for the full
> > content
> > > > > which
> > > > > > > > > doesnot
> > > > > > > > > > > > support case folding, stemming etc for teh content.
> So
> > if
> > > > > there
> > > > > > > is
> > > > > > > > an
> > > > > > > > > > > > english word say "Detection" indexed as such then
> > > searching
> > > > > for
> > > > > > > > > > > detection
> > > > > > > > > > > > or
> > > > > > > > > > > > detect is not giving any results, which is the
> expected
> > > > > > behavior,
> > > > > > > > but
> > > > > > > > > I
> > > > > > > > > > > > want
> > > > > > > > > > > > this kind of queries to give results.
> > > > > > > > > > > > I hope I made it clear. Let me know any ideas on
> doing
> > > the
> > > > > > same.
> > > > > > > > And
> > > > > > > > > > one
> > > > > > > > > > > > more thing, I'm storing the full webpage content
> under
> > a
> > > > > single
> > > > > > > > > field,
> > > > > > > > > > I
> > > > > > > > > > > > hope this will not make any difference, right?
> > > > > > > > > > > > It seems I've to use language identifiers, but do we
> > > really
> > > > > > need
> > > > > > > > > that?
> > > > > > > > > > > > Because we've only non-english content mixed with
> > > > english[and
> > > > > > not
> > > > > > > > > > french
> > > > > > > > > > > or
> > > > > > > > > > > > russian etc].
> > > > > > > > > > > >
> > > > > > > > > > > > What is the best way of approaching the problem? Any
> > > > > thoughts!
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > KK.
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Jun 3, 2009 at 9:42 PM, Robert Muir <
> > > > > rcmuir@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > KK, is all of your latin script text actually
> > english?
> > > Is
> > > > > > there
> > > > > > > > > stuff
> > > > > > > > > > > > like
> > > > > > > > > > > > > german or french mixed in?
> > > > > > > > > > > > >
> > > > > > > > > > > > > And for your non-english content (your examples
> have
> > > been
> > > > > > > indian
> > > > > > > > > > > writing
> > > > > > > > > > > > > systems), is it generally true that if you had
> > > > devanagari,
> > > > > > you
> > > > > > > > can
> > > > > > > > > > > assume
> > > > > > > > > > > > > its hindi? or is there stuff like marathi mixed in?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Reason I say this is to invoke the right stemmers,
> > you
> > > > > really
> > > > > > > > need
> > > > > > > > > > > some
> > > > > > > > > > > > > language detection, but perhaps in your case you
> can
> > > > cheat
> > > > > > and
> > > > > > > > > detect
> > > > > > > > > > > > this
> > > > > > > > > > > > > based on scripts...
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Robert
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Jun 3, 2009 at 10:15 AM, KK <
> > > > > > > dioxide.software@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi All,
> > > > > > > > > > > > > > I'm indexing some non-english content. But the
> page
> > > > also
> > > > > > > > contains
> > > > > > > > > > > > english
> > > > > > > > > > > > > > content. As of now I'm using WhitespaceAnalyzer
> for
> > > all
> > > > > > > content
> > > > > > > > > and
> > > > > > > > > > > I'm
> > > > > > > > > > > > > > storing the full webpage content under a single
> > > filed.
> > > > > Now
> > > > > > we
> > > > > > > > > > > require
> > > > > > > > > > > > to
> > > > > > > > > > > > > > support case folding and stemmming for the
> english
> > > > > content
> > > > > > > > > > > intermingled
> > > > > > > > > > > > > > with
> > > > > > > > > > > > > > non-english content. I must metion that we dont
> > have
> > > > > > stemming
> > > > > > > > and
> > > > > > > > > > > case
> > > > > > > > > > > > > > folding for these non-english content. I'm stuck
> > with
> > > > > this.
> > > > > > > > Some
> > > > > > > > > > one
> > > > > > > > > > > do
> > > > > > > > > > > > > let
> > > > > > > > > > > > > > me know how to proceed for fixing this issue.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > KK.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Robert Muir
> > > > > > > > > > > > > rcmuir@gmail.com
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Robert Muir
> > > > > > > > > > > rcmuir@gmail.com
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Robert Muir
> > > > > > > > > rcmuir@gmail.com
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Robert Muir
> > > > > > > rcmuir@gmail.com
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Robert Muir
> > > > > rcmuir@gmail.com
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Robert Muir
> > > rcmuir@gmail.com
> > >
> >
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>

Re: How to support stemming and case folding for english content mixed with non-english content?

Posted by Robert Muir <rc...@gmail.com>.
KK, you got the right idea.

Though I think you might want to change the order: move the StopFilter
before the PorterStemFilter... otherwise it might not work correctly,
since the Porter stemmer alters some stop words (e.g. "this" becomes
"thi") before the stop list ever sees them.
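
To be concrete, I mean something like this (just an untested sketch):

import java.io.Reader;
import org.apache.lucene.analysis.*;

public class IndicAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = new WhitespaceTokenizer(reader);
    // lowercase first, so the (lowercase) stop list can match
    ts = new LowerCaseFilter(ts);
    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
    // stem last, on lowercased, stop-word-free tokens
    ts = new PorterStemFilter(ts);
    return ts;
  }
}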

On Fri, Jun 5, 2009 at 8:05 AM, KK <di...@gmail.com> wrote:

> Thanks Robert. This is exactly what I did and  its working but delimiter is
> missing I'm going to add that from solr-nightly.jar
>
> /**
>  * Analyzer for Indian language.
>  */
> public class IndicAnalyzer extends Analyzer {
>  public TokenStream tokenStream(String fieldName, Reader reader) {
>     TokenStream ts = new WhitespaceTokenizer(reader);
>    ts = new PorterStemFilter(ts);
>    ts = new LowerCaseFilter(ts);
>    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
>    return ts;
>  }
> }
>
> Its able to do stemming/case-folding and supports search for both english
> and indic texts. let me try out the delimiter. Will update you on that.
>
> Thanks a lot.
> KK
>
> On Fri, Jun 5, 2009 at 5:30 PM, Robert Muir <rc...@gmail.com> wrote:
>
> > i think you are on the right track... once you build your analyzer, put
> it
> > in your classpath and play around with it in luke and see if it does what
> > you want.
> >
> > On Fri, Jun 5, 2009 at 3:19 AM, KK <di...@gmail.com> wrote:
> >
> > > Hi Robert,
> > > This is what I copied from ThaiAnalyzer @ lucene contrib
> > >
> > > public class ThaiAnalyzer extends Analyzer {
> > >  public TokenStream tokenStream(String fieldName, Reader reader) {
> > >      TokenStream ts = new StandardTokenizer(reader);
> > >    ts = new StandardFilter(ts);
> > >    ts = new ThaiWordFilter(ts);
> > >    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
> > >    return ts;
> > >  }
> > > }
> > >
> > > Now as you said, I've to use whitespacetokenizer
> > > withworddelimitefilter[solr
> > > nightly.jar] stop wordremoval, porter stemmer etc , so it is something
> > like
> > > this,
> > > public class IndicAnalyzer extends Analyzer {
> > >  public TokenStream tokenStream(String fieldName, Reader reader) {
> > >   TokenStream ts = new WhiteSpaceTokenizer(reader);
> > >   ts = new WordDelimiterFilter(ts);
> > >   ts = new LowerCaseFilter(ts);
> > >   ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS)   // english
> > > stop filter, is this the default one?
> > >   ts = new PorterFilter(ts);
> > >   return ts;
> > >  }
> > > }
> > >
> > > Does this sound OK? I think it will do the job...let me try it out..
> > > I dont need custom filter as per my requirement, at least not for these
> > > basic things I'm doing? I think so...
> > >
> > > Thanks,
> > > KK.
> > >
> > >
> > > On Thu, Jun 4, 2009 at 6:36 PM, Robert Muir <rc...@gmail.com> wrote:
> > >
> > > > KK well you can always get some good examples from the lucene contrib
> > > > codebase.
> > > > For example, look at the DutchAnalyzer, especially:
> > > >
> > > > TokenStream tokenStream(String fieldName, Reader reader)
> > > >
> > > > See how it combines a specified tokenizer with various filters? this
> is
> > > > what
> > > > you want to do, except of course you want to use different tokenizer
> > and
> > > > filters.
> > > >
> > > > On Thu, Jun 4, 2009 at 8:53 AM, KK <di...@gmail.com>
> wrote:
> > > >
> > > > > Thanks Muir.
> > > > > Thanks for letting me know that I dont need language identifiers.
> > > > >  I'll have a look and will try to write the analyzer. For my case I
> > > think
> > > > > it
> > > > > wont be that difficult.
> > > > > BTW, can you point me to some sample codes/tutorials writing custom
> > > > > analyzers. I could not find something in LIA2ndEdn. Is something
> > htere?
> > > > do
> > > > > let me know.
> > > > >
> > > > > Thanks,
> > > > > KK.
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Jun 4, 2009 at 6:19 PM, Robert Muir <rc...@gmail.com>
> > wrote:
> > > > >
> > > > > > KK, for your case, you don't really need to go to the effort of
> > > > detecting
> > > > > > whether fragments are english or not.
> > > > > > Because the English stemmers in lucene will not modify your Indic
> > > text,
> > > > > and
> > > > > > neither will the LowerCaseFilter.
> > > > > >
> > > > > > what you want to do is create a custom analyzer that works like
> > this
> > > > > >
> > > > > > -WhitespaceTokenizer with WordDelimiterFilter [from Solr nightly
> > > jar],
> > > > > > LowerCaseFilter, StopFilter, and PorterStemFilter-
> > > > > >
> > > > > > Thanks,
> > > > > > Robert
> > > > > >
> > > > > > On Thu, Jun 4, 2009 at 8:28 AM, KK <di...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > > Thank you all.
> > > > > > > To be frank I was using Solr in the begining half a month ago.
> > The
> > > > > > > problem[rather bug] with solr was creation of new index on the
> > fly.
> > > > > > Though
> > > > > > > they have a restful method for teh same, but it was not
> working.
> > If
> > > I
> > > > > > > remember properly one of Solr commiter "Noble Paul"[I dont know
> > his
> > > > > real
> > > > > > > name] was trying to help me. I tried many nightly builds and
> > > spending
> > > > a
> > > > > > > couple of days stuck at that made me think of lucene and I
> > switched
> > > > to
> > > > > > it.
> > > > > > > Now after working with lucene which gives you full control of
> > > > > everything
> > > > > > I
> > > > > > > don't want to switch to Solr.[LOL, to me Solr:Lucene is similar
> > to
> > > > > > > Window$:Linux, its my view only, though]. Coming back to the
> > point
> > > as
> > > > > Uwe
> > > > > > > mentioned that we can do the same thing in lucene as well, what
> > is
> > > > > > > available
> > > > > > > in Solr, Solr is based on Lucene only, right?
> > > > > > > I request Uwe to give me some more ideas on using the analyzers
> > > from
> > > > > solr
> > > > > > > that will do the job for me, handling a mix of both english and
> > > > > > non-english
> > > > > > > content.
> > > > > > > Muir, can you give me a bit detail description of how to use
> the
> > > > > > > WordDelimiteFilter to do my job.
> > > > > > > On a side note, I was thingking of writing a simple analyzer
> that
> > > > will
> > > > > do
> > > > > > > the following,
> > > > > > > #. If the webpage fragment is non-english[for me its some
> indian
> > > > > > language]
> > > > > > > then index them as such, no stemming/ stop word removal to
> begin
> > > > with.
> > > > > As
> > > > > > I
> > > > > > > know its in UCN unicode something like
> > > \u0021\u0012\u34ae\u0031[just
> > > > a
> > > > > > > sample]
> > > > > > > # If the fragment is english then apply standard anlyzing
> process
> > > for
> > > > > > > english content. I've not thought of quering in the same way as
> > of
> > > > now
> > > > > > i.e
> > > > > > > mix of non-english and engish words.
> > > > > > > Now to get all this,
> > > > > > >  #1. I need some sort of way which will let me know if the
> > content
> > > is
> > > > > > > english or not. If not english just add the tokens to the
> > document.
> > > > Do
> > > > > we
> > > > > > > really need language identifiers, as i dont have any other
> > content
> > > > that
> > > > > > > uses
> > > > > > > the same script as english other than those \u1234 things for
> my
> > > > indian
> > > > > > > language content. Any smart hack/trick for the same?
> > > > > > >  #2. If the its english apply all normal process and add the
> > > stemmed
> > > > > > token
> > > > > > > to document.
> > > > > > > For all this I was thinking of iterating earch word of the web
> > page
> > > > and
> > > > > > > apply the above procedure. And finallyadd  the newly created
> > > document
> > > > > to
> > > > > > > the
> > > > > > > index.
> > > > > > >
> > > > > > > I would like some one to guide me in this direction. I'm pretty
> > > > people
> > > > > > must
> > > > > > > have done similar/same thing earlier, I request them to guide
> me/
> > > > point
> > > > > > me
> > > > > > > to some tutorials for the same.
> > > > > > > Else help me out writing a custom analyzer only if thats not
> > going
> > > to
> > > > > be
> > > > > > > too
> > > > > > > complex. LOL, I'm a new user to lucene and know basics of Java
> > > > coding.
> > > > > > > Thank you very much.
> > > > > > >
> > > > > > > --KK.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Jun 4, 2009 at 5:30 PM, Robert Muir <rc...@gmail.com>
> > > > wrote:
> > > > > > >
> > > > > > > > yes this is true. for starters KK, might be good to startup
> > solr
> > > > and
> > > > > > look
> > > > > > > > at
> > > > > > > > http://localhost:8983/solr/admin/analysis.jsp?highlight=on
> > > > > > > >
> > > > > > > > if you want to stick with lucene, the WordDelimiterFilter is
> > the
> > > > > piece
> > > > > > > you
> > > > > > > > will want for your text, mainly for punctuation but also for
> > > format
> > > > > > > > characters such as ZWJ/ZWNJ.
> > > > > > > >
> > > > > > > > On Thu, Jun 4, 2009 at 7:51 AM, Uwe Schindler <
> uwe@thetaphi.de
> > >
> > > > > wrote:
> > > > > > > >
> > > > > > > > > You can also re-use the solr analyzers, as far as I found
> > out.
> > > > > There
> > > > > > is
> > > > > > > > an
> > > > > > > > > issue in jIRA/discussion on java-dev to merge them.
> > > > > > > > >
> > > > > > > > > -----
> > > > > > > > > Uwe Schindler
> > > > > > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > > > > > > > http://www.thetaphi.de
> > > > > > > > > eMail: uwe@thetaphi.de
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Robert Muir [mailto:rcmuir@gmail.com]
> > > > > > > > > > Sent: Thursday, June 04, 2009 1:18 PM
> > > > > > > > > > To: java-user@lucene.apache.org
> > > > > > > > > > Subject: Re: How to support stemming and case folding for
> > > > english
> > > > > > > > content
> > > > > > > > > > mixed with non-english content?
> > > > > > > > > >
> > > > > > > > > > KK, ok, so you only really want to stem the english. This
> > is
> > > > > good.
> > > > > > > > > >
> > > > > > > > > > Is it possible for you to consider using solr? solr's
> > default
> > > > > > > analyzer
> > > > > > > > > for
> > > > > > > > > > type 'text' will be good for your case. it will do the
> > > > following
> > > > > > > > > > 1. tokenize on whitespace
> > > > > > > > > > 2. handle both indian language and english punctuation
> > > > > > > > > > 3. lowercase the english.
> > > > > > > > > > 4. stem the english.
> > > > > > > > > >
> > > > > > > > > > try a nightly build,
> > > > > > > > > http://people.apache.org/builds/lucene/solr/nightly/
> > > > > > > > > >
> > > > > > > > > > On Thu, Jun 4, 2009 at 1:12 AM, KK <
> > > dioxide.software@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Muir, thanks for your response.
> > > > > > > > > > > I'm indexing indian language web pages which has got
> > > descent
> > > > > > amount
> > > > > > > > of
> > > > > > > > > > > english content mixed with therein. For the time being
> > I'm
> > > > not
> > > > > > > going
> > > > > > > > to
> > > > > > > > > > use
> > > > > > > > > > > any stemmers as we don't have standard stemmers for
> > indian
> > > > > > > languages
> > > > > > > > .
> > > > > > > > > > So
> > > > > > > > > > > what I want to do is like this,
> > > > > > > > > > > Say I've a web page having hindi content with 5%
> english
> > > > > content.
> > > > > > > > Then
> > > > > > > > > > for
> > > > > > > > > > > hindi I want to use the basic white space analyzer as
> we
> > > dont
> > > > > > have
> > > > > > > > > > stemmers
> > > > > > > > > > > for this as I mentioned earlier and whereever english
> > > appears
> > > > I
> > > > > > > want
> > > > > > > > > > them
> > > > > > > > > > > to
> > > > > > > > > > > be stemmed tokenized etc[the standard process used for
> > > > english
> > > > > > > > > content].
> > > > > > > > > > As
> > > > > > > > > > > of now I'm using whitespace analyzer for the full
> content
> > > > which
> > > > > > > > doesnot
> > > > > > > > > > > support case folding, stemming etc for teh content. So
> if
> > > > there
> > > > > > is
> > > > > > > an
> > > > > > > > > > > english word say "Detection" indexed as such then
> > searching
> > > > for
> > > > > > > > > > detection
> > > > > > > > > > > or
> > > > > > > > > > > detect is not giving any results, which is the expected
> > > > > behavior,
> > > > > > > but
> > > > > > > > I
> > > > > > > > > > > want
> > > > > > > > > > > this kind of queries to give results.
> > > > > > > > > > > I hope I made it clear. Let me know any ideas on doing
> > the
> > > > > same.
> > > > > > > And
> > > > > > > > > one
> > > > > > > > > > > more thing, I'm storing the full webpage content under
> a
> > > > single
> > > > > > > > field,
> > > > > > > > > I
> > > > > > > > > > > hope this will not make any difference, right?
> > > > > > > > > > > It seems I've to use language identifiers, but do we
> > really
> > > > > need
> > > > > > > > that?
> > > > > > > > > > > Because we've only non-english content mixed with
> > > english[and
> > > > > not
> > > > > > > > > french
> > > > > > > > > > or
> > > > > > > > > > > russian etc].
> > > > > > > > > > >
> > > > > > > > > > > What is the best way of approaching the problem? Any
> > > > thoughts!
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > KK.
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Jun 3, 2009 at 9:42 PM, Robert Muir <
> > > > rcmuir@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > KK, is all of your latin script text actually
> english?
> > Is
> > > > > there
> > > > > > > > stuff
> > > > > > > > > > > like
> > > > > > > > > > > > german or french mixed in?
> > > > > > > > > > > >
> > > > > > > > > > > > And for your non-english content (your examples have
> > been
> > > > > > indian
> > > > > > > > > > writing
> > > > > > > > > > > > systems), is it generally true that if you had
> > > devanagari,
> > > > > you
> > > > > > > can
> > > > > > > > > > assume
> > > > > > > > > > > > its hindi? or is there stuff like marathi mixed in?
> > > > > > > > > > > >
> > > > > > > > > > > > Reason I say this is to invoke the right stemmers,
> you
> > > > really
> > > > > > > need
> > > > > > > > > > some
> > > > > > > > > > > > language detection, but perhaps in your case you can
> > > cheat
> > > > > and
> > > > > > > > detect
> > > > > > > > > > > this
> > > > > > > > > > > > based on scripts...
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Robert
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Jun 3, 2009 at 10:15 AM, KK <
> > > > > > dioxide.software@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi All,
> > > > > > > > > > > > > I'm indexing some non-english content. But the page
> > > also
> > > > > > > contains
> > > > > > > > > > > english
> > > > > > > > > > > > > content. As of now I'm using WhitespaceAnalyzer for
> > all
> > > > > > content
> > > > > > > > and
> > > > > > > > > > I'm
> > > > > > > > > > > > > storing the full webpage content under a single
> > filed.
> > > > Now
> > > > > we
> > > > > > > > > > require
> > > > > > > > > > > to
> > > > > > > > > > > > > support case folding and stemmming for the english
> > > > content
> > > > > > > > > > intermingled
> > > > > > > > > > > > > with
> > > > > > > > > > > > > non-english content. I must metion that we dont
> have
> > > > > stemming
> > > > > > > and
> > > > > > > > > > case
> > > > > > > > > > > > > folding for these non-english content. I'm stuck
> with
> > > > this.
> > > > > > > Some
> > > > > > > > > one
> > > > > > > > > > do
> > > > > > > > > > > > let
> > > > > > > > > > > > > me know how to proceed for fixing this issue.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > KK.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Robert Muir
> > > > > > > > > > > > rcmuir@gmail.com
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Robert Muir
> > > > > > > > > > rcmuir@gmail.com
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Robert Muir
> > > > > > > > rcmuir@gmail.com
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Robert Muir
> > > > > > rcmuir@gmail.com
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Robert Muir
> > > > rcmuir@gmail.com
> > > >
> > >
> >
> >
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com
> >
>



-- 
Robert Muir
rcmuir@gmail.com

Re: How to support stemming and case folding for english content mixed with non-english content?

Posted by KK <di...@gmail.com>.
Thanks Robert. This is exactly what I did, and it's working, but the
delimiter filter is missing; I'm going to add that from solr-nightly.jar

/**
 * Analyzer for Indian language.
 */
public class IndicAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = new WhitespaceTokenizer(reader);
    ts = new PorterStemFilter(ts);
    ts = new LowerCaseFilter(ts);
    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
    return ts;
  }
}

It's able to do stemming/case-folding and supports search for both
english and indic text. Let me try out the delimiter filter; I'll
update you on that.
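
For the record, I'm wiring the same analyzer in on both sides, index
and query time, so that e.g. a query for "detect" can hit "Detection".
A rough sketch -- the "content" field name and the Directory setup are
placeholders from my code:

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

// index time: analyzer applied to everything that gets tokenized
IndexWriter writer = new IndexWriter(indexDir, new IndicAnalyzer(),
    true, IndexWriter.MaxFieldLength.UNLIMITED);
// ... writer.addDocument(doc); writer.close();

// query time: same analyzer, so "detect" stems the same way
QueryParser parser = new QueryParser("content", new IndicAnalyzer());
Query query = parser.parse("detect");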

Thanks a lot.
KK

On Fri, Jun 5, 2009 at 5:30 PM, Robert Muir <rc...@gmail.com> wrote:

> i think you are on the right track... once you build your analyzer, put it
> in your classpath and play around with it in luke and see if it does what
> you want.
>
> On Fri, Jun 5, 2009 at 3:19 AM, KK <di...@gmail.com> wrote:
>
> > Hi Robert,
> > This is what I copied from ThaiAnalyzer @ lucene contrib
> >
> > public class ThaiAnalyzer extends Analyzer {
> >  public TokenStream tokenStream(String fieldName, Reader reader) {
> >      TokenStream ts = new StandardTokenizer(reader);
> >    ts = new StandardFilter(ts);
> >    ts = new ThaiWordFilter(ts);
> >    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
> >    return ts;
> >  }
> > }
> >
> > Now as you said, I've to use whitespacetokenizer
> > withworddelimitefilter[solr
> > nightly.jar] stop wordremoval, porter stemmer etc , so it is something
> like
> > this,
> > public class IndicAnalyzer extends Analyzer {
> >  public TokenStream tokenStream(String fieldName, Reader reader) {
> >   TokenStream ts = new WhiteSpaceTokenizer(reader);
> >   ts = new WordDelimiterFilter(ts);
> >   ts = new LowerCaseFilter(ts);
> >   ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS)   // english
> > stop filter, is this the default one?
> >   ts = new PorterFilter(ts);
> >   return ts;
> >  }
> > }
> >
> > Does this sound OK? I think it will do the job...let me try it out..
> > I dont need custom filter as per my requirement, at least not for these
> > basic things I'm doing? I think so...
> >
> > Thanks,
> > KK.
> >
> >
> > On Thu, Jun 4, 2009 at 6:36 PM, Robert Muir <rc...@gmail.com> wrote:
> >
> > > KK well you can always get some good examples from the lucene contrib
> > > codebase.
> > > For example, look at the DutchAnalyzer, especially:
> > >
> > > TokenStream tokenStream(String fieldName, Reader reader)
> > >
> > > See how it combines a specified tokenizer with various filters? this is
> > > what
> > > you want to do, except of course you want to use different tokenizer
> and
> > > filters.
> > >
> > > On Thu, Jun 4, 2009 at 8:53 AM, KK <di...@gmail.com> wrote:
> > >
> > > > Thanks Muir.
> > > > Thanks for letting me know that I dont need language identifiers.
> > > >  I'll have a look and will try to write the analyzer. For my case I
> > think
> > > > it
> > > > wont be that difficult.
> > > > BTW, can you point me to some sample codes/tutorials writing custom
> > > > analyzers. I could not find something in LIA2ndEdn. Is something
> htere?
> > > do
> > > > let me know.
> > > >
> > > > Thanks,
> > > > KK.
> > > >
> > > >
> > > >
> > > > On Thu, Jun 4, 2009 at 6:19 PM, Robert Muir <rc...@gmail.com>
> wrote:
> > > >
> > > > > KK, for your case, you don't really need to go to the effort of
> > > detecting
> > > > > whether fragments are english or not.
> > > > > Because the English stemmers in lucene will not modify your Indic
> > text,
> > > > and
> > > > > neither will the LowerCaseFilter.
> > > > >
> > > > > what you want to do is create a custom analyzer that works like
> this
> > > > >
> > > > > -WhitespaceTokenizer with WordDelimiterFilter [from Solr nightly
> > jar],
> > > > > LowerCaseFilter, StopFilter, and PorterStemFilter-
> > > > >
> > > > > Thanks,
> > > > > Robert
> > > > >
> > > > > On Thu, Jun 4, 2009 at 8:28 AM, KK <di...@gmail.com>
> > wrote:
> > > > >
> > > > > > Thank you all.
> > > > > > To be frank I was using Solr in the begining half a month ago.
> The
> > > > > > problem[rather bug] with solr was creation of new index on the
> fly.
> > > > > Though
> > > > > > they have a restful method for teh same, but it was not working.
> If
> > I
> > > > > > remember properly one of Solr commiter "Noble Paul"[I dont know
> his
> > > > real
> > > > > > name] was trying to help me. I tried many nightly builds and
> > spending
> > > a
> > > > > > couple of days stuck at that made me think of lucene and I
> switched
> > > to
> > > > > it.
> > > > > > Now after working with lucene which gives you full control of
> > > > everything
> > > > > I
> > > > > > don't want to switch to Solr.[LOL, to me Solr:Lucene is similar
> to
> > > > > > Window$:Linux, its my view only, though]. Coming back to the
> point
> > as
> > > > Uwe
> > > > > > mentioned that we can do the same thing in lucene as well, what
> is
> > > > > > available
> > > > > > in Solr, Solr is based on Lucene only, right?
> > > > > > I request Uwe to give me some more ideas on using the analyzers
> > from
> > > > solr
> > > > > > that will do the job for me, handling a mix of both english and
> > > > > non-english
> > > > > > content.
> > > > > > Muir, can you give me a bit detail description of how to use the
> > > > > > WordDelimiteFilter to do my job.
> > > > > > On a side note, I was thingking of writing a simple analyzer that
> > > will
> > > > do
> > > > > > the following,
> > > > > > #. If the webpage fragment is non-english[for me its some indian
> > > > > language]
> > > > > > then index them as such, no stemming/ stop word removal to begin
> > > with.
> > > > As
> > > > > I
> > > > > > know its in UCN unicode something like
> > \u0021\u0012\u34ae\u0031[just
> > > a
> > > > > > sample]
> > > > > > # If the fragment is english then apply standard anlyzing process
> > for
> > > > > > english content. I've not thought of quering in the same way as
> of
> > > now
> > > > > i.e
> > > > > > mix of non-english and engish words.
> > > > > > Now to get all this,
> > > > > >  #1. I need some sort of way which will let me know if the
> content
> > is
> > > > > > english or not. If not english just add the tokens to the
> document.
> > > Do
> > > > we
> > > > > > really need language identifiers, as i dont have any other
> content
> > > that
> > > > > > uses
> > > > > > the same script as english other than those \u1234 things for my
> > > indian
> > > > > > language content. Any smart hack/trick for the same?
> > > > > >  #2. If the its english apply all normal process and add the
> > stemmed
> > > > > token
> > > > > > to document.
> > > > > > For all this I was thinking of iterating earch word of the web
> page
> > > and
> > > > > > apply the above procedure. And finallyadd  the newly created
> > document
> > > > to
> > > > > > the
> > > > > > index.
> > > > > >
> > > > > > I would like some one to guide me in this direction. I'm pretty
> > > people
> > > > > must
> > > > > > have done similar/same thing earlier, I request them to guide me/
> > > point
> > > > > me
> > > > > > to some tutorials for the same.
> > > > > > Else help me out writing a custom analyzer only if thats not
> going
> > to
> > > > be
> > > > > > too
> > > > > > complex. LOL, I'm a new user to lucene and know basics of Java
> > > coding.
> > > > > > Thank you very much.
> > > > > >
> > > > > > --KK.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Thu, Jun 4, 2009 at 5:30 PM, Robert Muir <rc...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > > yes this is true. for starters KK, might be good to startup
> solr
> > > and
> > > > > look
> > > > > > > at
> > > > > > > http://localhost:8983/solr/admin/analysis.jsp?highlight=on
> > > > > > >
> > > > > > > if you want to stick with lucene, the WordDelimiterFilter is
> the
> > > > piece
> > > > > > you
> > > > > > > will want for your text, mainly for punctuation but also for
> > format
> > > > > > > characters such as ZWJ/ZWNJ.
> > > > > > >
> > > > > > > On Thu, Jun 4, 2009 at 7:51 AM, Uwe Schindler <uwe@thetaphi.de
> >
> > > > wrote:
> > > > > > >
> > > > > > > > You can also re-use the solr analyzers, as far as I found
> out.
> > > > There
> > > > > is
> > > > > > > an
> > > > > > > > issue in jIRA/discussion on java-dev to merge them.
> > > > > > > >
> > > > > > > > -----
> > > > > > > > Uwe Schindler
> > > > > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > > > > > > http://www.thetaphi.de
> > > > > > > > eMail: uwe@thetaphi.de
> > > > > > > >
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Robert Muir [mailto:rcmuir@gmail.com]
> > > > > > > > > Sent: Thursday, June 04, 2009 1:18 PM
> > > > > > > > > To: java-user@lucene.apache.org
> > > > > > > > > Subject: Re: How to support stemming and case folding for
> > > english
> > > > > > > content
> > > > > > > > > mixed with non-english content?
> > > > > > > > >
> > > > > > > > > KK, ok, so you only really want to stem the english. This
> is
> > > > good.
> > > > > > > > >
> > > > > > > > > Is it possible for you to consider using solr? solr's
> default
> > > > > > analyzer
> > > > > > > > for
> > > > > > > > > type 'text' will be good for your case. it will do the
> > > following
> > > > > > > > > 1. tokenize on whitespace
> > > > > > > > > 2. handle both indian language and english punctuation
> > > > > > > > > 3. lowercase the english.
> > > > > > > > > 4. stem the english.
> > > > > > > > >
> > > > > > > > > try a nightly build,
> > > > > > > > http://people.apache.org/builds/lucene/solr/nightly/
> > > > > > > > >
> > > > > > > > > On Thu, Jun 4, 2009 at 1:12 AM, KK <
> > dioxide.software@gmail.com
> > > >
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Muir, thanks for your response.
> > > > > > > > > > I'm indexing indian language web pages which has got
> > descent
> > > > > amount
> > > > > > > of
> > > > > > > > > > english content mixed with therein. For the time being
> I'm
> > > not
> > > > > > going
> > > > > > > to
> > > > > > > > > use
> > > > > > > > > > any stemmers as we don't have standard stemmers for
> indian
> > > > > > languages
> > > > > > > .
> > > > > > > > > So
> > > > > > > > > > what I want to do is like this,
> > > > > > > > > > Say I've a web page having hindi content with 5% english
> > > > content.
> > > > > > > Then
> > > > > > > > > for
> > > > > > > > > > hindi I want to use the basic white space analyzer as we
> > dont
> > > > > have
> > > > > > > > > stemmers
> > > > > > > > > > for this as I mentioned earlier and whereever english
> > appears
> > > I
> > > > > > want
> > > > > > > > > them
> > > > > > > > > > to
> > > > > > > > > > be stemmed tokenized etc[the standard process used for
> > > english
> > > > > > > > content].
> > > > > > > > > As
> > > > > > > > > > of now I'm using whitespace analyzer for the full content
> > > which
> > > > > > > doesnot
> > > > > > > > > > support case folding, stemming etc for teh content. So if
> > > there
> > > > > is
> > > > > > an
> > > > > > > > > > english word say "Detection" indexed as such then
> searching
> > > for
> > > > > > > > > detection
> > > > > > > > > > or
> > > > > > > > > > detect is not giving any results, which is the expected
> > > > behavior,
> > > > > > but
> > > > > > > I
> > > > > > > > > > want
> > > > > > > > > > this kind of queries to give results.
> > > > > > > > > > I hope I made it clear. Let me know any ideas on doing
> the
> > > > same.
> > > > > > And
> > > > > > > > one
> > > > > > > > > > more thing, I'm storing the full webpage content under a
> > > single
> > > > > > > field,
> > > > > > > > I
> > > > > > > > > > hope this will not make any difference, right?
> > > > > > > > > > It seems I've to use language identifiers, but do we
> really
> > > > need
> > > > > > > that?
> > > > > > > > > > Because we've only non-english content mixed with
> > english[and
> > > > not
> > > > > > > > french
> > > > > > > > > or
> > > > > > > > > > russian etc].
> > > > > > > > > >
> > > > > > > > > > What is the best way of approaching the problem? Any
> > > thoughts!
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > KK.
> > > > > > > > > >
> > > > > > > > > > On Wed, Jun 3, 2009 at 9:42 PM, Robert Muir <
> > > rcmuir@gmail.com>
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > KK, is all of your latin script text actually english?
> Is
> > > > there
> > > > > > > stuff
> > > > > > > > > > like
> > > > > > > > > > > german or french mixed in?
> > > > > > > > > > >
> > > > > > > > > > > And for your non-english content (your examples have
> been
> > > > > indian
> > > > > > > > > writing
> > > > > > > > > > > systems), is it generally true that if you had
> > devanagari,
> > > > you
> > > > > > can
> > > > > > > > > assume
> > > > > > > > > > > its hindi? or is there stuff like marathi mixed in?
> > > > > > > > > > >
> > > > > > > > > > > Reason I say this is to invoke the right stemmers, you
> > > really
> > > > > > need
> > > > > > > > > some
> > > > > > > > > > > language detection, but perhaps in your case you can
> > cheat
> > > > and
> > > > > > > detect
> > > > > > > > > > this
> > > > > > > > > > > based on scripts...
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Robert
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Jun 3, 2009 at 10:15 AM, KK <
> > > > > dioxide.software@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi All,
> > > > > > > > > > > > I'm indexing some non-english content. But the page
> > also
> > > > > > contains
> > > > > > > > > > english
> > > > > > > > > > > > content. As of now I'm using WhitespaceAnalyzer for
> all
> > > > > content
> > > > > > > and
> > > > > > > > > I'm
> > > > > > > > > > > > storing the full webpage content under a single
> filed.
> > > Now
> > > > we
> > > > > > > > > require
> > > > > > > > > > to
> > > > > > > > > > > > support case folding and stemmming for the english
> > > content
> > > > > > > > > intermingled
> > > > > > > > > > > > with
> > > > > > > > > > > > non-english content. I must metion that we dont have
> > > > stemming
> > > > > > and
> > > > > > > > > case
> > > > > > > > > > > > folding for these non-english content. I'm stuck with
> > > this.
> > > > > > Some
> > > > > > > > one
> > > > > > > > > do
> > > > > > > > > > > let
> > > > > > > > > > > > me know how to proceed for fixing this issue.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > KK.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Robert Muir
> > > > > > > > > > > rcmuir@gmail.com
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Robert Muir
> > > > > > > > > rcmuir@gmail.com
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > ---------------------------------------------------------------------
> > > > > > > > To unsubscribe, e-mail:
> > java-user-unsubscribe@lucene.apache.org
> > > > > > > > For additional commands, e-mail:
> > > java-user-help@lucene.apache.org
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Robert Muir
> > > > > > > rcmuir@gmail.com
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Robert Muir
> > > > > rcmuir@gmail.com
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Robert Muir
> > > rcmuir@gmail.com
> > >
> >
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>

Re: How to support stemming and case folding for english content mixed with non-english content?

Posted by Robert Muir <rc...@gmail.com>.
I think you are on the right track... once you build your analyzer, put it
in your classpath and play around with it in Luke to see if it does what
you want.
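
If Luke is not handy, a quick programmatic check works too. A minimal sketch,
assuming the IndicAnalyzer from this thread and Lucene 2.4-era constructors
(the field name and sample text are placeholders):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;

public class IndexAndSearchCheck {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new IndicAnalyzer();
    RAMDirectory dir = new RAMDirectory();

    // Index with the custom analyzer...
    IndexWriter writer = new IndexWriter(dir, analyzer, true,
        IndexWriter.MaxFieldLength.UNLIMITED);
    Document doc = new Document();
    doc.add(new Field("content", "Detection of mixed text",
        Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc);
    writer.close();

    // ...and query with the same analyzer, so "detect" matches "Detection".
    IndexSearcher searcher = new IndexSearcher(dir);
    Query q = new QueryParser("content", analyzer).parse("detect");
    System.out.println("hits: " + searcher.search(q, 10).totalHits);
    searcher.close();
  }
}

The important part is using the same analyzer at index and query time;
otherwise the stemmed index terms and unstemmed query terms will not match.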

On Fri, Jun 5, 2009 at 3:19 AM, KK <di...@gmail.com> wrote:

> Hi Robert,
> This is what I copied from ThaiAnalyzer @ lucene contrib
>
> public class ThaiAnalyzer extends Analyzer {
>  public TokenStream tokenStream(String fieldName, Reader reader) {
>      TokenStream ts = new StandardTokenizer(reader);
>    ts = new StandardFilter(ts);
>    ts = new ThaiWordFilter(ts);
>    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
>    return ts;
>  }
> }
>
> Now as you said, I've to use whitespacetokenizer
> withworddelimitefilter[solr
> nightly.jar] stop wordremoval, porter stemmer etc , so it is something like
> this,
> public class IndicAnalyzer extends Analyzer {
>  public TokenStream tokenStream(String fieldName, Reader reader) {
>   TokenStream ts = new WhiteSpaceTokenizer(reader);
>   ts = new WordDelimiterFilter(ts);
>   ts = new LowerCaseFilter(ts);
>   ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS)   // english
> stop filter, is this the default one?
>   ts = new PorterFilter(ts);
>   return ts;
>  }
> }
>
> Does this sound OK? I think it will do the job...let me try it out..
> I dont need custom filter as per my requirement, at least not for these
> basic things I'm doing? I think so...
>
> Thanks,
> KK.
>
>
> On Thu, Jun 4, 2009 at 6:36 PM, Robert Muir <rc...@gmail.com> wrote:
>
> > KK well you can always get some good examples from the lucene contrib
> > codebase.
> > For example, look at the DutchAnalyzer, especially:
> >
> > TokenStream tokenStream(String fieldName, Reader reader)
> >
> > See how it combines a specified tokenizer with various filters? this is
> > what
> > you want to do, except of course you want to use different tokenizer and
> > filters.
> >
> > On Thu, Jun 4, 2009 at 8:53 AM, KK <di...@gmail.com> wrote:
> >
> > > Thanks Muir.
> > > Thanks for letting me know that I dont need language identifiers.
> > >  I'll have a look and will try to write the analyzer. For my case I
> think
> > > it
> > > wont be that difficult.
> > > BTW, can you point me to some sample codes/tutorials writing custom
> > > analyzers. I could not find something in LIA2ndEdn. Is something htere?
> > do
> > > let me know.
> > >
> > > Thanks,
> > > KK.
> > >
> > >
> > >
> > > On Thu, Jun 4, 2009 at 6:19 PM, Robert Muir <rc...@gmail.com> wrote:
> > >
> > > > KK, for your case, you don't really need to go to the effort of
> > detecting
> > > > whether fragments are english or not.
> > > > Because the English stemmers in lucene will not modify your Indic
> text,
> > > and
> > > > neither will the LowerCaseFilter.
> > > >
> > > > what you want to do is create a custom analyzer that works like this
> > > >
> > > > -WhitespaceTokenizer with WordDelimiterFilter [from Solr nightly
> jar],
> > > > LowerCaseFilter, StopFilter, and PorterStemFilter-
> > > >
> > > > Thanks,
> > > > Robert
> > > >
> > > > On Thu, Jun 4, 2009 at 8:28 AM, KK <di...@gmail.com>
> wrote:
> > > >
> > > > > Thank you all.
> > > > > To be frank I was using Solr in the begining half a month ago. The
> > > > > problem[rather bug] with solr was creation of new index on the fly.
> > > > Though
> > > > > they have a restful method for teh same, but it was not working. If
> I
> > > > > remember properly one of Solr commiter "Noble Paul"[I dont know his
> > > real
> > > > > name] was trying to help me. I tried many nightly builds and
> spending
> > a
> > > > > couple of days stuck at that made me think of lucene and I switched
> > to
> > > > it.
> > > > > Now after working with lucene which gives you full control of
> > > everything
> > > > I
> > > > > don't want to switch to Solr.[LOL, to me Solr:Lucene is similar to
> > > > > Window$:Linux, its my view only, though]. Coming back to the point
> as
> > > Uwe
> > > > > mentioned that we can do the same thing in lucene as well, what is
> > > > > available
> > > > > in Solr, Solr is based on Lucene only, right?
> > > > > I request Uwe to give me some more ideas on using the analyzers
> from
> > > solr
> > > > > that will do the job for me, handling a mix of both english and
> > > > non-english
> > > > > content.
> > > > > Muir, can you give me a bit detail description of how to use the
> > > > > WordDelimiteFilter to do my job.
> > > > > On a side note, I was thingking of writing a simple analyzer that
> > will
> > > do
> > > > > the following,
> > > > > #. If the webpage fragment is non-english[for me its some indian
> > > > language]
> > > > > then index them as such, no stemming/ stop word removal to begin
> > with.
> > > As
> > > > I
> > > > > know its in UCN unicode something like
> \u0021\u0012\u34ae\u0031[just
> > a
> > > > > sample]
> > > > > # If the fragment is english then apply standard anlyzing process
> for
> > > > > english content. I've not thought of quering in the same way as of
> > now
> > > > i.e
> > > > > mix of non-english and engish words.
> > > > > Now to get all this,
> > > > >  #1. I need some sort of way which will let me know if the content
> is
> > > > > english or not. If not english just add the tokens to the document.
> > Do
> > > we
> > > > > really need language identifiers, as i dont have any other content
> > that
> > > > > uses
> > > > > the same script as english other than those \u1234 things for my
> > indian
> > > > > language content. Any smart hack/trick for the same?
> > > > >  #2. If the its english apply all normal process and add the
> stemmed
> > > > token
> > > > > to document.
> > > > > For all this I was thinking of iterating earch word of the web page
> > and
> > > > > apply the above procedure. And finallyadd  the newly created
> document
> > > to
> > > > > the
> > > > > index.
> > > > >
> > > > > I would like some one to guide me in this direction. I'm pretty
> > people
> > > > must
> > > > > have done similar/same thing earlier, I request them to guide me/
> > point
> > > > me
> > > > > to some tutorials for the same.
> > > > > Else help me out writing a custom analyzer only if thats not going
> to
> > > be
> > > > > too
> > > > > complex. LOL, I'm a new user to lucene and know basics of Java
> > coding.
> > > > > Thank you very much.
> > > > >
> > > > > --KK.
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Jun 4, 2009 at 5:30 PM, Robert Muir <rc...@gmail.com>
> > wrote:
> > > > >
> > > > > > yes this is true. for starters KK, might be good to startup solr
> > and
> > > > look
> > > > > > at
> > > > > > http://localhost:8983/solr/admin/analysis.jsp?highlight=on
> > > > > >
> > > > > > if you want to stick with lucene, the WordDelimiterFilter is the
> > > piece
> > > > > you
> > > > > > will want for your text, mainly for punctuation but also for
> format
> > > > > > characters such as ZWJ/ZWNJ.
> > > > > >
> > > > > > On Thu, Jun 4, 2009 at 7:51 AM, Uwe Schindler <uw...@thetaphi.de>
> > > wrote:
> > > > > >
> > > > > > > You can also re-use the solr analyzers, as far as I found out.
> > > There
> > > > is
> > > > > > an
> > > > > > > issue in jIRA/discussion on java-dev to merge them.
> > > > > > >
> > > > > > > -----
> > > > > > > Uwe Schindler
> > > > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > > > > > http://www.thetaphi.de
> > > > > > > eMail: uwe@thetaphi.de
> > > > > > >
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Robert Muir [mailto:rcmuir@gmail.com]
> > > > > > > > Sent: Thursday, June 04, 2009 1:18 PM
> > > > > > > > To: java-user@lucene.apache.org
> > > > > > > > Subject: Re: How to support stemming and case folding for
> > english
> > > > > > content
> > > > > > > > mixed with non-english content?
> > > > > > > >
> > > > > > > > KK, ok, so you only really want to stem the english. This is
> > > good.
> > > > > > > >
> > > > > > > > Is it possible for you to consider using solr? solr's default
> > > > > analyzer
> > > > > > > for
> > > > > > > > type 'text' will be good for your case. it will do the
> > following
> > > > > > > > 1. tokenize on whitespace
> > > > > > > > 2. handle both indian language and english punctuation
> > > > > > > > 3. lowercase the english.
> > > > > > > > 4. stem the english.
> > > > > > > >
> > > > > > > > try a nightly build,
> > > > > > > http://people.apache.org/builds/lucene/solr/nightly/
> > > > > > > >
> > > > > > > > On Thu, Jun 4, 2009 at 1:12 AM, KK <
> dioxide.software@gmail.com
> > >
> > > > > wrote:
> > > > > > > >
> > > > > > > > > Muir, thanks for your response.
> > > > > > > > > I'm indexing indian language web pages which has got
> descent
> > > > amount
> > > > > > of
> > > > > > > > > english content mixed with therein. For the time being I'm
> > not
> > > > > going
> > > > > > to
> > > > > > > > use
> > > > > > > > > any stemmers as we don't have standard stemmers for indian
> > > > > languages
> > > > > > .
> > > > > > > > So
> > > > > > > > > what I want to do is like this,
> > > > > > > > > Say I've a web page having hindi content with 5% english
> > > content.
> > > > > > Then
> > > > > > > > for
> > > > > > > > > hindi I want to use the basic white space analyzer as we
> dont
> > > > have
> > > > > > > > stemmers
> > > > > > > > > for this as I mentioned earlier and whereever english
> appears
> > I
> > > > > want
> > > > > > > > them
> > > > > > > > > to
> > > > > > > > > be stemmed tokenized etc[the standard process used for
> > english
> > > > > > > content].
> > > > > > > > As
> > > > > > > > > of now I'm using whitespace analyzer for the full content
> > which
> > > > > > doesnot
> > > > > > > > > support case folding, stemming etc for teh content. So if
> > there
> > > > is
> > > > > an
> > > > > > > > > english word say "Detection" indexed as such then searching
> > for
> > > > > > > > detection
> > > > > > > > > or
> > > > > > > > > detect is not giving any results, which is the expected
> > > behavior,
> > > > > but
> > > > > > I
> > > > > > > > > want
> > > > > > > > > this kind of queries to give results.
> > > > > > > > > I hope I made it clear. Let me know any ideas on doing the
> > > same.
> > > > > And
> > > > > > > one
> > > > > > > > > more thing, I'm storing the full webpage content under a
> > single
> > > > > > field,
> > > > > > > I
> > > > > > > > > hope this will not make any difference, right?
> > > > > > > > > It seems I've to use language identifiers, but do we really
> > > need
> > > > > > that?
> > > > > > > > > Because we've only non-english content mixed with
> english[and
> > > not
> > > > > > > french
> > > > > > > > or
> > > > > > > > > russian etc].
> > > > > > > > >
> > > > > > > > > What is the best way of approaching the problem? Any
> > thoughts!
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > KK.
> > > > > > > > >
> > > > > > > > > On Wed, Jun 3, 2009 at 9:42 PM, Robert Muir <
> > rcmuir@gmail.com>
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > KK, is all of your latin script text actually english? Is
> > > there
> > > > > > stuff
> > > > > > > > > like
> > > > > > > > > > german or french mixed in?
> > > > > > > > > >
> > > > > > > > > > And for your non-english content (your examples have been
> > > > indian
> > > > > > > > writing
> > > > > > > > > > systems), is it generally true that if you had
> devanagari,
> > > you
> > > > > can
> > > > > > > > assume
> > > > > > > > > > its hindi? or is there stuff like marathi mixed in?
> > > > > > > > > >
> > > > > > > > > > Reason I say this is to invoke the right stemmers, you
> > really
> > > > > need
> > > > > > > > some
> > > > > > > > > > language detection, but perhaps in your case you can
> cheat
> > > and
> > > > > > detect
> > > > > > > > > this
> > > > > > > > > > based on scripts...
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Robert
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Wed, Jun 3, 2009 at 10:15 AM, KK <
> > > > dioxide.software@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi All,
> > > > > > > > > > > I'm indexing some non-english content. But the page
> also
> > > > > contains
> > > > > > > > > english
> > > > > > > > > > > content. As of now I'm using WhitespaceAnalyzer for all
> > > > content
> > > > > > and
> > > > > > > > I'm
> > > > > > > > > > > storing the full webpage content under a single filed.
> > Now
> > > we
> > > > > > > > require
> > > > > > > > > to
> > > > > > > > > > > support case folding and stemmming for the english
> > content
> > > > > > > > intermingled
> > > > > > > > > > > with
> > > > > > > > > > > non-english content. I must metion that we dont have
> > > stemming
> > > > > and
> > > > > > > > case
> > > > > > > > > > > folding for these non-english content. I'm stuck with
> > this.
> > > > > Some
> > > > > > > one
> > > > > > > > do
> > > > > > > > > > let
> > > > > > > > > > > me know how to proceed for fixing this issue.
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > KK.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Robert Muir
> > > > > > > > > > rcmuir@gmail.com
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Robert Muir
> > > > > > > > rcmuir@gmail.com
> > > > > > >
> > > > > > >
> > > > > > >
> > > ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail:
> java-user-unsubscribe@lucene.apache.org
> > > > > > > For additional commands, e-mail:
> > java-user-help@lucene.apache.org
> > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Robert Muir
> > > > > > rcmuir@gmail.com
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Robert Muir
> > > > rcmuir@gmail.com
> > > >
> > >
> >
> >
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com
> >
>



-- 
Robert Muir
rcmuir@gmail.com

Re: How to support stemming and case folding for english content mixed with non-english content?

Posted by KK <di...@gmail.com>.
Hi Robert,
This is what I copied from ThaiAnalyzer @ lucene contrib

public class ThaiAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = new StandardTokenizer(reader);
    ts = new StandardFilter(ts);
    ts = new ThaiWordFilter(ts);
    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
    return ts;
  }
}

Now, as you said, I have to use WhitespaceTokenizer with WordDelimiterFilter
[from the Solr nightly jar], stop-word removal, the Porter stemmer, etc., so
it is something like this:
public class IndicAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = new WhitespaceTokenizer(reader);
    ts = new WordDelimiterFilter(ts);
    ts = new LowerCaseFilter(ts);
    // english stop filter -- is this the default one?
    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
    ts = new PorterStemFilter(ts);
    return ts;
  }
}

Does this sound OK? I think it will do the job... let me try it out.
I don't think I need a custom filter for my requirements, at least not for
these basic things I'm doing.
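
On the stop-word question in the comment: StopAnalyzer.ENGLISH_STOP_WORDS is
the stock English list, so yes, that is the default. If a custom list is ever
needed, here is a sketch of the same analyzer with an explicit stop set
(WordDelimiterFilter is left out only to keep the snippet self-contained, and
the stop words are just examples):

import java.io.Reader;
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class IndicAnalyzerCustomStops extends Analyzer {
  // Example stop set; replace with whatever list suits the corpus.
  private static final Set STOP_SET =
      StopFilter.makeStopSet(new String[] { "the", "a", "an", "and" });

  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = new WhitespaceTokenizer(reader);
    ts = new LowerCaseFilter(ts);
    ts = new StopFilter(ts, STOP_SET);
    ts = new PorterStemFilter(ts);
    return ts;
  }
}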

Thanks,
KK.


On Thu, Jun 4, 2009 at 6:36 PM, Robert Muir <rc...@gmail.com> wrote:

> KK well you can always get some good examples from the lucene contrib
> codebase.
> For example, look at the DutchAnalyzer, especially:
>
> TokenStream tokenStream(String fieldName, Reader reader)
>
> See how it combines a specified tokenizer with various filters? this is
> what
> you want to do, except of course you want to use different tokenizer and
> filters.
>
> On Thu, Jun 4, 2009 at 8:53 AM, KK <di...@gmail.com> wrote:
>
> > Thanks Muir.
> > Thanks for letting me know that I dont need language identifiers.
> >  I'll have a look and will try to write the analyzer. For my case I think
> > it
> > wont be that difficult.
> > BTW, can you point me to some sample codes/tutorials writing custom
> > analyzers. I could not find something in LIA2ndEdn. Is something htere?
> do
> > let me know.
> >
> > Thanks,
> > KK.
> >
> >
> >
> > On Thu, Jun 4, 2009 at 6:19 PM, Robert Muir <rc...@gmail.com> wrote:
> >
> > > KK, for your case, you don't really need to go to the effort of
> detecting
> > > whether fragments are english or not.
> > > Because the English stemmers in lucene will not modify your Indic text,
> > and
> > > neither will the LowerCaseFilter.
> > >
> > > what you want to do is create a custom analyzer that works like this
> > >
> > > -WhitespaceTokenizer with WordDelimiterFilter [from Solr nightly jar],
> > > LowerCaseFilter, StopFilter, and PorterStemFilter-
> > >
> > > Thanks,
> > > Robert
> > >
> > > On Thu, Jun 4, 2009 at 8:28 AM, KK <di...@gmail.com> wrote:
> > >
> > > > Thank you all.
> > > > To be frank I was using Solr in the begining half a month ago. The
> > > > problem[rather bug] with solr was creation of new index on the fly.
> > > Though
> > > > they have a restful method for teh same, but it was not working. If I
> > > > remember properly one of Solr commiter "Noble Paul"[I dont know his
> > real
> > > > name] was trying to help me. I tried many nightly builds and spending
> a
> > > > couple of days stuck at that made me think of lucene and I switched
> to
> > > it.
> > > > Now after working with lucene which gives you full control of
> > everything
> > > I
> > > > don't want to switch to Solr.[LOL, to me Solr:Lucene is similar to
> > > > Window$:Linux, its my view only, though]. Coming back to the point as
> > Uwe
> > > > mentioned that we can do the same thing in lucene as well, what is
> > > > available
> > > > in Solr, Solr is based on Lucene only, right?
> > > > I request Uwe to give me some more ideas on using the analyzers from
> > solr
> > > > that will do the job for me, handling a mix of both english and
> > > non-english
> > > > content.
> > > > Muir, can you give me a bit detail description of how to use the
> > > > WordDelimiteFilter to do my job.
> > > > On a side note, I was thingking of writing a simple analyzer that
> will
> > do
> > > > the following,
> > > > #. If the webpage fragment is non-english[for me its some indian
> > > language]
> > > > then index them as such, no stemming/ stop word removal to begin
> with.
> > As
> > > I
> > > > know its in UCN unicode something like \u0021\u0012\u34ae\u0031[just
> a
> > > > sample]
> > > > # If the fragment is english then apply standard anlyzing process for
> > > > english content. I've not thought of quering in the same way as of
> now
> > > i.e
> > > > mix of non-english and engish words.
> > > > Now to get all this,
> > > >  #1. I need some sort of way which will let me know if the content is
> > > > english or not. If not english just add the tokens to the document.
> Do
> > we
> > > > really need language identifiers, as i dont have any other content
> that
> > > > uses
> > > > the same script as english other than those \u1234 things for my
> indian
> > > > language content. Any smart hack/trick for the same?
> > > >  #2. If the its english apply all normal process and add the stemmed
> > > token
> > > > to document.
> > > > For all this I was thinking of iterating earch word of the web page
> and
> > > > apply the above procedure. And finallyadd  the newly created document
> > to
> > > > the
> > > > index.
> > > >
> > > > I would like some one to guide me in this direction. I'm pretty
> people
> > > must
> > > > have done similar/same thing earlier, I request them to guide me/
> point
> > > me
> > > > to some tutorials for the same.
> > > > Else help me out writing a custom analyzer only if thats not going to
> > be
> > > > too
> > > > complex. LOL, I'm a new user to lucene and know basics of Java
> coding.
> > > > Thank you very much.
> > > >
> > > > --KK.
> > > >
> > > >
> > > >
> > > > On Thu, Jun 4, 2009 at 5:30 PM, Robert Muir <rc...@gmail.com>
> wrote:
> > > >
> > > > > yes this is true. for starters KK, might be good to startup solr
> and
> > > look
> > > > > at
> > > > > http://localhost:8983/solr/admin/analysis.jsp?highlight=on
> > > > >
> > > > > if you want to stick with lucene, the WordDelimiterFilter is the
> > piece
> > > > you
> > > > > will want for your text, mainly for punctuation but also for format
> > > > > characters such as ZWJ/ZWNJ.
> > > > >
> > > > > On Thu, Jun 4, 2009 at 7:51 AM, Uwe Schindler <uw...@thetaphi.de>
> > wrote:
> > > > >
> > > > > > You can also re-use the solr analyzers, as far as I found out.
> > There
> > > is
> > > > > an
> > > > > > issue in jIRA/discussion on java-dev to merge them.
> > > > > >
> > > > > > -----
> > > > > > Uwe Schindler
> > > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > > > > http://www.thetaphi.de
> > > > > > eMail: uwe@thetaphi.de
> > > > > >
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Robert Muir [mailto:rcmuir@gmail.com]
> > > > > > > Sent: Thursday, June 04, 2009 1:18 PM
> > > > > > > To: java-user@lucene.apache.org
> > > > > > > Subject: Re: How to support stemming and case folding for
> english
> > > > > content
> > > > > > > mixed with non-english content?
> > > > > > >
> > > > > > > KK, ok, so you only really want to stem the english. This is
> > good.
> > > > > > >
> > > > > > > Is it possible for you to consider using solr? solr's default
> > > > analyzer
> > > > > > for
> > > > > > > type 'text' will be good for your case. it will do the
> following
> > > > > > > 1. tokenize on whitespace
> > > > > > > 2. handle both indian language and english punctuation
> > > > > > > 3. lowercase the english.
> > > > > > > 4. stem the english.
> > > > > > >
> > > > > > > try a nightly build,
> > > > > > http://people.apache.org/builds/lucene/solr/nightly/
> > > > > > >
> > > > > > > On Thu, Jun 4, 2009 at 1:12 AM, KK <dioxide.software@gmail.com
> >
> > > > wrote:
> > > > > > >
> > > > > > > > Muir, thanks for your response.
> > > > > > > > I'm indexing indian language web pages which has got descent
> > > amount
> > > > > of
> > > > > > > > english content mixed with therein. For the time being I'm
> not
> > > > going
> > > > > to
> > > > > > > use
> > > > > > > > any stemmers as we don't have standard stemmers for indian
> > > > languages
> > > > > .
> > > > > > > So
> > > > > > > > what I want to do is like this,
> > > > > > > > Say I've a web page having hindi content with 5% english
> > content.
> > > > > Then
> > > > > > > for
> > > > > > > > hindi I want to use the basic white space analyzer as we dont
> > > have
> > > > > > > stemmers
> > > > > > > > for this as I mentioned earlier and whereever english appears
> I
> > > > want
> > > > > > > them
> > > > > > > > to
> > > > > > > > be stemmed tokenized etc[the standard process used for
> english
> > > > > > content].
> > > > > > > As
> > > > > > > > of now I'm using whitespace analyzer for the full content
> which
> > > > > doesnot
> > > > > > > > support case folding, stemming etc for teh content. So if
> there
> > > is
> > > > an
> > > > > > > > english word say "Detection" indexed as such then searching
> for
> > > > > > > detection
> > > > > > > > or
> > > > > > > > detect is not giving any results, which is the expected
> > behavior,
> > > > but
> > > > > I
> > > > > > > > want
> > > > > > > > this kind of queries to give results.
> > > > > > > > I hope I made it clear. Let me know any ideas on doing the
> > same.
> > > > And
> > > > > > one
> > > > > > > > more thing, I'm storing the full webpage content under a
> single
> > > > > field,
> > > > > > I
> > > > > > > > hope this will not make any difference, right?
> > > > > > > > It seems I've to use language identifiers, but do we really
> > need
> > > > > that?
> > > > > > > > Because we've only non-english content mixed with english[and
> > not
> > > > > > french
> > > > > > > or
> > > > > > > > russian etc].
> > > > > > > >
> > > > > > > > What is the best way of approaching the problem? Any
> thoughts!
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > KK.
> > > > > > > >
> > > > > > > > On Wed, Jun 3, 2009 at 9:42 PM, Robert Muir <
> rcmuir@gmail.com>
> > > > > wrote:
> > > > > > > >
> > > > > > > > > KK, is all of your latin script text actually english? Is
> > there
> > > > > stuff
> > > > > > > > like
> > > > > > > > > german or french mixed in?
> > > > > > > > >
> > > > > > > > > And for your non-english content (your examples have been
> > > indian
> > > > > > > writing
> > > > > > > > > systems), is it generally true that if you had devanagari,
> > you
> > > > can
> > > > > > > assume
> > > > > > > > > its hindi? or is there stuff like marathi mixed in?
> > > > > > > > >
> > > > > > > > > Reason I say this is to invoke the right stemmers, you
> really
> > > > need
> > > > > > > some
> > > > > > > > > language detection, but perhaps in your case you can cheat
> > and
> > > > > detect
> > > > > > > > this
> > > > > > > > > based on scripts...
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Robert
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, Jun 3, 2009 at 10:15 AM, KK <
> > > dioxide.software@gmail.com>
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi All,
> > > > > > > > > > I'm indexing some non-english content. But the page also
> > > > contains
> > > > > > > > english
> > > > > > > > > > content. As of now I'm using WhitespaceAnalyzer for all
> > > content
> > > > > and
> > > > > > > I'm
> > > > > > > > > > storing the full webpage content under a single filed.
> Now
> > we
> > > > > > > require
> > > > > > > > to
> > > > > > > > > > support case folding and stemmming for the english
> content
> > > > > > > intermingled
> > > > > > > > > > with
> > > > > > > > > > non-english content. I must metion that we dont have
> > stemming
> > > > and
> > > > > > > case
> > > > > > > > > > folding for these non-english content. I'm stuck with
> this.
> > > > Some
> > > > > > one
> > > > > > > do
> > > > > > > > > let
> > > > > > > > > > me know how to proceed for fixing this issue.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > KK.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Robert Muir
> > > > > > > > > rcmuir@gmail.com
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Robert Muir
> > > > > > > rcmuir@gmail.com
> > > > > >
> > > > > >
> > > > > >
> > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > For additional commands, e-mail:
> java-user-help@lucene.apache.org
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Robert Muir
> > > > > rcmuir@gmail.com
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Robert Muir
> > > rcmuir@gmail.com
> > >
> >
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>

Re: How to support stemming and case folding for english content mixed with non-english content?

Posted by KK <di...@gmail.com>.
Hello Robert,
I was thinking of chaining analyzers; does this sound logical?
Currently I'm using the whitespace analyzer, which tokenizes on whitespace
only. As you mentioned earlier, I don't need language identifiers, which
means I have to pass the full content first through the whitespace analyzer,
then through an analyzer for lowercasing, then one for stemming [do we have
such analyzers, or are they filters? I think they are filters, in which case
I have to go for a custom analyzer]. As you said, the whitespace one will act
on the full content, lowercasing will apply only to the English part, leaving
my Indic content untouched, right? And finally, the stemming will also apply
to English words only, leaving the rest as-is, right?
If passing the content through multiple analyzers is allowed, then I think I
can do that; otherwise I have to write a custom analyzer that does exactly
the same thing, something like the sketch below. What do you say?
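
If chaining is not allowed, the custom analyzer I have in mind would be a
single Tokenizer plus a chain of TokenFilters, roughly like this (a sketch
using only core Lucene classes; the class name is made up):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class ChainedAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = new WhitespaceTokenizer(reader); // one tokenizer first
    ts = new LowerCaseFilter(ts);   // filter: folds case; non-Latin untouched
    ts = new PorterStemFilter(ts);  // filter: stems English tokens only
    return ts;
  }
}

My understanding is that LowerCaseFilter and PorterStemFilter are filters,
so they chain inside one analyzer rather than being analyzers themselves.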

Thanks,
KK.



On Thu, Jun 4, 2009 at 6:36 PM, Robert Muir <rc...@gmail.com> wrote:

> KK well you can always get some good examples from the lucene contrib
> codebase.
> For example, look at the DutchAnalyzer, especially:
>
> TokenStream tokenStream(String fieldName, Reader reader)
>
> See how it combines a specified tokenizer with various filters? this is
> what
> you want to do, except of course you want to use different tokenizer and
> filters.
>
> On Thu, Jun 4, 2009 at 8:53 AM, KK <di...@gmail.com> wrote:
>
> > Thanks Muir.
> > Thanks for letting me know that I dont need language identifiers.
> >  I'll have a look and will try to write the analyzer. For my case I think
> > it
> > wont be that difficult.
> > BTW, can you point me to some sample codes/tutorials writing custom
> > analyzers. I could not find something in LIA2ndEdn. Is something htere?
> do
> > let me know.
> >
> > Thanks,
> > KK.
> >
> >
> >
> > On Thu, Jun 4, 2009 at 6:19 PM, Robert Muir <rc...@gmail.com> wrote:
> >
> > > KK, for your case, you don't really need to go to the effort of
> detecting
> > > whether fragments are english or not.
> > > Because the English stemmers in lucene will not modify your Indic text,
> > and
> > > neither will the LowerCaseFilter.
> > >
> > > what you want to do is create a custom analyzer that works like this
> > >
> > > -WhitespaceTokenizer with WordDelimiterFilter [from Solr nightly jar],
> > > LowerCaseFilter, StopFilter, and PorterStemFilter-
> > >
> > > Thanks,
> > > Robert
> > >
> > > On Thu, Jun 4, 2009 at 8:28 AM, KK <di...@gmail.com> wrote:
> > >
> > > > Thank you all.
> > > > To be frank I was using Solr in the begining half a month ago. The
> > > > problem[rather bug] with solr was creation of new index on the fly.
> > > Though
> > > > they have a restful method for teh same, but it was not working. If I
> > > > remember properly one of Solr commiter "Noble Paul"[I dont know his
> > real
> > > > name] was trying to help me. I tried many nightly builds and spending
> a
> > > > couple of days stuck at that made me think of lucene and I switched
> to
> > > it.
> > > > Now after working with lucene which gives you full control of
> > everything
> > > I
> > > > don't want to switch to Solr.[LOL, to me Solr:Lucene is similar to
> > > > Window$:Linux, its my view only, though]. Coming back to the point as
> > Uwe
> > > > mentioned that we can do the same thing in lucene as well, what is
> > > > available
> > > > in Solr, Solr is based on Lucene only, right?
> > > > I request Uwe to give me some more ideas on using the analyzers from
> > solr
> > > > that will do the job for me, handling a mix of both english and
> > > non-english
> > > > content.
> > > > Muir, can you give me a bit detail description of how to use the
> > > > WordDelimiteFilter to do my job.
> > > > On a side note, I was thingking of writing a simple analyzer that
> will
> > do
> > > > the following,
> > > > #. If the webpage fragment is non-english[for me its some indian
> > > language]
> > > > then index them as such, no stemming/ stop word removal to begin
> with.
> > As
> > > I
> > > > know its in UCN unicode something like \u0021\u0012\u34ae\u0031[just
> a
> > > > sample]
> > > > # If the fragment is english then apply standard anlyzing process for
> > > > english content. I've not thought of quering in the same way as of
> now
> > > i.e
> > > > mix of non-english and engish words.
> > > > Now to get all this,
> > > >  #1. I need some sort of way which will let me know if the content is
> > > > english or not. If not english just add the tokens to the document.
> Do
> > we
> > > > really need language identifiers, as i dont have any other content
> that
> > > > uses
> > > > the same script as english other than those \u1234 things for my
> indian
> > > > language content. Any smart hack/trick for the same?
> > > >  #2. If the its english apply all normal process and add the stemmed
> > > token
> > > > to document.
> > > > For all this I was thinking of iterating earch word of the web page
> and
> > > > apply the above procedure. And finallyadd  the newly created document
> > to
> > > > the
> > > > index.
> > > >
> > > > I would like some one to guide me in this direction. I'm pretty
> people
> > > must
> > > > have done similar/same thing earlier, I request them to guide me/
> point
> > > me
> > > > to some tutorials for the same.
> > > > Else help me out writing a custom analyzer only if thats not going to
> > be
> > > > too
> > > > complex. LOL, I'm a new user to lucene and know basics of Java
> coding.
> > > > Thank you very much.
> > > >
> > > > --KK.
> > > >
> > > >
> > > >
> > > > On Thu, Jun 4, 2009 at 5:30 PM, Robert Muir <rc...@gmail.com>
> wrote:
> > > >
> > > > > yes this is true. for starters KK, might be good to startup solr
> and
> > > look
> > > > > at
> > > > > http://localhost:8983/solr/admin/analysis.jsp?highlight=on
> > > > >
> > > > > if you want to stick with lucene, the WordDelimiterFilter is the
> > piece
> > > > you
> > > > > will want for your text, mainly for punctuation but also for format
> > > > > characters such as ZWJ/ZWNJ.
> > > > >
> > > > > On Thu, Jun 4, 2009 at 7:51 AM, Uwe Schindler <uw...@thetaphi.de>
> > wrote:
> > > > >
> > > > > > You can also re-use the solr analyzers, as far as I found out.
> > There
> > > is
> > > > > an
> > > > > > issue in jIRA/discussion on java-dev to merge them.
> > > > > >
> > > > > > -----
> > > > > > Uwe Schindler
> > > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > > > > http://www.thetaphi.de
> > > > > > eMail: uwe@thetaphi.de
> > > > > >
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Robert Muir [mailto:rcmuir@gmail.com]
> > > > > > > Sent: Thursday, June 04, 2009 1:18 PM
> > > > > > > To: java-user@lucene.apache.org
> > > > > > > Subject: Re: How to support stemming and case folding for
> english
> > > > > content
> > > > > > > mixed with non-english content?
> > > > > > >
> > > > > > > KK, ok, so you only really want to stem the english. This is
> > good.
> > > > > > >
> > > > > > > Is it possible for you to consider using solr? solr's default
> > > > analyzer
> > > > > > for
> > > > > > > type 'text' will be good for your case. it will do the
> following
> > > > > > > 1. tokenize on whitespace
> > > > > > > 2. handle both indian language and english punctuation
> > > > > > > 3. lowercase the english.
> > > > > > > 4. stem the english.
> > > > > > >
> > > > > > > try a nightly build,
> > > > > > http://people.apache.org/builds/lucene/solr/nightly/
> > > > > > >
> > > > > > > On Thu, Jun 4, 2009 at 1:12 AM, KK <dioxide.software@gmail.com
> >
> > > > wrote:
> > > > > > >
> > > > > > > > Muir, thanks for your response.
> > > > > > > > I'm indexing indian language web pages which has got descent
> > > amount
> > > > > of
> > > > > > > > english content mixed with therein. For the time being I'm
> not
> > > > going
> > > > > to
> > > > > > > use
> > > > > > > > any stemmers as we don't have standard stemmers for indian
> > > > languages
> > > > > .
> > > > > > > So
> > > > > > > > what I want to do is like this,
> > > > > > > > Say I've a web page having hindi content with 5% english
> > content.
> > > > > Then
> > > > > > > for
> > > > > > > > hindi I want to use the basic white space analyzer as we dont
> > > have
> > > > > > > stemmers
> > > > > > > > for this as I mentioned earlier and whereever english appears
> I
> > > > want
> > > > > > > them
> > > > > > > > to
> > > > > > > > be stemmed tokenized etc[the standard process used for
> english
> > > > > > content].
> > > > > > > As
> > > > > > > > of now I'm using whitespace analyzer for the full content
> which
> > > > > doesnot
> > > > > > > > support case folding, stemming etc for teh content. So if
> there
> > > is
> > > > an
> > > > > > > > english word say "Detection" indexed as such then searching
> for
> > > > > > > detection
> > > > > > > > or
> > > > > > > > detect is not giving any results, which is the expected
> > behavior,
> > > > but
> > > > > I
> > > > > > > > want
> > > > > > > > this kind of queries to give results.
> > > > > > > > I hope I made it clear. Let me know any ideas on doing the
> > same.
> > > > And
> > > > > > one
> > > > > > > > more thing, I'm storing the full webpage content under a
> single
> > > > > field,
> > > > > > I
> > > > > > > > hope this will not make any difference, right?
> > > > > > > > It seems I've to use language identifiers, but do we really
> > need
> > > > > that?
> > > > > > > > Because we've only non-english content mixed with english[and
> > not
> > > > > > french
> > > > > > > or
> > > > > > > > russian etc].
> > > > > > > >
> > > > > > > > What is the best way of approaching the problem? Any
> thoughts!
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > KK.
> > > > > > > >
> > > > > > > > On Wed, Jun 3, 2009 at 9:42 PM, Robert Muir <
> rcmuir@gmail.com>
> > > > > wrote:
> > > > > > > >
> > > > > > > > > KK, is all of your latin script text actually english? Is
> > there
> > > > > stuff
> > > > > > > > like
> > > > > > > > > german or french mixed in?
> > > > > > > > >
> > > > > > > > > And for your non-english content (your examples have been
> > > indian
> > > > > > > writing
> > > > > > > > > systems), is it generally true that if you had devanagari,
> > you
> > > > can
> > > > > > > assume
> > > > > > > > > its hindi? or is there stuff like marathi mixed in?
> > > > > > > > >
> > > > > > > > > Reason I say this is to invoke the right stemmers, you
> really
> > > > need
> > > > > > > some
> > > > > > > > > language detection, but perhaps in your case you can cheat
> > and
> > > > > detect
> > > > > > > > this
> > > > > > > > > based on scripts...
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Robert
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, Jun 3, 2009 at 10:15 AM, KK <
> > > dioxide.software@gmail.com>
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi All,
> > > > > > > > > > I'm indexing some non-english content. But the page also
> > > > contains
> > > > > > > > english
> > > > > > > > > > content. As of now I'm using WhitespaceAnalyzer for all
> > > content
> > > > > and
> > > > > > > I'm
> > > > > > > > > > storing the full webpage content under a single filed.
> Now
> > we
> > > > > > > require
> > > > > > > > to
> > > > > > > > > > support case folding and stemmming for the english
> content
> > > > > > > intermingled
> > > > > > > > > > with
> > > > > > > > > > non-english content. I must metion that we dont have
> > stemming
> > > > and
> > > > > > > case
> > > > > > > > > > folding for these non-english content. I'm stuck with
> this.
> > > > Some
> > > > > > one
> > > > > > > do
> > > > > > > > > let
> > > > > > > > > > me know how to proceed for fixing this issue.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > KK.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Robert Muir
> > > > > > > > > rcmuir@gmail.com
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Robert Muir
> > > > > > > rcmuir@gmail.com
> > > > > >
> > > > > >
> > > > > >
> > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > For additional commands, e-mail:
> java-user-help@lucene.apache.org
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Robert Muir
> > > > > rcmuir@gmail.com
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Robert Muir
> > > rcmuir@gmail.com
> > >
> >
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>

Re: How to support stemming and case folding for english content mixed with non-english content?

Posted by Robert Muir <rc...@gmail.com>.
KK, well, you can always get some good examples from the lucene contrib
codebase.
For example, look at the DutchAnalyzer, especially:

TokenStream tokenStream(String fieldName, Reader reader)

See how it combines a specified tokenizer with various filters? This is what
you want to do, except of course you want to use a different tokenizer and
filters.
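
For reference, the shape of that method in the contrib DutchAnalyzer is
roughly the following (paraphrased, so check the actual contrib source;
stoptable and excltable are fields the analyzer builds from its stop-word
and stem-exclusion lists):

public TokenStream tokenStream(String fieldName, Reader reader) {
  TokenStream result = new StandardTokenizer(reader);
  result = new StandardFilter(result);
  result = new StopFilter(result, stoptable);       // Dutch stop words
  result = new DutchStemFilter(result, excltable);  // Dutch stemming
  return result;
}

Swap in a WhitespaceTokenizer and the English-oriented filters and you get
the kind of analyzer discussed in this thread.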

On Thu, Jun 4, 2009 at 8:53 AM, KK <di...@gmail.com> wrote:

> Thanks Muir.
> Thanks for letting me know that I dont need language identifiers.
>  I'll have a look and will try to write the analyzer. For my case I think
> it
> wont be that difficult.
> BTW, can you point me to some sample codes/tutorials writing custom
> analyzers. I could not find something in LIA2ndEdn. Is something htere? do
> let me know.
>
> Thanks,
> KK.
>
>
>
> On Thu, Jun 4, 2009 at 6:19 PM, Robert Muir <rc...@gmail.com> wrote:
>
> > KK, for your case, you don't really need to go to the effort of detecting
> > whether fragments are english or not.
> > Because the English stemmers in lucene will not modify your Indic text,
> and
> > neither will the LowerCaseFilter.
> >
> > what you want to do is create a custom analyzer that works like this
> >
> > -WhitespaceTokenizer with WordDelimiterFilter [from Solr nightly jar],
> > LowerCaseFilter, StopFilter, and PorterStemFilter-
> >
> > Thanks,
> > Robert
> >
> > On Thu, Jun 4, 2009 at 8:28 AM, KK <di...@gmail.com> wrote:
> >
> > > Thank you all.
> > > To be frank I was using Solr in the begining half a month ago. The
> > > problem[rather bug] with solr was creation of new index on the fly.
> > Though

Re: How to support stemming and case folding for english content mixed with non-english content?

Posted by KK <di...@gmail.com>.
Thanks Muir.
Thanks for letting me know that I don't need language identifiers.
I'll have a look and will try to write the analyzer. For my case I think it
won't be that difficult.
BTW, can you point me to some sample code/tutorials on writing custom
analyzers? I could not find anything in LIA2ndEdn. Is something there? Do
let me know.

Thanks,
KK.




Re: How to support stemming and case folding for english content mixed with non-english content?

Posted by Robert Muir <rc...@gmail.com>.
KK, for your case, you don't really need to go to the effort of detecting
whether fragments are english or not, because the English stemmers in Lucene
will not modify your Indic text, and neither will the LowerCaseFilter.

What you want to do is create a custom analyzer that works like this:

WhitespaceTokenizer with WordDelimiterFilter [from the Solr nightly jar],
then LowerCaseFilter, StopFilter, and PorterStemFilter.
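A minimal sketch of that chain (written against the Lucene 2.4-era API; the
class name is made up for illustration, the WordDelimiterFilter constructor
flags are an assumption based on the Solr defaults, and the filter has been
reported package-private in some builds, so check its source in your jar):

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.PorterStemFilter;
    import org.apache.lucene.analysis.StopAnalyzer;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.solr.analysis.WordDelimiterFilter;

    public class MixedScriptAnalyzer extends Analyzer {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new WhitespaceTokenizer(reader);
        // Flag order assumed: generateWordParts, generateNumberParts,
        // catenateWords, catenateNumbers, catenateAll.
        ts = new WordDelimiterFilter(ts, 1, 1, 1, 1, 0);
        // Devanagari has no case, so this only touches the Latin text.
        ts = new LowerCaseFilter(ts);
        // English stop words only; Indic tokens pass through.
        ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
        // Porter stemming is a no-op on non-Latin tokens.
        ts = new PorterStemFilter(ts);
        return ts;
      }
    }

Use the same analyzer at index time and query time so both sides produce
identical tokens.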

Thanks,
Robert

-- 
Robert Muir
rcmuir@gmail.com

Re: How to support stemming and case folding for english content mixed with non-english content?

Posted by Robert Muir <rc...@gmail.com>.
Uwe, what KK needs here is 'proper unicode handling'.

Since the latest WordDelimiterFilter has pretty good handling of Unicode
categories, combining it with WhitespaceTokenizer effectively gives you a
pretty good solution for Unicode tokenization.

KK doesn't need detection of anything; the Porter stem filter will simply
leave the Indic text alone... so it will just work.
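If you want to convince yourself, a throwaway check like this should do
(written from memory against the 2.4-era TokenStream API, so treat the
next(Token) loop and Token.term() as assumptions about your exact version):

    import java.io.StringReader;

    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.PorterStemFilter;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    public class PassThroughCheck {
      public static void main(String[] args) throws Exception {
        // "hindi" written in Devanagari, followed by an English word.
        String text = "\u0939\u093f\u0928\u094d\u0926\u0940 Detection";
        TokenStream ts = new PorterStemFilter(
            new LowerCaseFilter(
                new WhitespaceTokenizer(new StringReader(text))));
        for (Token t = ts.next(new Token()); t != null; t = ts.next(t)) {
          // Prints the Devanagari token unchanged, then "detect".
          System.out.println(t.term());
        }
      }
    }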

-- 
Robert Muir
rcmuir@gmail.com

Re: How to support stemming and case folding for english content mixed with non-english content?

Posted by KK <di...@gmail.com>.
Uwe, thanks for your lightning-fast response :-).
I'm looking into that; let me see how far I can go... Also, I request Muir
to point me to the exact analyzer he mentioned in the previous mail.

Thanks,
KK


RE: How to support stemming and case folding for english content mixed with non-english content?

Posted by Uwe Schindler <uw...@thetaphi.de>.
> I request Uwe to give me some more ideas on using the analyzers from solr
> that will do the job for me, handling a mix of both english and non-
> english content.

Look here:
http://lucene.apache.org/solr/api/org/apache/solr/analysis/package-summary.html

As you see, the Solr analyzers are just standard Lucene analyzers. So you
can drop the solr core jar into your project and just use them :-)
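For example, once the jar is on the classpath, something like this should
give you a configured filter (a sketch only; init(Map) and
create(TokenStream) match the Solr 1.3-era factory API as far as I can tell,
and the option values are the ones quoted elsewhere in this thread, so
verify against your build):

    import java.io.StringReader;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.solr.analysis.WordDelimiterFilterFactory;

    public class FactoryExample {
      public static void main(String[] args) {
        // The same options Solr's example 'text' field type uses.
        Map<String, String> opts = new HashMap<String, String>();
        opts.put("generateWordParts", "1");
        opts.put("generateNumberParts", "1");
        opts.put("catenateWords", "1");
        opts.put("catenateNumbers", "1");
        opts.put("catenateAll", "0");
        opts.put("splitOnCaseChange", "1");
        WordDelimiterFilterFactory factory = new WordDelimiterFilterFactory();
        factory.init(opts); // factories are configured from plain string maps
        TokenStream ts = factory.create(
            new WhitespaceTokenizer(new StringReader("Wi-Fi Detection")));
        // ts now splits on punctuation the way Solr's 'text' field does.
      }
    }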

Currently I am not sure which of them is the analyzer Robert means, the one
that can do English stemming while leaving the non-English parts alone, but
that is the place to look for it.

Uwe




Re: How to support stemming and case folding for english content mixed with non-english content?

Posted by KK <di...@gmail.com>.
Thank you all.
To be frank, I was using Solr in the beginning, half a month ago. The
problem [rather, a bug] with Solr was creation of a new index on the fly.
Though they have a RESTful method for the same, it was not working. If I
remember properly, one of the Solr committers, "Noble Paul" [I don't know
his real name], was trying to help me. I tried many nightly builds, and
spending a couple of days stuck at that made me think of Lucene, and I
switched to it. Now, after working with Lucene, which gives you full control
of everything, I don't want to switch to Solr. [LOL, to me Solr:Lucene is
similar to Window$:Linux; it's my view only, though.] Coming back to the
point: as Uwe mentioned, we can do the same thing in Lucene as is available
in Solr, since Solr is based on Lucene only, right?
I request Uwe to give me some more ideas on using the analyzers from Solr
that will do the job for me, handling a mix of both English and non-English
content.
Muir, can you give me a bit more detailed description of how to use the
WordDelimiterFilter to do my job?
On a side note, I was thinking of writing a simple analyzer that will do
the following:
#. If the webpage fragment is non-English [for me it's some Indian
language], then index it as such, no stemming/stop word removal to begin
with. As I know, it's in UCN Unicode, something like
\u0021\u0012\u34ae\u0031 [just a sample].
#. If the fragment is English, then apply the standard analyzing process
for English content. I've not thought of querying in the same way as of
now, i.e. a mix of non-English and English words.
Now, to get all this:
 #1. I need some sort of way which will let me know if the content is
English or not. If not English, just add the tokens to the document. Do we
really need language identifiers, as I don't have any other content that
uses the same script as English, other than those \u1234 things for my
Indian language content? Any smart hack/trick for the same? [See the rough
sketch after this paragraph.]
 #2. If it's English, apply all the normal processing and add the stemmed
tokens to the document.
For all this I was thinking of iterating over each word of the web page and
applying the above procedure, and finally adding the newly created document
to the index.
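Something like this rough check is what I had in mind for #1 (just plain
java.lang.Character, nothing Lucene-specific; treating BASIC_LATIN as
"English" is an assumption that only works because my non-English text is
all in one Indic block):

    import java.lang.Character.UnicodeBlock;

    public class ScriptCheck {
      // Rough check: treat a token as English if every letter in it is
      // basic Latin; a token containing any Indic letter falls through.
      public static boolean looksEnglish(String token) {
        for (int i = 0; i < token.length(); i++) {
          char c = token.charAt(i);
          if (Character.isLetter(c)
              && UnicodeBlock.of(c) != UnicodeBlock.BASIC_LATIN) {
            return false;
          }
        }
        return true;
      }
    }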

I would like someone to guide me in this direction. I'm pretty sure people
must have done a similar/same thing earlier; I request them to guide me /
point me to some tutorials for the same. Else, help me out writing a custom
analyzer, only if that's not going to be too complex. LOL, I'm a new user
of Lucene and know the basics of Java coding.
Thank you very much.

--KK.




Re: How to support stemming and case folding for english content mixed with non-english content?

Posted by Robert Muir <rc...@gmail.com>.
Yes, this is true. For starters, KK, it might be good to start up Solr and
look at
http://localhost:8983/solr/admin/analysis.jsp?highlight=on

If you want to stick with Lucene, the WordDelimiterFilter is the piece you
will want for your text, mainly for punctuation but also for format
characters such as ZWJ/ZWNJ.

-- 
Robert Muir
rcmuir@gmail.com

RE: How to support stemming and case folding for english content mixed with non-english content?

Posted by Uwe Schindler <uw...@thetaphi.de>.
You can also re-use the Solr analyzers, as far as I found out. There is an
issue in JIRA and a discussion on java-dev about merging them.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de




Re: How to support stemming and case folding for english content mixed with non-english content?

Posted by Robert Muir <rc...@gmail.com>.
KK, ok, so you only really want to stem the English. This is good.

Is it possible for you to consider using Solr? Solr's default analyzer for
type 'text' will be good for your case. It will do the following:
1. tokenize on whitespace
2. handle both Indian language and English punctuation
3. lowercase the English
4. stem the English

Try a nightly build: http://people.apache.org/builds/lucene/solr/nightly/

-- 
Robert Muir
rcmuir@gmail.com

Re: How to support stemming and case folding for english content mixed with non-english content?

Posted by KK <di...@gmail.com>.
Muir, thanks for your response.
I'm indexing Indian language web pages which have got a decent amount of
English content mixed in. For the time being I'm not going to use any
stemmers, as we don't have standard stemmers for Indian languages. So what
I want to do is like this:
Say I've a web page having Hindi content with 5% English content. Then for
Hindi I want to use the basic whitespace analyzer, as we don't have stemmers
for this as I mentioned earlier, and wherever English appears I want it to
be stemmed, tokenized, etc. [the standard process used for English content].
As of now I'm using the whitespace analyzer for the full content, which does
not support case folding, stemming, etc. So if there is an English word, say
"Detection", indexed as such, then searching for detection or detect is not
giving any results, which is the expected behavior, but I want these kinds
of queries to give results.
I hope I made it clear. Let me know any ideas on doing the same. And one
more thing: I'm storing the full webpage content under a single field; I
hope this will not make any difference, right?
It seems I've to use language identifiers, but do we really need that?
Because we've only non-English content mixed with English [and not French
or Russian etc.].

What is the best way of approaching the problem? Any thoughts!

Thanks,
KK.
