Posted to solr-user@lucene.apache.org by revathy arun <re...@gmail.com> on 2009/02/16 09:30:47 UTC

indexing Chinese language

Hi,

When I index Chinese content using the Chinese tokenizer and analyzer in Solr
1.3, some of the Chinese text files are getting indexed but others are not.

Since Chinese has several different written variants, such as traditional and
simplified Chinese, which of these does the Chinese tokenizer support, and is
there a way to detect which variant of Chinese a file uses?

Rgds

Re: indexing Chinese language

Posted by James liu <li...@gmail.com>.
First: you don't have to restart Solr. You can replace the old data with new
data and have Solr switch to the new index; the shell scripts that ship with
Solr include something for this.

Second: you don't have to restart Solr; just keep the id the same. For example,
with an old document id:1, title:hi and a new document id:1, title:welcome,
just index the new data and Solr will delete the old document and insert the
new one, like a replace, though it uses more time and resources.

You can find the number of indexed documents on the Solr admin page.
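
To make the overwrite-by-id point concrete, here is a minimal sketch. It
assumes a Solr 1.3 instance at http://localhost:8983/solr and a schema whose
uniqueKey field is "id"; adjust the URL and field names for your setup.

# index a document with id 1
curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' \
  --data-binary '<add><doc><field name="id">1</field><field name="title">hi</field></doc></add>'

# index new content under the same id; the old document is replaced
curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' \
  --data-binary '<add><doc><field name="id">1</field><field name="title">welcome</field></doc></add>'

# commit so the change becomes visible to searchers
curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' \
  --data-binary '<commit/>'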


On Fri, Jun 5, 2009 at 7:42 AM, Fer-Bj <fe...@gmail.com> wrote:

>
> What we usually do to reindex is:
>
> 1. stop Solr
> 2. rm -rf data  (that is, remove everything in /opt/solr/data/)
> 3. mkdir data
> 4. start Solr
> 5. start reindexing... this way we're sure there are no old copies of the
> index left around.
>
> To check the index size we do:
> cd data
> du -sh
>


-- 
regards
j.L ( I live in Shanghai, China)

Re: indexing Chinese language

Posted by Fer-Bj <fe...@gmail.com>.
What we usually do to reindex is:

1. stop Solr
2. rm -rf data  (that is, remove everything in /opt/solr/data/)
3. mkdir data
4. start Solr
5. start reindexing... this way we're sure there are no old copies of the
index left around.

To check the index size we do:
cd data
du -sh
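
A sketch of an alternative that avoids stopping Solr at all, in the spirit of
James's note above: delete everything through the update handler, commit, and
then reindex. This assumes a Solr instance at http://localhost:8983/solr;
adjust the URL for your setup.

# remove all documents from the live index
curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' \
  --data-binary '<delete><query>*:*</query></delete>'

# commit so the deletes take effect, then start reindexing
curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' \
  --data-binary '<commit/>'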



Otis Gospodnetic wrote:
> 
> 
> I can't tell what that analyzer does, but I'm guessing it uses n-grams?
> Maybe consider trying https://issues.apache.org/jira/browse/LUCENE-1629
> instead?
> 
>  Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 

-- 
View this message in context: http://www.nabble.com/indexing-Chienese-langage-tp22033302p23879730.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: indexing Chinese language

Posted by Otis Gospodnetic <ot...@yahoo.com>.
I can't tell what that analyzer does, but I'm guessing it uses n-grams?
Maybe consider trying https://issues.apache.org/jira/browse/LUCENE-1629 instead?

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
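
LUCENE-1629 is the SmartChineseAnalyzer contrib. As a rough sketch of how it
could be wired into schema.xml, assuming the smartcn contrib jar is on Solr's
classpath (the field and type names are placeholders, and the package name
follows the contrib packaging in later Lucene releases):

<fieldType name="text_zh_smart" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
</fieldType>

<field name="content_zh" type="text_zh_smart" indexed="true" stored="true"/>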



----- Original Message ----
> From: Fer-Bj <fe...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Thursday, June 4, 2009 2:20:03 AM
> Subject: Re: indexing Chinese language
> 
> 
> We are trying Solr 1.3 with the Paoding Chinese Analyzer, and after reindexing
> the index size went from 1.5 GB to 2.7 GB.
> 
> Is that expected behavior?
> 
> Is there any switch or trick to avoid the index size nearly doubling?
> 


Re: indexing Chinese language

Posted by Erick Erickson <er...@gmail.com>.
Hmmm, are you quite sure that you emptied the index first and didn't just add
all the documents a second time to the index?

Also, when you say the index almost doubled, were you looking only
at the size of the *directory*? SOLR might have been holding a copy
of the old index open while you built a new one...

Best
Erick
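
A rough way to check that, assuming a Unix-like system and that the index
lives under /opt/solr/data/index (adjust the path and URL to your setup):

# total size of the index directory (old + new segment files, if both exist)
du -sh /opt/solr/data/index

# segment files listed by time; files predating the reindex are leftovers
ls -lht /opt/solr/data/index | head

# after an optimize (or a commit) unused files are normally cleaned up,
# so re-measuring then should show the real size
curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' \
  --data-binary '<optimize/>'
du -sh /opt/solr/data/index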

On Thu, Jun 4, 2009 at 2:20 AM, Fer-Bj <fe...@gmail.com> wrote:

>
> We are trying Solr 1.3 with the Paoding Chinese Analyzer, and after reindexing
> the index size went from 1.5 GB to 2.7 GB.
>
> Is that expected behavior?
>
> Is there any switch or trick to avoid the index size nearly doubling?
>

Re: indexing Chinese language

Posted by Fer-Bj <fe...@gmail.com>.
We are trying Solr 1.3 with the Paoding Chinese Analyzer, and after reindexing
the index size went from 1.5 GB to 2.7 GB.

Is that expected behavior?

Is there any switch or trick to avoid the index size nearly doubling?

Koji Sekiguchi-2 wrote:
> 
> CharFilter can normalize (convert) traditional Chinese to simplified
> Chinese or vice versa,
> if you define a mapping.txt. Here is a sample of Chinese character
> normalization:
> 
> https://issues.apache.org/jira/secure/attachment/12392639/character-normalization.JPG
> 
> See SOLR-822 for details:
> 
> https://issues.apache.org/jira/browse/SOLR-822
> 
> Koji
> 
> 

-- 
View this message in context: http://www.nabble.com/indexing-Chienese-langage-tp22033302p23864358.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: indexing Chinese language

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
CharFilter can normalize (convert) traditional Chinese to simplified
Chinese or vice versa,
if you define a mapping.txt. Here is a sample of Chinese character
normalization:

https://issues.apache.org/jira/secure/attachment/12392639/character-normalization.JPG

See SOLR-822 for details:

https://issues.apache.org/jira/browse/SOLR-822

Koji
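
A rough sketch of what that can look like. SOLR-822 was committed after Solr
1.3, so this assumes a build that includes the CharFilter support; the file
name, field type, and sample mappings below are only placeholders.

mapping.txt (one rule per line, here traditional to simplified):

# traditional -> simplified
"體" => "体"
"語" => "语"

schema.xml (apply the mapping before tokenization):

<fieldType name="text_zh" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>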


revathy arun wrote:
> Hi,
>
> When I index Chinese content using the Chinese tokenizer and analyzer in Solr
> 1.3, some of the Chinese text files are getting indexed but others are not.
>
> Since Chinese has several different written variants, such as traditional and
> simplified Chinese, which of these does the Chinese tokenizer support, and is
> there a way to detect which variant of Chinese a file uses?
>
> Rgds
>
>   


Re: indexing Chinese language

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi,

While some of the characters in simplified and traditional Chinese do differ, the Chinese tokenizer doesn't care - it simply creates ngram tokens.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch 
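
To illustrate the n-gram behaviour: for a sample input such as 中文分词,
Lucene's ChineseTokenizer emits one token per character and CJKTokenizer emits
overlapping bigrams; neither one distinguishes simplified from traditional
characters.

input:            中文分词
ChineseTokenizer: 中 / 文 / 分 / 词
CJKTokenizer:     中文 / 文分 / 分词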




________________________________
From: revathy arun <re...@gmail.com>
To: solr-user@lucene.apache.org
Sent: Monday, February 16, 2009 4:30:47 PM
Subject: indexing Chinese language

Hi,

When I index Chinese content using the Chinese tokenizer and analyzer in Solr
1.3, some of the Chinese text files are getting indexed but others are not.

Since Chinese has several different written variants, such as traditional and
simplified Chinese, which of these does the Chinese tokenizer support, and is
there a way to detect which variant of Chinese a file uses?

Rgds

Re: indexing Chinese language

Posted by James liu <li...@gmail.com>.
On Mon, Feb 16, 2009 at 4:30 PM, revathy arun <re...@gmail.com> wrote:

> Hi,
>
> When I index Chinese content using the Chinese tokenizer and analyzer in Solr
> 1.3, some of the Chinese text files are getting indexed but others are not.
>

Are you sure your analyzer handles it correctly?

If you're not sure, you can use the analysis link on the Solr admin page to
check it.
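
For example, with the default Solr 1.3 admin UI the analysis page is at
http://localhost:8983/solr/admin/analysis.jsp (adjust host and port); paste
some of the failing text there to see exactly which tokens your field type
produces.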


>
> Since Chinese has several different written variants, such as traditional and
> simplified Chinese, which of these does the Chinese tokenizer support, and is
> there a way to detect which variant of Chinese a file uses?
>
> Rgds
>



-- 
regards
j.L ( I live in Shanghai, China)