Posted to dev@lucene.apache.org by "Burton-West, Tom" <tb...@umich.edu> on 2011/12/16 23:44:56 UTC

Shingle filter that reads the script attribute from ICUTokenizer and LUCENE-2906

The ICUTokenizer now adds a script attribute to tokens (as do StandardTokenizer and a couple of others; see LUCENE-2911), for example "Tibetan" or "Han". If the Shingle filter had some provision to make token n-grams only when the script attribute matched a specified script, it would solve both the need to produce character bigrams for CJK (Han) and syllable bigrams for Tibetan. We already opened an issue to create overlapping bigrams for CJK (LUCENE-2906).
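
To make the idea concrete, here is a rough sketch of the kind of filter I have in mind (hypothetical code, not an existing Lucene class; it assumes the ScriptAttribute from the ICU analysis module, and it glosses over offsets and position increments, which a real patch would have to handle):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.icu.tokenattributes.ScriptAttribute;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical: glue each token in the target script to its predecessor,
// producing overlapping bigrams, and pass other scripts through untouched.
public final class ScriptBigramFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final ScriptAttribute scriptAtt = addAttribute(ScriptAttribute.class);
  private final int targetScript; // a UScript constant, e.g. UScript.TIBETAN
  private String prev;            // previous token, if it was in the target script

  public ScriptBigramFilter(TokenStream input, int targetScript) {
    super(input);
    this.targetScript = targetScript;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    if (scriptAtt.getCode() == targetScript) {
      final String curr = termAtt.toString();
      if (prev != null) {
        termAtt.setEmpty().append(prev).append(curr); // overlapping bigram
      }
      prev = curr;
    } else {
      prev = null; // never bigram across a script boundary
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    prev = null;
  }
}

It would be used after an ICUTokenizer, e.g. new ScriptBigramFilter(tokenizer, com.ibm.icu.lang.UScript.TIBETAN).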

Would it make sense to open an issue for modifying the Shingle filter to have configurable script-specific behavior, or is this just another use case for LUCENE-2906?

If it is another use case for LUCENE-2906, then perhaps we need to change the summary of that issue to generalize it beyond CJK.

Any suggestions?

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search


RE: Shingle filter that reads the script attribute from ICUTokenizer and LUCENE-2906

Posted by "Burton-West, Tom" <tb...@umich.edu>.
Thanks Robert,



>> Another idea, apart from your solution, would be to add a tailoring
>> for Tibetan that sets some special attribute indicating 'word-final
>> syllable'. Then this information is not 'lost' and downstream filters
>> can do the right thing.

>> ...So essentially, before doing anything like that, it would be best
>> to know 'the rules of the game' before thinking about any design.

So the ICUTokenizer would have to add that word-final-syllable attribute based on some rules, and then a downstream filter could use the attribute to construct bigrams without creating "stupid" ones.

If we end up doing the project, we will be working with people who have expertise in Tibetan, and hopefully they will be able to tell us the "rules of the game".
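
In other words, something like this purely hypothetical attribute (no such thing exists in Lucene today; a matching WordFinalSyllableAttributeImpl extending AttributeImpl would also be needed for Lucene's reflective attribute factory):

import org.apache.lucene.util.Attribute;

// Hypothetical: a Tibetan tailoring in the tokenizer would set this flag,
// and a downstream bigram filter would refuse to shingle across any token
// marked word-final.
public interface WordFinalSyllableAttribute extends Attribute {
  boolean isWordFinal();
  void setWordFinal(boolean wordFinal);
}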

Tom



Re: Shingle filter that reads the script attribute from ICUTokenizer and LUCENE-2906

Posted by Robert Muir <rc...@gmail.com>.
On Fri, Dec 16, 2011 at 7:32 PM, Burton-West, Tom <tb...@umich.edu> wrote:

> Unfortunately, it sounds like the ICUTokenizer will segment on the Tibetan phrase separators but downstream filters won't know that, so we couldn't have a downstream filter that avoided bigramming across a phrase separator. On the other hand, it might be that "stupid" overlapping bigrams don't hurt retrieval compared to treating syllables as if they were words, i.e. syllable unigrams. (I've not been able to find much published research in English on the issue, and many of the references are to articles in Chinese-language publications. I'm pretty much relying on the article by Hackett and Oard.)
>

Yeah, that's the one I was referring to. I think it's a good article,
but the methods there are "rough", so we don't know for sure.

Again, from my intuition I agree with it, and the solution you mention
might be good, but my general opinion is that it's not simple to make
this a general thing where you just supply a list of scripts and it
'does its thing'.

Another idea, apart from your solution, would be to add a tailoring for
Tibetan that sets some special attribute indicating 'word-final
syllable'. Then this information is not 'lost' and downstream filters
can do the right thing.
It's not a difficult thing to do in the tokenizer, but we would need
more details: a quick glance at some material on Tibetan punctuation
indicates it's not 'this simple': for some syllables the punctuation is
sometimes omitted. Honestly I don't know why this is; maybe it means
there are some syllables that only appear in word-final position? If
so, such important clues should also trigger this attribute. So
essentially, before doing anything like that, it would be best to know
'the rules of the game' before thinking about any design.
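
The extension point for such a tailoring already exists in the ICU module: ICUTokenizer can take an ICUTokenizerConfig, and a config hands back a per-script break iterator. A minimal sketch, with a hypothetical TibetanTokenizerConfig and a placeholder rule string (the real rules are exactly the open question above):

import com.ibm.icu.lang.UScript;
import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.text.RuleBasedBreakIterator;

import org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig;

// Sketch: custom break rules for Tibetan, defaults for everything else.
// The rule below just chunks runs of Tibetan text with status 200; real
// rules would have to encode tsheg vs. shad behavior, and could use rule
// status values to flag word-final syllables.
public class TibetanTokenizerConfig extends DefaultICUTokenizerConfig {
  private static final BreakIterator TIBETAN_BREAKS =
      new RuleBasedBreakIterator("!!forward; [[:Tibetan:]]+ {200};");

  @Override
  public BreakIterator getBreakIterator(int script) {
    return script == UScript.TIBETAN
        ? (BreakIterator) TIBETAN_BREAKS.clone()
        : super.getBreakIterator(script);
  }
}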


-- 
lucidimagination.com



RE: Shingle filter that reads the script attribute from ICUTokenizer and LUCENE-2906

Posted by "Burton-West, Tom" <tb...@umich.edu>.
Hi Robert,

Thanks for the quick and thoughtful response. 

I didn't realize there were all these complexities and thought maybe there was an easy solution :)

We may be involved in a project that involves Tibetan text, and given our current resources and priorities, we would stick it in the same field as the other 400+ languages. I was hoping that, with the script attribute output by the ICUTokenizer, we could figure out a way to do script/language-specific processing for Tibetan without adversely affecting anything else.

>> I suppose to inhibit stupid bigrams you would *not* shingle across shad as well

Unfortunately, it sounds like the ICUTokenizer will segment on the Tibetan phrase separators but downstream filters won't know that, so we couldn't have a downstream filter that avoided bigramming across a phrase separator. On the other hand, it might be that "stupid" overlapping bigrams don't hurt retrieval compared to treating syllables as if they were words, i.e. syllable unigrams. (I've not been able to find much published research in English on the issue, and many of the references are to articles in Chinese-language publications. I'm pretty much relying on the article by Hackett and Oard.)


Tom


Hackett, P. G., & Oard, D. W. (2000). Comparison of word-based and syllable-based retrieval for Tibetan (poster session). In Proceedings of the Fifth International Workshop on Information Retrieval with Asian Languages (IRAL '00), Hong Kong, China, pp. 197-198. doi:10.1145/355214.355242

http://dl.acm.org/citation.cfm?doid=355214.355242



Re: Shingle filter that reads the script attribute from ICUTokenizer and LUCENE-2906

Posted by Robert Muir <rc...@gmail.com>.
On Fri, Dec 16, 2011 at 5:44 PM, Burton-West, Tom <tb...@umich.edu> wrote:
> The ICUTokenizer now adds a script attribute to tokens (as do
> StandardTokenizer and a couple of others; see LUCENE-2911), for example
> "Tibetan" or "Han". If the Shingle filter had some provision to make
> token n-grams only when the script attribute matched a specified
> script, it would solve both the need to produce character bigrams for
> CJK (Han) and syllable bigrams for Tibetan. We already opened an issue
> to create overlapping bigrams for CJK (LUCENE-2906).

Not sure it totally would, because there are some key differences and a
few complications:
1. CJKTokenizer today creates bigrams in runs of "CJK" text, where a run
is something like [IHK]+ (ideographic, hiragana, katakana). There are
different variations on this available too, like bigramming only I+ and
doing something else with the katakana (such as keeping it as a whole
word). The verdict from previous studies seems to be that the options
both tend to work well. But one thing is still for sure: I think it
would be bad here to form bigrams across text that was not contiguous
(e.g. across sentence boundaries). Finally, some CJK normalization
(such as halfwidth/fullwidth conversion) is not a 1:1 replacement, so
the process here should at least be aware of this and consider some
sequences of halfwidth kana as a single 'character'. (Sketches of the
normalization and contiguity points follow after this list.)
2. Unlike the CJK case, where you bigram a "run", Tibetan separates
syllables with special punctuation (tsheg, among other things); that is
why you get syllables as output from these tokenizers in the first
place. So this is already a fundamentally different bigram algorithm:
it's no longer contiguous runs; instead, syllables usually have
something in between, and what that something is tells you whether it's
e.g. a syllable separator or something more like a phrase separator. I
suppose to inhibit stupid bigrams you would *not* shingle across shad
as well... how do you generalize that? The verdict for this language
definitely isn't in yet; I've only seen some very initial rough work on
it, and we aren't totally sure this works well on average.
3. Other "complex" languages besides these also emit syllables "at
best": Thai, Lao, Myanmar, Khmer? Shouldn't we bigram those too?
Except one implementation (ICUTokenizer) emits syllables here (what
type of syllable depends upon the current implementation, too!), and
the other (StandardTokenizer) emits whole phrases as words. It would be
great to bigram the former (we think!), but even more horrible to do it
to the latter. I put "we think" here because there has really been no
work done on this, so it's just intuition/guessing. And to make matters
worse, we have a filter in contrib (ThaiWordFilter) that relies upon
the specifics of how StandardTokenizer screws up Thai tokenization so
it can 'retokenize'.
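
To make the normalization and contiguity points concrete, two sketches (hypothetical code, nothing that exists in Lucene today). First, normalization: halfwidth kana plus a halfwidth sound mark fold into a single composed kana under NFKC, so two code units become one 'character', e.g. with ICU's Normalizer2:

import com.ibm.icu.text.Normalizer2;

public class HalfwidthDemo {
  public static void main(String[] args) {
    // NFKC folds halfwidth katakana HA + halfwidth semi-voiced sound mark
    // into ONE composed kana, so 'character' counting for bigramming has
    // to happen after (or be aware of) this folding.
    Normalizer2 nfkc = Normalizer2.getInstance(null, "nfkc", Normalizer2.Mode.COMPOSE);
    String half = "\uFF8A\uFF9F";        // two code units of halfwidth kana
    String full = nfkc.normalize(half);  // "\u30D1", a single code unit
    System.out.println(half.length() + " -> " + full.length()); // prints: 2 -> 1
  }
}

Second, contiguity: a filter could refuse to bigram unless the previous token's end offset touches the current token's start offset. A minimal sketch, ignoring position increments and offset bookkeeping:

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

// Hypothetical: replace the current token with "prev+curr" only when the
// two tokens are textually adjacent, so we never shingle across
// punctuation or sentence boundaries.
public final class ContiguousBigramFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private String prevTerm;
  private int prevEnd = -1;

  public ContiguousBigramFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    final String curr = termAtt.toString();
    if (prevTerm != null && offsetAtt.startOffset() == prevEnd) {
      termAtt.setEmpty().append(prevTerm).append(curr); // adjacent: bigram
    }
    prevTerm = curr;
    prevEnd = offsetAtt.endOffset();
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    prevTerm = null;
    prevEnd = -1;
  }
}

But note how this already breaks down for Tibetan: a tsheg and a shad both leave a one-character gap between tokens, so offsets alone cannot tell a syllable separator from a phrase separator. That is exactly why the tokenizer itself would have to pass the distinction downstream.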

>
> Would it make sense to open an issue for modifying the Shingle filter to
> have configurable script-specific behavior, or is this just another use
> case for LUCENE-2906?
>
> If it is another use case for LUCENE-2906, then perhaps we need to change
> the summary of that issue to generalize it beyond CJK.
>
> Any suggestions?
>
> Tom Burton-West
> http://www.hathitrust.org/blogs/large-scale-search
>



-- 
lucidimagination.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org