Posted to dev@lucene.apache.org by Dawid Weiss <da...@gmail.com> on 2011/04/01 13:58:21 UTC

add(CharSequence) in automaton builder

Mike, can you remember what ordering is required for
add(CharSequence)? I see it requires INPUT_TYPE.BYTE4

assert fst.getInputType() == FST.INPUT_TYPE.BYTE4;

but this would imply the order of full unicode codepoints on the
input? Is this what String comparators do by default (I doubt, but
wanted to check if you know first).
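
For readers following along, here is a minimal sketch of what "full unicode codepoints on the input" means for a CharSequence. It assumes that BYTE4 input corresponds to one int label per Unicode code point; the class and method names are hypothetical and are not part of the Lucene FST API.

```java
import java.util.Arrays;

final class CodePointLabels {
    /** Returns one int per Unicode code point (a UTF-32 view of the input). */
    static int[] toCodePoints(CharSequence s) {
        // CharSequence.codePoints() joins well-formed surrogate pairs
        // into a single supplementary code point.
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        // "a" followed by U+10000, which takes two chars in a Java String.
        String s = "a\uD800\uDC00";
        System.out.println(s.length());                        // number of UTF-16 code units
        System.out.println(Arrays.toString(toCodePoints(s)));  // one entry per code point
    }
}
```

The point of the sketch is only that the String has three chars but two code points, so ordering by char and ordering by code point are different questions.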

Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: add(CharSequence) in automaton builder

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.

> sorry, since you were talking about the charsequence api to builder, i
> assumed for a second you were working with chars/Strings, and forgot
> about how this is confusingly mixed with, yet distinct from, the whole
> BYTE1/BYTE4 selection in builder :)

I am working with strings because that's what the Lookup API is
providing... which I think should change, but it's something for
another round of patches. The BYTE1/BYTE4 selection is confusing and I
believe at least some sort of documentation should be added there to
clarify what it's for and how it should be used. Again -- something to
clarify as part of another task.

I should have that Lookup impl. ready tomorrow, had to reiterate over
certain things first and it took me longer than expected.

Dawid

Re: add(CharSequence) in automaton builder

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Fri, Apr 1, 2011 at 8:29 AM, Robert Muir <rc...@gmail.com> wrote:

> sorry, since you were talking about the charsequence api to builder, i
> assumed for a second you were working with chars/Strings, and forgot
> about how this is confusingly mixed with, yet distinct from, the whole
> BYTE1/BYTE4 selection in builder :)

It IS really confusing!

Really, the Builder & FST need to be parameterized also on the input
type (it's already parameterized on the output type), but confronting
the required generics to accomplish this was..... scary.

Mike

http://blog.mikemccandless.com

Re: add(CharSequence) in automaton builder

Posted by Robert Muir <rc...@gmail.com>.

On Fri, Apr 1, 2011 at 8:25 AM, Dawid Weiss
<da...@cs.put.poznan.pl> wrote:

> Yes, this is what I also figured out. The unicode code point order is
> also impl. in BytesRef.getUTF8SortedAsUnicodeComparator, correct? For
> what I need I'll use raw utf8 byte order, it doesn't matter as long as
> it's consistent.
>

yes, if you are already working with bytes, definitely just stay with
binary order (utf8 and utf32 are the same order, it's only
utf16/String/chars that are wackos)
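
A small sketch of the "binary order" claim, assuming well-formed input strings: comparing UTF-8 encodings as unsigned bytes gives the same answer as comparing code points, while String.compareTo can disagree. The class and method names are hypothetical; only standard JDK calls are used.

```java
import java.nio.charset.StandardCharsets;

final class Utf8ByteOrder {
    /** Compares the UTF-8 encodings of two strings as unsigned bytes. */
    static int compareUtf8(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            // Java bytes are signed, so mask to get the unsigned value.
            int d = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        // One encoding is a prefix of the other: the shorter sorts first.
        return x.length - y.length;
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";                 // U+FF5E, UTF-8: EF BD 9E
        String supplementary = "\uD800\uDC00"; // U+10000, UTF-8: F0 90 80 80
        System.out.println(Integer.signum(compareUtf8(bmp, supplementary)));
        System.out.println(Integer.signum(bmp.compareTo(supplementary)));
    }
}
```

The two printed signs differ for this pair, which is exactly the UTF-16 quirk being described.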

sorry, since you were talking about the charsequence api to builder, i
assumed for a second you were working with chars/Strings, and forgot
about how this is confusingly mixed with, yet distinct from, the whole
BYTE1/BYTE4 selection in builder :)

Re: add(CharSequence) in automaton builder

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.

> (sorry not mike, but) you are right, String.compareTo() compares in

He, he, thanks Robert. We have these anti-child-abuse commercials on
tv right now "you never know who's on the other side"... how
appropriate for this situation.

> utf-16 order by default. this is not consistent with the order the FST
> builder expects (utf8/utf32 order)

Yes, this is what I also figured out. The unicode code point order is
also impl. in BytesRef.getUTF8SortedAsUnicodeComparator, correct? For
what I need I'll use raw utf8 byte order, it doesn't matter as long as
it's consistent.

Dawid

Re: add(CharSequence) in automaton builder

Posted by Robert Muir <rc...@gmail.com>.

On Fri, Apr 1, 2011 at 7:58 AM, Dawid Weiss <da...@gmail.com> wrote:
> Mike, can you remember what ordering is required for
> add(CharSequence)? I see it requires INPUT_TYPE.BYTE4
>
> assert fst.getInputType() == FST.INPUT_TYPE.BYTE4;
>
> but this would imply the order of full unicode codepoints on the
> input? Is this what String comparators do by default (I doubt, but
> wanted to check if you know first).
>

(sorry not mike, but) you are right, String.compareTo() compares in
utf-16 order by default. this is not consistent with the order the FST
builder expects (utf8/utf32 order)
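
To make the mismatch concrete, here is a tiny illustration with one BMP character above the surrogate range and one supplementary character. The class name and constants are made up for the example.

```java
final class Utf16OrderDemo {
    // U+FF5E is a single UTF-16 code unit: 0xFF5E.
    static final String BMP_HIGH = "\uFF5E";
    // U+10000 is a surrogate pair in UTF-16: 0xD800 0xDC00.
    static final String SUPPLEMENTARY = "\uD800\uDC00";

    /** Sign of String.compareTo, which compares UTF-16 code units. */
    static int utf16Order() {
        return Integer.signum(BMP_HIGH.compareTo(SUPPLEMENTARY));
    }

    /** Sign of comparing the first code points as ints. */
    static int codePointOrder() {
        return Integer.signum(Integer.compare(
                BMP_HIGH.codePointAt(0), SUPPLEMENTARY.codePointAt(0)));
    }

    public static void main(String[] args) {
        System.out.println("UTF-16 order:     " + utf16Order());
        System.out.println("code point order: " + codePointOrder());
    }
}
```

String.compareTo sees 0xFF5E > 0xD800 and sorts U+FF5E last, while by code point U+FF5E < U+10000.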

So if you are going to order the terms before passing them to Builder,
you should either use a utf-16-in-utf-8-order comparator* or simply
use codePointAt and friends and compare those ints (probably
slower...)

different ways of impl'ing the comparator below:
* http://icu-project.org/docs/papers/utf16_code_point_order.html
* http://www.unicode.org/versions/Unicode6.0.0/ch05.pdf
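
A rough sketch of both options mentioned above, assuming well-formed UTF-16 input. The first comparator decodes code points and compares them as ints; the second compares code units and applies a surrogate fix-up at the first difference, in the spirit of the technique described in the ICU paper. The class and field names are hypothetical, and this is not the comparator Lucene itself ships.

```java
import java.util.Arrays;
import java.util.Comparator;

final class CodePointOrder {
    /** Straightforward variant: decode code points and compare them as ints. */
    static final Comparator<CharSequence> BY_CODE_POINT = (a, b) -> {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n) {
            int cpA = Character.codePointAt(a, i);
            int cpB = Character.codePointAt(b, i);
            if (cpA != cpB) {
                return Integer.compare(cpA, cpB);
            }
            // Equal code points occupy the same number of chars in both inputs.
            i += Character.charCount(cpA);
        }
        return Integer.compare(a.length(), b.length());
    };

    /** Code unit variant: only the first differing pair of chars is adjusted. */
    static final Comparator<CharSequence> BY_FIXED_UP_CODE_UNIT = (a, b) -> {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            int ca = a.charAt(i);
            int cb = b.charAt(i);
            if (ca != cb) {
                // Below 0xD800 the UTF-16 and code point orders already agree.
                if (ca >= 0xD800 && cb >= 0xD800) {
                    ca = fixUp(ca);
                    cb = fixUp(cb);
                }
                return Integer.compare(ca, cb);
            }
        }
        return Integer.compare(a.length(), b.length());
    };

    /** Moves surrogates (D800..DFFF) above the BMP range E000..FFFF. */
    private static int fixUp(int c) {
        return c >= 0xE000 ? c - 0x800 : c + 0x2000;
    }

    public static void main(String[] args) {
        String[] terms = {"\uD800\uDC00", "\uFF5E", "a"};
        Arrays.sort(terms, BY_FIXED_UP_CODE_UNIT);
        // Print code points so the resulting order is visible in any console.
        for (String t : terms) {
            System.out.println(Integer.toHexString(t.codePointAt(0)));
        }
    }
}
```

Either comparator can be passed to Arrays.sort or Collections.sort before feeding terms to the builder. The fix-up variant avoids decoding every character, which is the reason to prefer it if sorting cost matters.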
