You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by Dawid Weiss <da...@cs.put.poznan.pl> on 2012/03/30 20:13:50 UTC

Fst - arc output?

I'm doing this on trunk:


        WFSTCompletionLookup cl = new WFSTCompletionLookup();
        TermFreq [] input = new TermFreq [] {
            new TermFreq("cat",   0),
            new TermFreq("chat",  2),
            new TermFreq("fat",   3),
            new TermFreq("feat",  1),
            new TermFreq("sea",   0),
            new TermFreq("seat",  3),
            new TermFreq("swat",  0),
            new TermFreq("sweat", 3),
        };
        cl.build(new TermFreqArrayIterator(input));

        StringWriter sw = new StringWriter();
        Field f = cl.getClass().getDeclaredField("fst");
        f.setAccessible(true);
        FST<?> fst = (FST<?>) f.get(cl);
        Util.toDot(fst, sw, false, true);

        for (TermFreq tf : input) {
            List<LookupResult> lookup =
cl.lookup(tf.term.utf8ToString(), true, 1);
            System.out.println(lookup.get(0));
        }

        System.out.println(sw.toString());

The output is attached. How come the first arcs have such high outputs?

Dawid

Re: Fst - arc output?

Posted by Robert Muir <rc...@gmail.com>.

Well I think your picture demonstrates why its hard to beat? all the
'huge numbers' are in the root arcs which are easily cached (and
cached for latin-1 by default).

So its bad in that it bloats the FST, but ultimately bloats it where
it doesn't hurt.

Of course if someone wants to use this for say, CJK, then we need to
add an option to cache more root arcs (just like Kuromoji does). If it
costs you 1 or 2MB ram to have a faster suggester... i think thats an
ok tradeoff.

but yeah, if there is a more efficient way thats also not bloated, i'm
interested!

On Fri, Mar 30, 2012 at 2:23 PM, Dawid Weiss
<da...@cs.put.poznan.pl> wrote:
> Clear, I didn't think of that and had positive weights in my mind
> only. Thanks for explanation.
>
> Dawid
>
> On Fri, Mar 30, 2012 at 8:17 PM, Robert Muir <rc...@gmail.com> wrote:
>> I think its listed in the documentation (at least the jira issue!).
>> Now that shortestPaths() is generified to any outputs, please feel
>> free to find a more compact/faster output!!!!!!!
>>
>> Basically I experimented with tons of ideas (but the simplest ugliest
>> hack was most successful it seemed?)
>>
>> we take the cost, and encode it as Integer.MAX_VALUE - cost to turn it
>> into a "weight".
>>
>> This is bad because its huge vints, but other approaches i tried were
>> sketchier, requiring encoding of negatives and such (see the recent
>> JIRA issue about that which i tripped in the process!)
>>
>> I think i wrote 5 or 6 outputs impls trying to make this cleaner
>> before just deciding to commit the simple approach...
>>
>> On Fri, Mar 30, 2012 at 2:13 PM, Dawid Weiss
>> <da...@cs.put.poznan.pl> wrote:
>>> I'm doing this on trunk:
>>>
>>>
>>>        WFSTCompletionLookup cl = new WFSTCompletionLookup();
>>>        TermFreq [] input = new TermFreq [] {
>>>            new TermFreq("cat",   0),
>>>            new TermFreq("chat",  2),
>>>            new TermFreq("fat",   3),
>>>            new TermFreq("feat",  1),
>>>            new TermFreq("sea",   0),
>>>            new TermFreq("seat",  3),
>>>            new TermFreq("swat",  0),
>>>            new TermFreq("sweat", 3),
>>>        };
>>>        cl.build(new TermFreqArrayIterator(input));
>>>
>>>        StringWriter sw = new StringWriter();
>>>        Field f = cl.getClass().getDeclaredField("fst");
>>>        f.setAccessible(true);
>>>        FST<?> fst = (FST<?>) f.get(cl);
>>>        Util.toDot(fst, sw, false, true);
>>>
>>>        for (TermFreq tf : input) {
>>>            List<LookupResult> lookup =
>>> cl.lookup(tf.term.utf8ToString(), true, 1);
>>>            System.out.println(lookup.get(0));
>>>        }
>>>
>>>        System.out.println(sw.toString());
>>>
>>> The output is attached. How come the first arcs have such high outputs?
>>>
>>> Dawid
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>
>>
>> --
>> lucidimagination.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>



-- 
lucidimagination.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Fst - arc output?

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.

Clear, I didn't think of that and had positive weights in my mind
only. Thanks for explanation.

Dawid

On Fri, Mar 30, 2012 at 8:17 PM, Robert Muir <rc...@gmail.com> wrote:
> I think its listed in the documentation (at least the jira issue!).
> Now that shortestPaths() is generified to any outputs, please feel
> free to find a more compact/faster output!!!!!!!
>
> Basically I experimented with tons of ideas (but the simplest ugliest
> hack was most successful it seemed?)
>
> we take the cost, and encode it as Integer.MAX_VALUE - cost to turn it
> into a "weight".
>
> This is bad because its huge vints, but other approaches i tried were
> sketchier, requiring encoding of negatives and such (see the recent
> JIRA issue about that which i tripped in the process!)
>
> I think i wrote 5 or 6 outputs impls trying to make this cleaner
> before just deciding to commit the simple approach...
>
> On Fri, Mar 30, 2012 at 2:13 PM, Dawid Weiss
> <da...@cs.put.poznan.pl> wrote:
>> I'm doing this on trunk:
>>
>>
>>        WFSTCompletionLookup cl = new WFSTCompletionLookup();
>>        TermFreq [] input = new TermFreq [] {
>>            new TermFreq("cat",   0),
>>            new TermFreq("chat",  2),
>>            new TermFreq("fat",   3),
>>            new TermFreq("feat",  1),
>>            new TermFreq("sea",   0),
>>            new TermFreq("seat",  3),
>>            new TermFreq("swat",  0),
>>            new TermFreq("sweat", 3),
>>        };
>>        cl.build(new TermFreqArrayIterator(input));
>>
>>        StringWriter sw = new StringWriter();
>>        Field f = cl.getClass().getDeclaredField("fst");
>>        f.setAccessible(true);
>>        FST<?> fst = (FST<?>) f.get(cl);
>>        Util.toDot(fst, sw, false, true);
>>
>>        for (TermFreq tf : input) {
>>            List<LookupResult> lookup =
>> cl.lookup(tf.term.utf8ToString(), true, 1);
>>            System.out.println(lookup.get(0));
>>        }
>>
>>        System.out.println(sw.toString());
>>
>> The output is attached. How come the first arcs have such high outputs?
>>
>> Dawid
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
>
> --
> lucidimagination.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Fst - arc output?

Posted by Robert Muir <rc...@gmail.com>.

I think its listed in the documentation (at least the jira issue!).
Now that shortestPaths() is generified to any outputs, please feel
free to find a more compact/faster output!!!!!!!

Basically I experimented with tons of ideas (but the simplest ugliest
hack was most successful it seemed?)

we take the cost, and encode it as Integer.MAX_VALUE - cost to turn it
into a "weight".

This is bad because its huge vints, but other approaches i tried were
sketchier, requiring encoding of negatives and such (see the recent
JIRA issue about that which i tripped in the process!)

I think i wrote 5 or 6 outputs impls trying to make this cleaner
before just deciding to commit the simple approach...

On Fri, Mar 30, 2012 at 2:13 PM, Dawid Weiss
<da...@cs.put.poznan.pl> wrote:
> I'm doing this on trunk:
>
>
>        WFSTCompletionLookup cl = new WFSTCompletionLookup();
>        TermFreq [] input = new TermFreq [] {
>            new TermFreq("cat",   0),
>            new TermFreq("chat",  2),
>            new TermFreq("fat",   3),
>            new TermFreq("feat",  1),
>            new TermFreq("sea",   0),
>            new TermFreq("seat",  3),
>            new TermFreq("swat",  0),
>            new TermFreq("sweat", 3),
>        };
>        cl.build(new TermFreqArrayIterator(input));
>
>        StringWriter sw = new StringWriter();
>        Field f = cl.getClass().getDeclaredField("fst");
>        f.setAccessible(true);
>        FST<?> fst = (FST<?>) f.get(cl);
>        Util.toDot(fst, sw, false, true);
>
>        for (TermFreq tf : input) {
>            List<LookupResult> lookup =
> cl.lookup(tf.term.utf8ToString(), true, 1);
>            System.out.println(lookup.get(0));
>        }
>
>        System.out.println(sw.toString());
>
> The output is attached. How come the first arcs have such high outputs?
>
> Dawid
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org



-- 
lucidimagination.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org