Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2010/12/03 11:31:10 UTC

[jira] Created: (LUCENE-2792) Add a simple FST impl to Lucene

Add a simple FST impl to Lucene
-------------------------------

                 Key: LUCENE-2792
                 URL: https://issues.apache.org/jira/browse/LUCENE-2792
             Project: Lucene - Java
          Issue Type: New Feature
          Components: Index
            Reporter: Michael McCandless
            Assignee: Michael McCandless
             Fix For: 4.0



I implemented the algo described at
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698 for
incrementally building a finite state transducer (FST) from sorted
inputs.

This is not a fully general FST impl -- it's only able to build up an
FST incrementally from input/output pairs that are pre-sorted.

Currently the inputs are BytesRefs, and the outputs are pluggable --
NoOutputs gets you a simple FSA, PositiveIntOutputs maps to a long,
ByteSequenceOutput maps to a BytesRef.
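
For illustration, here is a minimal sketch of building and querying such
an FST with a long output per term (the class names -- Builder,
PositiveIntOutputs, Util -- follow the FST API as it later shipped in
Lucene 4.x and are assumptions relative to this exact patch):

{code:java}
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRef;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

// Map each input term to a long; inputs must be added in sorted order.
PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE1, outputs);

String[] terms = {"cat", "dog", "dogs"};   // already sorted
long[] values = {5L, 7L, 12L};
IntsRef scratch = new IntsRef();
for (int i = 0; i < terms.length; i++) {
  builder.add(Util.toIntsRef(new BytesRef(terms[i]), scratch), values[i]);
}
FST<Long> fst = builder.finish();

Long v = Util.get(fst, new BytesRef("dog"));   // -> 7
{code}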

The implementation has a low memory overhead, so that it can handle a
fairly large set of terms.  For example, it can build the FSA for the
9.8M terms from a 10M document wikipedia index in ~8 seconds (on
beast), using ~256 MB peak RAM, resulting in an FSA that's ~60 MB.

It packs the FST as-it-builds into a compact byte[], and then exposes
the API to read nodes/arcs directly from the byte[].  The FST can be
quickly saved/loaded to/from a Directory since it's just a big byte[].
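
A hedged sketch of that save/load round trip (assuming a Directory named
"dir" and the FST built above; method names such as FST.save and the
IOContext argument follow later 4.x releases and may not match this patch):

{code:java}
// Save: the FST is one contiguous byte[], so this is a single sequential write.
IndexOutput out = dir.createOutput("terms.fst", IOContext.DEFAULT);
fst.save(out);
out.close();

// Load: hand back the same Outputs implementation that built it.
IndexInput in = dir.openInput("terms.fst", IOContext.DEFAULT);
FST<Long> loaded = new FST<Long>(in, PositiveIntOutputs.getSingleton());
in.close();
{code}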

The format is similar to what Morfologik uses
(http://sourceforge.net/projects/morfologik/).

I think there are a number of possible places we can use this in
Lucene.  For example, I think many apps could hold the entire terms
dict in RAM, either at the multi-reader level or maybe per-segment
(mapping to file offset or to something else custom to the app), which
could be a good speedup for certain MTQs (though, because the
format is packed into a byte[], there is a decode cost when visiting
arcs).

The builder can also prune as it goes, so you get a prefix trie pruned
according to how many terms run through the nodes, which makes it
faster and even less memory-hungry.  This may be useful as a
replacement for our current binary-search terms index since it can
achieve higher term density for the same RAM consumption as our
current index.

As an initial usage to make sure this is exercised, I cut over the
SimpleText codec, which currently fully loads all terms into a
TreeMap (and has caused intermittent OOME in some tests), to use an FST
instead.  SimpleText uses a PairOutputs, which is able to "pair up" any
two other outputs, since it needs to map each input term to an int
docFreq and a long filePosition.
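
For the pairing, a hedged sketch (PairOutputs/newPair as the class later
appeared in Lucene; treat the names as assumptions relative to this patch):

{code:java}
// Each term maps to a pair: (docFreq, filePosition), both carried as longs.
PositiveIntOutputs longs = PositiveIntOutputs.getSingleton();
PairOutputs<Long,Long> pairOutputs = new PairOutputs<Long,Long>(longs, longs);
Builder<PairOutputs.Pair<Long,Long>> b =
    new Builder<PairOutputs.Pair<Long,Long>>(FST.INPUT_TYPE.BYTE1, pairOutputs);

// For each term, in sorted order:
IntsRef scratch = new IntsRef();
b.add(Util.toIntsRef(new BytesRef("hello"), scratch),
      pairOutputs.newPair(42L /* docFreq */, 123456L /* filePosition */));
FST<PairOutputs.Pair<Long,Long>> fst = b.finish();
{code}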

All tests pass w/ SimpleText forced codec, and I think this is
committable except I'd love to get some help w/ the generics
(confession to the policeman: I had to add
@SuppressWarnings({"unchecked"}) all over!!).  Ideally an FST is
parameterized by its output type (Integer, BytesRef, etc.).

I even added a new @nightly test that makes a largeish set of random
terms and tests the resulting FST on different outputs :)

I think it would also be easy to make a variant that uses char[]
instead of byte[] as its inputs, so we could e.g. use this during analysis
(Robert's idea).  It'd already be easy to have a CharSequence
output type since the outputs are pluggable.

Dawid Weiss (author of HPPC -- http://labs.carrotsearch.com/hppc.html -- and
Morfologik -- http://sourceforge.net/projects/morfologik/)
was very helpful iterating with me on this (thank you!).



[jira] Commented: (LUCENE-2792) Add a simple FST impl to Lucene

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970697#action_12970697 ] 

Robert Muir commented on LUCENE-2792:
-------------------------------------

In revision 1045012, I reverted the generic grow() for the classes that are unrelated to this issue.

It's still being used in 3 places by the FST stuff in this patch; this should be reviewed separately if these areas are performance-sensitive.




[jira] Commented: (LUCENE-2792) Add a simple FST impl to Lucene

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966597#action_12966597 ] 

Jason Rutherglen commented on LUCENE-2792:
------------------------------------------

Can the FST be made concurrent for use in LUCENE-2312 as the terms dict implementation?



[jira] Updated: (LUCENE-2792) Add a simple FST impl to Lucene

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2792:
---------------------------------------

    Attachment: LUCENE-2792.patch

Patch.



[jira] Commented: (LUCENE-2792) Add a simple FST impl to Lucene

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970251#action_12970251 ] 

Jason Rutherglen commented on LUCENE-2792:
------------------------------------------

{quote}Just an idea, maybe you could use something like ConcurrentSkipListMap for this
its only in java 6 but you can poach from apache harmony...
{quote}

Right, I'm planning on using CSLM and sorted int[]s.  Aren't we on Java 6 for trunk?  However, yes, it'd be nice to support JDK 1.5; I'll look into whether Harmony's CSLM can be pulled out.
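
(For reference, a minimal sketch of the ConcurrentSkipListMap idea -- plain
java.util.concurrent, nothing from the patch; the terms-dict wiring here is
hypothetical:)

{code:java}
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Writers add terms while readers see a consistent, sorted, lock-free view.
ConcurrentNavigableMap<String,int[]> termsDict =
    new ConcurrentSkipListMap<String,int[]>();
termsDict.put("apache", new int[] {1, 4, 9});   // term -> sorted doc ids
termsDict.put("lucene", new int[] {2, 3});

// Prefix scan, safe to run concurrently with the writers:
for (String term : termsDict.subMap("ap", "aq").keySet()) {
  // seek/enumerate terms starting with "ap"
}
{code}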



[jira] Commented: (LUCENE-2792) Add a simple FST impl to Lucene

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966691#action_12966691 ] 

Michael Busch commented on LUCENE-2792:
---------------------------------------

Cool stuff, Mike!

Could we use this for more efficient wildcard search?  E.g. could we add posting lists for inner nodes to the index?



[jira] Updated: (LUCENE-2792) Add a simple FST impl to Lucene

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2792:
---------------------------------------

    Attachment: LUCENE-2792.patch

New patch, after substantial iterations w/ the Generics Policeman (thanks!) using Mercurial.

It's much cleaner now -- the classes are parameterized by the output type of the FST.

I will commit soon...



[jira] Updated: (LUCENE-2792) Add a simple FST impl to Lucene

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2792:
--------------------------------

    Attachment: SimpleBench.java

Uwe, unfortunately even without the getComponentType it's still much worse.
I'm attaching my benchmark (times in milliseconds):
method1: new + arraycopy
method2: original ArrayUtil.grow
method3: ArrayUtil.grow without getComponentType

-client 
method1: 221
method2: 2672
method3: 1731
method1: 225
method2: 2674
method3: 1727
method1: 212
method2: 2697
method3: 1698

-server
method1: 160
method2: 166
method3: 167
method1: 159
method2: 168
method3: 170
method1: 162
method2: 172
method3: 171
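
(SimpleBench.java is only attached, not inlined above; the following is a
hypothetical reconstruction of the three strategies being timed, not the
actual benchmark code:)

{code:java}
import java.lang.reflect.Array;
import java.util.Arrays;

class GrowStrategies {
  // method1: plain new + System.arraycopy with a statically known type
  static String[] method1(String[] src, int newSize) {
    String[] dst = new String[newSize];
    System.arraycopy(src, 0, dst, 0, src.length);
    return dst;
  }

  // method2: generic grow that reflects on the component type on every call
  @SuppressWarnings("unchecked")
  static <T> T[] method2(T[] src, int newSize) {
    T[] dst = (T[]) Array.newInstance(src.getClass().getComponentType(), newSize);
    System.arraycopy(src, 0, dst, 0, src.length);
    return dst;
  }

  // method3: generic grow without an explicit getComponentType call
  // (Arrays.copyOf performs the same reflection internally)
  static <T> T[] method3(T[] src, int newSize) {
    return Arrays.copyOf(src, newSize);
  }
}
{code}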




[jira] Updated: (LUCENE-2792) Add a simple FST impl to Lucene

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2792:
----------------------------------

    Attachment: LUCENE-2792.patch

Mike: Here are some further small cleanups in the grow() parts.



[jira] Commented: (LUCENE-2792) Add a simple FST impl to Lucene

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970932#action_12970932 ] 

Michael McCandless commented on LUCENE-2792:
--------------------------------------------

I think we should remove this generic ArrayUtil.grow?  It's too dangerous that we'll accidentally use it in a hotspot some time going forward...?



[jira] Updated: (LUCENE-2792) Add a simple FST impl to Lucene

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2792:
---------------------------------------

    Attachment: FSTExample.png

I'm attaching an example FST (PNG image) so we can visualize what an FST does.

Really an FST is like a SortedMap<BytesRef,X>, where X (the "outputs") is pluggable.

This particular FST maps a canned set of words to the int ord (0, 1, 2, ...) of the word.

Arcs that are "final" end in a T instead of an arrow, which means if you stop at that arc the input you've traversed is "accepted".  (The blue arcs just indicate arcs that were compressed with a NEXT flag when stored in the byte[].)

To get the total output you sum the outputs you encounter.  For example, the input "stop" hits output 4 on the 's' and output 1 on the 'o', so its ord is 5.
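
(To make the "sum the outputs" rule concrete -- a one-liner using the lookup
helper as it later appeared in Lucene 4.x; the names are assumptions relative
to this patch:)

{code:java}
// Walks the arcs for "stop", summing the outputs along the way (4 on 's',
// then 1 on 'o'), and returns 5; returns null if the input isn't accepted.
Long ord = Util.get(fst, new BytesRef("stop"));
{code}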



[jira] Commented: (LUCENE-2792) Add a simple FST impl to Lucene

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970447#action_12970447 ] 

Uwe Schindler commented on LUCENE-2792:
---------------------------------------

Hi Mike,
my problem is that I have no time to look into it closely and I don't understand the whole thing :-) (you cannot always understand everything *g*).  So I have no idea how the generics *should* look.  I don't even understand all the SuppressWarnings currently in the code, because in my opinion generics warnings cannot occur at all of those places.  From a first look it seems that a few places use an untyped Output, but for that the methods need to be generified (a generic <T> before the return type, like in AttributeSource).  The "copy" in one of the TODOs cannot be avoided by generics, because generics use type erasure, so it would not help in that method (it only means you may not need to add so many casts, which the compiler would otherwise add for you).
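
(For illustration, the kind of signature being described -- a type parameter
declared before the return type, as AttributeSource#addAttribute does; the
FST-flavored method below is a hypothetical example, not code from the patch:)

{code:java}
// AttributeSource-style: a type parameter before the return type means the
// caller gets a typed result with no cast and no unchecked warning:
//   public <T extends Attribute> T addAttribute(Class<T> attClass)

// Hypothetically applied to the FST code, an accessor that currently returns
// an untyped Output could instead be declared as:
static <T> T outputFor(FST<T> fst, FST.Arc<T> arc) {
  return arc.output;   // typed by the FST's own parameter; no cast needed
}
{code}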



[jira] Updated: (LUCENE-2792) Add a simple FST impl to Lucene

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2792:
----------------------------------

    Attachment: LUCENE-2792-ArrayUtil-grow.patch

Here is the patch, just to test performance. Robert?

Silly: it again creates stupid unchecked warnings in the FST classes (this was the main reason for the grow method).  If this patch helps I will try to improve it, maybe by relaxing the generics in ArrayUtil.



[jira] Resolved: (LUCENE-2792) Add a simple FST impl to Lucene

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-2792.
----------------------------------------

    Resolution: Fixed

Thanks Uwe!



[jira] Commented: (LUCENE-2792) Add a simple FST impl to Lucene

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970461#action_12970461 ] 

Michael McCandless commented on LUCENE-2792:
--------------------------------------------

This looks great Uwe, thanks!



[jira] Updated: (LUCENE-2792) Add a simple FST impl to Lucene

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2792:
----------------------------------

    Attachment:     (was: LUCENE-2792.patch)



[jira] Commented: (LUCENE-2792) Add a simple FST impl to Lucene

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966600#action_12966600 ] 

Michael McCandless commented on LUCENE-2792:
--------------------------------------------

bq. Can the FST be made concurrent for use in LUCENE-2312 as the terms dict implementation?

Maybe?  Ie, the write ops are things like "append a new arc to this node", so we could eg use a concurrent linked list per node?

Also, the impl here isn't helpful for the RT case, since it requires that you add the terms in sorted order.  But there are non-sorted-order variants too; however, they would be substantially more RAM-consuming, since they'd use objects for nodes/arcs (this impl doesn't).
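Just to make the idea concrete, here is a minimal sketch of an object-based node with a lock-free arc list, using only the JDK; the class and field names are hypothetical and this is not part of the attached patch:

{code}
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical object-based node for a concurrent, unsorted-input FST variant.
// Unlike the packed byte[] impl on this issue, every node/arc is a real object,
// which is why such a variant would be substantially more RAM-consuming.
class ConcurrentNode {

  // Arcs are only ever appended, so a lock-free queue per node is enough
  // for the "append a new arc to this node" write op.
  private final Queue<Arc> arcs = new ConcurrentLinkedQueue<Arc>();

  void addArc(byte label, ConcurrentNode target) {
    arcs.add(new Arc(label, target));   // immediately visible to concurrent readers
  }

  ConcurrentNode findTarget(byte label) {
    for (Arc arc : arcs) {              // weakly consistent iteration is fine here
      if (arc.label == label) {
        return arc.target;
      }
    }
    return null;
  }

  static final class Arc {
    final byte label;
    final ConcurrentNode target;

    Arc(byte label, ConcurrentNode target) {
      this.label = label;
      this.target = target;
    }
  }
}
{code}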



[jira] Commented: (LUCENE-2792) Add a simple FST impl to Lucene

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966604#action_12966604 ] 

Robert Muir commented on LUCENE-2792:
-------------------------------------

bq. Can the FST be made concurrent for use in LUCENE-2312 as the terms dict implementation?

Just an idea: maybe you could use something like ConcurrentSkipListMap for this
until Mike makes "fst 2.0" or something?

It's only in Java 6, but you can poach it from Apache Harmony...
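As a rough, plain-JDK illustration of that stop-gap (String term keys and a long file pointer are just assumptions to keep the sketch self-contained; the real terms dict would map terms to richer per-term metadata):

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Sketch of a sorted, concurrent in-memory terms dict as a stop-gap until a
// concurrent FST exists.  ConcurrentSkipListMap keeps its keys ordered, so
// ordered term iteration still works while writer threads keep adding terms.
public class SkipListTermsDict {

  private final ConcurrentSkipListMap<String, Long> terms =
      new ConcurrentSkipListMap<String, Long>();

  public void add(String term, long termsFilePointer) {
    terms.put(term, termsFilePointer);        // safe from multiple writer threads
  }

  public Long lookup(String term) {
    return terms.get(term);
  }

  // Ordered access, e.g. for a seekCeil-style operation:
  public Map.Entry<String, Long> ceiling(String term) {
    return terms.ceilingEntry(term);
  }
}
{code}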




[jira] Commented: (LUCENE-2792) Add a simple FST impl to Lucene

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970941#action_12970941 ] 

Uwe Schindler commented on LUCENE-2792:
---------------------------------------

Give me a few days to think about it. We still use it in FST, but mostly because the generics are ugly otherwise. Maybe I'll have a good idea. But we could revert it in 3.x? I will do that if you also think so.



[jira] Commented: (LUCENE-2792) Add a simple FST impl to Lucene

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970738#action_12970738 ] 

Uwe Schindler commented on LUCENE-2792:
---------------------------------------

Thanks, Robert, for the test with -client. We all use -server and did not see a difference (I ran tests yesterday with Java 6).

This may also be a reason for the slow performance of Collections.sort() without our improvements, because AbstractCollection.toArray(T[]) uses exactly this code to create the array instance (in fact, I copied the resize code from Harmony).

bq. Here's the link for reference: http://bugs.sun.com/view_bug.do?bug_id=6525802

As far as I can see, the slow part is a.getClass().getComponentType(). What do you think about the following method signature, which would also make it 100% type-safe and avoid the negative performance impact?

{code}
public static <T> T[] grow(T[] array, Class<T> componentType, int size)
{code}

In code you would use:

{code}
String[] sa=...
sa = ArrayUtil.grow(sa, String.class, 20);
{code}

I can supply a patch for testing. Array.newInstance is one of the most-used methods in lots of code, and the actual growing happens only seldom anyway (because of oversize()).
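To illustrate the idea, here is only a sketch of the proposed signature, not the attached patch; the doubling growth policy below stands in for whatever oversize() would actually compute:

{code}
import java.lang.reflect.Array;

// Sketch of the proposed type-safe grow: the caller passes the component type
// explicitly, so no a.getClass().getComponentType() call is needed, and the
// single unchecked cast is confined to this one method.
public final class GrowSketch {

  @SuppressWarnings("unchecked")
  public static <T> T[] grow(T[] array, Class<T> componentType, int minSize) {
    if (array.length >= minSize) {
      return array;                           // already big enough
    }
    final int newSize = Math.max(minSize, array.length * 2);
    final T[] newArray = (T[]) Array.newInstance(componentType, newSize);
    System.arraycopy(array, 0, newArray, 0, array.length);
    return newArray;
  }
}
{code}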



[jira] Resolved: (LUCENE-2792) Add a simple FST impl to Lucene

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler resolved LUCENE-2792.
-----------------------------------

    Resolution: Fixed

Reverted generic ArrayUtil.grow(T[]) in trunk and 3.x (rev. 1045319).



[jira] Updated: (LUCENE-2792) Add a simple FST impl to Lucene

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2792:
----------------------------------

    Attachment: LUCENE-2792.patch

The last patch missed some generics in NoOutputs and PositiveInts.



[jira] Commented: (LUCENE-2792) Add a simple FST impl to Lucene

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970604#action_12970604 ] 

Uwe Schindler commented on LUCENE-2792:
---------------------------------------

I will merge the ArrayUtil improvements to 3.x after the commit (it's minor, but very useful).



[jira] Commented: (LUCENE-2792) Add a simple FST impl to Lucene

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966817#action_12966817 ] 

Michael McCandless commented on LUCENE-2792:
--------------------------------------------

bq. Could we use this for more efficient wildcard search? 

Possibly -- we need to explore this.  It remains to be seen if the CPU cost of traversing the FST is offset by the fact that the FST can keep more (possibly, all) terms in memory.

Also, we'd need to figure out whether this'd work at the top-level reader (eg as an initial filter to enumerate all terms matching the MTQ) or per-segment as the terms index (which'd be much more RAM costly).

bq. E.g. could we add posting lists for inner nodes to the index?

Hmm... what are the "inner nodes"?

I guess we could do something like the Pulsing codec, where postings for low doc-freq terms are simply stored in an FST, and then only the higher-freq terms remain on disk (optionally, of course, since whether this is feasible depends on the app).  Then eg lookups against a primary-key field would be entirely in RAM.
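As a very rough sketch of that lookup path (all names here are hypothetical; in the real thing the in-memory side would be the FST itself, with the postings encoded in its outputs, rather than a HashMap):

{code}
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical split lookup: postings for low doc-freq terms (eg primary keys,
// which occur in exactly one doc) are kept entirely in RAM, while everything
// else falls through to the on-disk terms dict.
class SplitTermsLookup {

  private final Map<String, int[]> inMemoryPostings = new HashMap<String, int[]>();
  private final OnDiskTerms onDisk;           // hypothetical on-disk terms dict

  SplitTermsLookup(OnDiskTerms onDisk) {
    this.onDisk = onDisk;
  }

  void addLowFreqTerm(String term, int[] docIDs) {
    inMemoryPostings.put(term, docIDs);
  }

  int[] postings(String term) throws IOException {
    final int[] docs = inMemoryPostings.get(term);
    if (docs != null) {
      return docs;                            // primary-key case: pure RAM
    }
    return onDisk.postings(term);             // high-freq terms stay on disk
  }

  interface OnDiskTerms {
    int[] postings(String term) throws IOException;
  }
}
{code}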



[jira] Reopened: (LUCENE-2792) Add a simple FST impl to Lucene

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir reopened LUCENE-2792:
---------------------------------


Reopening: the generic ArrayUtil.grow here (with getComponentType, etc.) is more than 10x slower than the old code on -client.
Yes, it's the same performance on -server, but not everyone runs with that.




[jira] Commented: (LUCENE-2792) Add a simple FST impl to Lucene

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970700#action_12970700 ] 

Robert Muir commented on LUCENE-2792:
-------------------------------------

Here's the link for reference: http://bugs.sun.com/view_bug.do?bug_id=6525802

This getComponentType is intrinsic only on -server; on -client it is a plain native call (slow).

Personally I think we should remove this generic grow() completely to avoid performance issues (I kept it, but it's only used by the FST).
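
A rough way to see the gap for yourself (just an illustrative timing loop, not the attached SimpleBench.java): run it once with -client and once with -server, and compare the reflective copy against the typed one.

{code:java}
import java.lang.reflect.Array;
import java.util.Arrays;

public class GrowBench {
  public static void main(String[] args) {
    Integer[] src = new Integer[16];
    int iters = 5000000;

    long t0 = System.nanoTime();
    for (int i = 0; i < iters; i++) {
      // Reflective path: getComponentType() + Array.newInstance()
      Object[] copy = (Object[]) Array.newInstance(src.getClass().getComponentType(), src.length * 2);
      System.arraycopy(src, 0, copy, 0, src.length);
    }
    long t1 = System.nanoTime();
    for (int i = 0; i < iters; i++) {
      // Typed path: plain allocation + copy
      Integer[] copy = Arrays.copyOf(src, src.length * 2);
    }
    long t2 = System.nanoTime();

    System.out.println("reflective: " + ((t1 - t0) / 1000000) + " ms");
    System.out.println("typed:      " + ((t2 - t1) / 1000000) + " ms");
  }
}
{code}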


[jira] Commented: (LUCENE-2792) Add a simple FST impl to Lucene

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966510#action_12966510 ] 

Robert Muir commented on LUCENE-2792:
-------------------------------------

This sounds awesome!

It would be neat to write intersect(FST, Automaton), and somehow optionally use it if available.
Such a thing might be cleaner now that MTQ.getTermsEnum takes a Terms structure.
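
Purely as a strawman (nothing like this is in the patch, and the real thing would more likely live inside the terms dictionary / MultiTermQuery plumbing), the shape of such an intersection might look roughly like this: walk the FST and the automaton in lock step and only visit accepted paths.

{code:java}
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.fst.FST;

// Hypothetical interface only; the names and signatures are made up for discussion.
public interface FSTAutomatonIntersector<T> {

  /** Visits each input/output pair whose input string the automaton accepts. */
  void intersect(FST<T> fst, Automaton automaton, Visitor<T> visitor);

  interface Visitor<T> {
    void accept(BytesRef input, T output);
  }
}
{code}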


[jira] Updated: (LUCENE-2792) Add a simple FST impl to Lucene

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2792:
---------------------------------------

    Attachment: LUCENE-2792.patch

New patch!  I think it's ready to commit.

I switched the core FST input from byte to int so that the FST can now map any IntsRef to any output.  I also added sugar for adding e.g. CharSequence or char[] as UTF-32 to the FST, and for running the FST (looking up the output for a given input).  I made separate BytesRefFSTEnum and IntsRefFSTEnum.  I changed the "nocommit switch to generics" to TODOs :), but it'd be nice to somehow cut over to generics for the FST's output type.

I also added a test case that indexes docs from the new test line docs file, builds an FST, then verifies that the resulting TermsEnum and BytesRefFSTEnum behave the same.
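
For reference, a minimal sketch of what building and running such an FST looks like. It is based on the FST API as it later settled (Builder, PositiveIntOutputs, Util), so the exact signatures in this particular patch may differ; the important constraint is that inputs must be added in sorted order.

{code:java}
import java.io.IOException;

import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRef;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

public class FSTSketch {
  public static void main(String[] args) throws IOException {
    // Inputs must be pre-sorted; each term maps to a long output.
    String[] terms = {"cat", "dog", "dogs"};
    long[] values = {5, 7, 12};

    PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
    Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE1, outputs);

    IntsRef scratch = new IntsRef();
    for (int i = 0; i < terms.length; i++) {
      // The core input is an IntsRef now; BytesRef inputs are converted.
      builder.add(Util.toIntsRef(new BytesRef(terms[i]), scratch), values[i]);
    }
    FST<Long> fst = builder.finish();

    // "Running" the FST: look up the output for a single input.
    Long value = Util.get(fst, new BytesRef("dog"));
    System.out.println(value);  // expected: 7
  }
}
{code}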


[jira] Commented: (LUCENE-2792) Add a simple FST impl to Lucene

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970950#action_12970950 ] 

Michael McCandless commented on LUCENE-2792:
--------------------------------------------

It would be best if we could keep the API but with no trap for -client Java users.

But I think we should at least revert the generic grow in FST's Builder.UnCompiledNode's arc reallocation; that's a hot spot...
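
The kind of change being suggested, sketched on a simplified stand-in (the field and method names here are illustrative, not the patch's actual Builder.UnCompiledNode code): grow the arcs array with a plain allocation and copy, so the hot path never touches reflection and stays cheap on -client as well.

{code:java}
import java.util.Arrays;

// Simplified stand-in for an FST builder node; not the real UnCompiledNode.
class NodeSketch {

  static final class Arc {
    int label;
    Object output;
  }

  private Arc[] arcs = new Arc[4];
  private int numArcs;

  void addArc(int label, Object output) {
    if (numArcs == arcs.length) {
      // Type-specific growth: no getComponentType(), no Array.newInstance().
      arcs = Arrays.copyOf(arcs, arcs.length * 2);
    }
    Arc arc = new Arc();
    arc.label = label;
    arc.output = output;
    arcs[numArcs++] = arc;
  }
}
{code}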


[jira] Updated: (LUCENE-2792) Add a simple FST impl to Lucene

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2792:
--------------------------------

    Attachment: LUCENE-2792.patch

As a start, I generified the PairOutputs; I'll try to help with the other classes later today.


[jira] Updated: (LUCENE-2792) Add a simple FST impl to Lucene

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2792:
----------------------------------

    Attachment: LUCENE-2792.patch

Here's a first generified patch. There are several problems:

- FST<T> as a parameter to Outputs.write() and Outputs.read() breaks generics in PairOutputs (I added a nocommit; this is really why we have the generics -- without them this bug would never have been visible). Mike said that IndexInput and IndexOutput should really be used there instead.
- The test case is broken, as it dynamically casts types depending on int constants. The test should be rewritten to use typed inner classes (accepting some code duplication).

I may have missed more generics violations, but it now compiles correctly. javac does not detect all missing type parameters, so I have to review again, but for now I want to post the patch.
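
For reference, a small sketch of how the generified PairOutputs ends up being used (based on how the API later settled -- PairOutputs, its nested Pair class and newPair() -- so the exact shape in this patch may differ), pairing a docFreq output with a file-pointer output the way SimpleText needs:

{code:java}
import java.io.IOException;

import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRef;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PairOutputs;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

public class PairOutputsSketch {
  public static void main(String[] args) throws IOException {
    PositiveIntOutputs longs = PositiveIntOutputs.getSingleton();
    // Pair up two outputs per term: (docFreq, filePointer), both as longs here.
    PairOutputs<Long, Long> pairs = new PairOutputs<Long, Long>(longs, longs);

    Builder<PairOutputs.Pair<Long, Long>> builder =
        new Builder<PairOutputs.Pair<Long, Long>>(FST.INPUT_TYPE.BYTE1, pairs);

    IntsRef scratch = new IntsRef();
    // Terms must be added in sorted order.
    builder.add(Util.toIntsRef(new BytesRef("hello"), scratch), pairs.newPair(3L, 1024L));
    builder.add(Util.toIntsRef(new BytesRef("world"), scratch), pairs.newPair(7L, 2048L));
    FST<PairOutputs.Pair<Long, Long>> fst = builder.finish();

    PairOutputs.Pair<Long, Long> result = Util.get(fst, new BytesRef("world"));
    System.out.println(result.output1 + " " + result.output2);  // 7 2048
  }
}
{code}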
