Posted to java-user@lucene.apache.org by José Tomás Atria <jt...@gmail.com> on 2016/03/24 00:09:21 UTC

Persistence/Serialization of Automaton

Hello!

Is it possible to serialize Lucene's Automata? I see that the javadoc for
the original BRICS package indicates that instances of Automaton implement
Serializable, but this is not the case with the Automaton class in Lucene 5+.

I assume it is possible, considering that an FSA is basically just a set of
states and transitions, but how would I go about (1) extracting that data
from an instance of Automaton and (2) recreating the original automaton
from the states and transitions obtained from a live instance?

Alternatively, maybe there is some other place where this is implemented?
How can I persist Lucene's automata?

thanks,
jta

-- 
entia non sunt multiplicanda praeter necessitatem

Re: Persistence/Serialization of Automaton

Posted by Erick Erickson <er...@gmail.com>.
BTW, anything Mike says is _vastly_ more accurate than anything I can
come up with, he...er...wrote much of the code.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Persistence/Serialization of Automaton

Posted by José Tomás Atria <jt...@gmail.com>.
Ah, awesome. I'll go read the code and see what I come up with. Thanks for
the help :)

jta



-- 
entia non sunt multiplicanda praeter necessitatem

Re: Persistence/Serialization of Automaton

Posted by Michael McCandless <lu...@mikemccandless.com>.
You don't need nextState/nextTransition for serializing, unless you
want to unserialize and then "resume" building an automaton.

Those are only used while building an automaton.
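
A rough analogy for why build-time cursors need not be persisted, assuming the arrays are over-allocated and a "next" field marks the first unused slot. GrowableIntArray and its methods are hypothetical names for illustration, not Lucene's code:

```java
import java.util.Arrays;

// Hypothetical builder: 'next' plays the role a nextTransition-style field
// would play during construction (an assumption, not Lucene's implementation).
class GrowableIntArray {
    private int[] data = new int[4]; // capacity, usually larger than what is used
    private int next = 0;            // write cursor: index of the first unused slot

    void add(int value) {
        if (next == data.length) {
            data = Arrays.copyOf(data, data.length * 2); // grow when full
        }
        data[next++] = value;
    }

    // What a serializer needs: only the filled prefix, never the cursor itself.
    int[] snapshot() {
        return Arrays.copyOf(data, next);
    }

    int capacity() {
        return data.length;
    }
}
```

Once the filled prefix is written out, the cursor can be recomputed from its length, so it only matters if building is to be resumed on the live arrays.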

Mike McCandless

http://blog.mikemccandless.com




Re: Persistence/Serialization of Automaton

Posted by José Tomás Atria <jt...@gmail.com>.
Hi Mike,

Thanks for your reply. I had assumed, as you mention, that automata are
just a couple of int arrays, so I looked at the code for
Automaton.copy( Automaton other ), and that is in fact what it copies
from the other Automaton:
int[] states
int[] transitions

But I got confused, because the copying code also refers to what
look like state variables in the source object:
int nextState
int nextTransition

So I'm not sure whether it's possible to, for example, reconstruct an
automaton merely from the states and transitions int[] arrays, or whether
I also need to pay attention to the nextState and nextTransition values.
I have no idea what they are, whether they are immutable, etc. I have been
using factory methods to construct all of my automata from strings, so I
don't understand what these values mean, or whether they are relevant to
the automaton's _definition_ as opposed to its construction or execution.

Thanks!
jta





-- 
entia non sunt multiplicanda praeter necessitatem

Re: Persistence/Serialization of Automaton

Posted by Michael McCandless <lu...@mikemccandless.com>.
Lucene no longer has Serializable on its classes: the
cross-java-version implications are too difficult.  So we expect/rely
on the user layer above Lucene to handle any serialization needs.

That said, serializing an automaton should be quite simple since the
data structure is just int node IDs, marked as accept nodes or not,
with connecting transitions that have min/max labels.  You could write
that to your own byte stream and re-build the automaton on
deserializing.
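
A minimal sketch of such a byte layout, assuming a stand-in class rather than Lucene's own. SimpleAutomaton, toBytes and fromBytes are hypothetical names. With a real Automaton the same data should be reachable through getNumStates(), isAccept(), and initTransition()/getNextTransition() with a Transition object, and rebuilt through createState(), setAccept(), addTransition() and finishState(), but check those against the javadoc for your version:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an automaton: int state IDs, accept flags,
// and transitions labelled with an inclusive [min, max] range.
class SimpleAutomaton {
    final List<Boolean> accept = new ArrayList<>();
    final List<int[]> transitions = new ArrayList<>(); // {source, dest, min, max}

    int createState(boolean isAccept) {
        accept.add(isAccept);
        return accept.size() - 1;
    }

    void addTransition(int source, int dest, int min, int max) {
        transitions.add(new int[] {source, dest, min, max});
    }

    // Layout: state count, one accept byte per state, transition count,
    // then four ints per transition.
    byte[] toBytes() {
        ByteBuffer buf = ByteBuffer.allocate(4 + accept.size() + 4 + 16 * transitions.size());
        buf.putInt(accept.size());
        for (boolean a : accept) {
            buf.put((byte) (a ? 1 : 0));
        }
        buf.putInt(transitions.size());
        for (int[] t : transitions) {
            for (int v : t) {
                buf.putInt(v);
            }
        }
        return buf.array();
    }

    // Reads the same layout back and rebuilds states before transitions.
    static SimpleAutomaton fromBytes(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        SimpleAutomaton a = new SimpleAutomaton();
        int numStates = buf.getInt();
        for (int i = 0; i < numStates; i++) {
            a.createState(buf.get() != 0);
        }
        int numTransitions = buf.getInt();
        for (int i = 0; i < numTransitions; i++) {
            int source = buf.getInt();
            int dest = buf.getInt();
            int min = buf.getInt();
            int max = buf.getInt();
            a.addTransition(source, dest, min, max);
        }
        return a;
    }
}
```

Writing the counts first lets the reader allocate and loop without sentinels. A version number at the front of the stream would be a sensible addition if the layout might change.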

Mike McCandless

http://blog.mikemccandless.com




Re: Persistence/Serialization of Automaton

Posted by José Tomás Atria <jt...@gmail.com>.
Thanks for the link to the archives, Jim, but I get the feeling that the
approach discussed in that thread may be overkill for what I want to do.
I'm not even using Query objects, just filtering Terms instances with
automata.

Short version: I basically wanted to know whether there is code somewhere in
Lucene that allows Automaton instances to be persisted and, if this is
not currently available, whether it is more or less simple to implement or
there are some gotchas I should be aware of. I am not an FSA expert, nor am I
familiar with Lucene's implementation, but from my basic understanding,
automata are basically int arrays, so it should be trivial to implement
persistence support.

I was looking at the code for Automaton.copy( Automaton other ), but it
seems that this method relies on some state variables in the copied
automaton, and that's when I decided to come here and ask before shooting
myself in the foot.

The long version:

What I am trying to do may be sort of a stretch with Lucene, but it's
actually pretty efficient and straightforward: My application needs to
define sets of terms that will be included/excluded and/or grouped to build
a lexicon, a stable list of words represented as sets of terms. Right now,
I'm doing this more or less like:

Automaton included = getIncludeRule();
Automaton excluded = getExcludeRule();
// included AND NOT excluded
Automaton filter = Operations.intersection(
    included,
    Operations.complement( excluded, Operations.DEFAULT_MAX_DETERMINIZED_STATES ) );
Terms terms = leafReader.terms( "field" );
// the second argument is the start term; null starts at the first match
TermsEnum tEnum = terms.intersect( new CompiledAutomaton( filter ), null );
while( tEnum.next() != null ) {
    // add term to lexicon.
}
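
The include/exclude rule above amounts to "matches the include rule and does not match the exclude rule". A sketch of that same logic over an in-memory sorted term set, using java.util.regex as a stand-in for Lucene automata (a swapped-in technique; LexiconFilter and filter are hypothetical names):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.regex.Pattern;

class LexiconFilter {
    // Keeps terms that fully match 'include' and do not fully match 'exclude',
    // i.e. include AND (NOT exclude), visiting terms in sorted order.
    static List<String> filter(SortedSet<String> terms, Pattern include, Pattern exclude) {
        List<String> lexicon = new ArrayList<>();
        for (String term : terms) {
            if (include.matcher(term).matches() && !exclude.matcher(term).matches()) {
                lexicon.add(term);
            }
        }
        return lexicon;
    }
}
```

Unlike an automaton intersected with a terms dictionary, this checks every term one by one, so it only illustrates the set logic and not the performance.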

Grouping terms works in a similar way:

Terms terms = leafReader.terms( "field" );
Map<Automaton,String> termSets = getTermSets();
for( Entry<Automaton,String> tSet : termSets.entrySet() ) {
     CompiledAutomaton cau = new CompiledAutomaton( tSet.getKey() );
     TermsEnum tEnum = terms.intersect( cau, null );
     MyTermGroup grp = new MyTermGroup( tSet.getKey(), tSet.getValue() );
     while( tEnum.next() != null ) {
           grp.addAnddoStuff( tEnum );
     }
}

This works extremely well. It's blazing fast, and it gives me a
very clean and efficient API for building sets of (groups of) terms that I
can then use as the basis for a stable corpus lexicon for downstream
analysis. It also lets me avoid running searches and rely
only on IndexReader instances (maybe this is stupid? I don't have much
experience with Lucene's search aspects).

However, lacking a way to persist the automata that define a lexicon, I
have to build the set from scratch every time, and I'm not totally
comfortable doing that. Hence the question above: I would assume that
automata are more or less straightforward to serialize, given that they are
nothing but int arrays, right? But I am not an FSA expert, and I'm not
very familiar with the implementation details of Lucene's automata.

Any tips would be greatly appreciated :)









-- 
entia non sunt multiplicanda praeter necessitatem

Re: Persistence/Serialization of Automaton

Posted by Erick Erickson <er...@gmail.com>.
I'm really out of my league here, but some of the suggester stuff
builds an image on disk and some of the implementations use FSTs,
which are at least in the ballpark.

What I'm saying here is that the code may already be in place, or at
least a place to start.

And I have to ask, "why do you want to do this in the first place?".
What is the problem you're trying to solve anyway?

Best,
Erick



Re: Persistence/Serialization of Automaton

Posted by "McKinley, James T" <ja...@cengage.com>.
Here's an archive link from this mailing list regarding serializing queries; I guess this would work for Automaton objects as well.

http://mail-archives.apache.org/mod_mbox/lucene-java-user/201603.mbox/browser

Hope it helps.

Jim