
Posted to java-user@lucene.apache.org by Michael McCandless <lu...@mikemccandless.com> on 2017/10/02 10:27:10 UTC

Re: Binary Automaton

I'm not sure this is exactly what you are asking, but Lucene's terms are
already byte[] (default UTF-8 encoded from char[] terms), and the automata
that are created for searching (e.g. by WildcardQuery, PrefixQuery,
FuzzyQuery, AutomatonQuery) are also byte based (see the crazy
UTF32ToUTF8.java conversion class).  Lucene's Automaton class uses integer
labels on the transitions, so as long as you ensure those ints never fall
outside of an unsigned byte (0-255) then it's byte-based.
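
As a rough illustration of the first point, here is a minimal sketch using only the JDK (no Lucene classes). The class and method names (TermBytesDemo, utf8Labels) are hypothetical; the sketch only shows how a char-based term turns into UTF-8 bytes, each of which can be read as an int in the range 0-255.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Lucene: shows the UTF-8 bytes of a term
// as unsigned int values, the alphabet a byte-based automaton would see.
final class TermBytesDemo {

    /** Returns the UTF-8 encoding of the term, one unsigned value (0..255) per byte. */
    static int[] utf8Labels(String term) {
        byte[] bytes = term.getBytes(StandardCharsets.UTF_8);
        int[] labels = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            // Java bytes are signed, so mask to get the unsigned value.
            labels[i] = bytes[i] & 0xFF;
        }
        return labels;
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "cafe" with an acute accent on the last letter;
        // that letter needs two bytes in UTF-8.
        System.out.println(Arrays.toString(utf8Labels("caf\u00e9")));
        // prints [99, 97, 102, 195, 169]
    }
}
```

Four chars become five bytes here, which is why an automaton over terms has to work on the encoded bytes rather than on the original chars.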

Mike McCandless

http://blog.mikemccandless.com

On Sat, Sep 30, 2017 at 2:58 PM, Dawid Weiss <da...@gmail.com> wrote:

> > Preface: I don't know how the automaton is implemented deep inside Lucene,
>
> Well, you can take a look, it's open source. :) There are two
> different finite state automata inside Lucene: one is pretty much a
> "read-only" transducer from unique input sequences (of bytes) into an
> output. This is the FST<?> class. The other is the Automaton class, which
> has been ported from the Brics library [1].
>
> I can't really relate to your comment about fast querying for
> sub-automata; sounds interesting though. Dig in the code and suggest a
> patch (or even demonstrate what you came up with!).
>
> Dawid
>
> [1] http://www.brics.dk/automaton/
>
> > but (considering that the automaton is built on the fly when the index
> > is already present) I imagine that the automaton scans the
> > lexicon/tokens present in the Lucene index to find the document
> > references (solution 1).
> > In my opinion there are two different generic solutions for using
> > automata:
> > 1) create an automaton for parsing the tokens present in the Lucene
> > table, as described above.
> > 2) create a pattern-matching automaton (on binary data, or better, on an
> > abstract stream, which would be more generic) and put its states
> > directly in an index. In this case you can very quickly retrieve the
> > documents matching a specific automaton built when you created the
> > index (or a sub-automaton representing a subset of the same states).
> > The second solution could maybe be used to map a complex structure
> > into a single Lucene document field, so that you can then find nested
> > embedded information. This way I do not need to use multiple Lucene
> > documents (which could create performance and scalability problems).
> > In many cases this solution could be faster than actual joins, for
> > example, and be useful in bioinformatics or in all those cases where
> > the data is not a basic ADT.
> >
> > Cristian
> >
> > 2017-09-30 12:24 GMT+02:00 Dawid Weiss <da...@gmail.com>:
> >
> >> > Hi, is it possible to create an Automaton in Lucene by parsing not a
> >> > string but a byte array?
> >>
> >> Can you state what problem you are trying to solve? This seems to be a
> >> question stripped of a more general context -- why do you need those
> >> byte-based automata?
> >>
> >> Dawid

Re: Binary Automaton

Posted by Michael McCandless <lu...@mikemccandless.com>.

Oh I was simply explaining that the Lucene Automaton API uses "int" labels,
and so if you want an automaton operating in byte space, you just need to
ensure those ints only use the range supported by unsigned bytes (0 - 255).
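
A minimal sketch of that idea, using only the JDK. This is not Lucene's Automaton class: ByteSpaceAutomaton and all of its methods are hypothetical names for a toy deterministic automaton. Its transitions carry int label ranges, and it stays in byte space simply by rejecting any range outside 0-255.

```java
import java.util.ArrayList;
import java.util.List;

// Toy automaton, not Lucene's API: int labels on transitions, restricted to 0..255.
final class ByteSpaceAutomaton {
    // Each transition is {source, dest, minLabel, maxLabel}.
    private final List<int[]> transitions = new ArrayList<>();
    private final List<Boolean> accept = new ArrayList<>();

    /** Adds a state and returns its number; the first state created (0) is the initial state. */
    int createState() {
        accept.add(false);
        return accept.size() - 1;
    }

    void setAccept(int state, boolean isAccept) {
        accept.set(state, isAccept);
    }

    void addTransition(int source, int dest, int min, int max) {
        // The labels are ints, so this check is what keeps the automaton byte-based.
        if (min < 0 || max > 255 || min > max) {
            throw new IllegalArgumentException("label range must lie within 0..255");
        }
        transitions.add(new int[] {source, dest, min, max});
    }

    /** Runs the automaton over raw bytes, reading each byte as an unsigned label. */
    boolean run(byte[] input) {
        int state = 0;
        for (byte b : input) {
            int label = b & 0xFF; // Java bytes are signed; mask to 0..255
            int next = -1;
            for (int[] t : transitions) {
                if (t[0] == state && label >= t[2] && label <= t[3]) {
                    next = t[1];
                    break;
                }
            }
            if (next < 0) {
                return false; // no transition for this byte
            }
            state = next;
        }
        return accept.get(state);
    }

    /** Example automaton: accepts any byte sequence starting with 0xCA 0xFE. */
    static ByteSpaceAutomaton cafePrefix() {
        ByteSpaceAutomaton a = new ByteSpaceAutomaton();
        int s0 = a.createState();
        int s1 = a.createState();
        int s2 = a.createState();
        a.addTransition(s0, s1, 0xCA, 0xCA);
        a.addTransition(s1, s2, 0xFE, 0xFE);
        a.addTransition(s2, s2, 0, 255); // any trailing bytes
        a.setAccept(s2, true);
        return a;
    }

    public static void main(String[] args) {
        ByteSpaceAutomaton a = cafePrefix();
        System.out.println(a.run(new byte[] {(byte) 0xCA, (byte) 0xFE, 0x00, (byte) 0xFF})); // true
        System.out.println(a.run(new byte[] {(byte) 0xCA})); // false
    }
}
```

The int type only sets the width of the label; the input alphabet is whatever range of values the transitions actually use, so the payload format stays plain bytes.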

Mike McCandless

http://blog.mikemccandless.com

On Mon, Oct 2, 2017 at 1:30 PM, José Tomás Atria <jt...@gmail.com> wrote:

> Mike, could you clarify what you meant by the int comment at the end of
> your last message? I fail to see the significance of having multibyte
> transition labels for the format of the payloads the automaton will run
> on...
>
> Thanks!
> Jta

Re: Binary Automaton

Posted by José Tomás Atria <jt...@gmail.com>.

Mike, could you clarify what you meant by the int comment at the end of
your last message? I fail to see the significance of having multibyte
transition labels for the format of the payloads the automaton will run
on...

Thanks!
Jta

-- 

sent from a phone. please excuse terseness and tpyos.

enviado desde un teléfono. por favor disculpe la parquedad y los erroers.

Re: Binary Automaton

Posted by Cristian Lorenzetto <cr...@gmail.com>.

It sounds like a good way :) Maybe the code to develop it is not so huge. Thanks
for the suggestions :)
