Posted to dev@asterixdb.apache.org by sa...@gmail.com on 2019/04/25 07:08:52 UTC

Add record as argument to Java UDF

Hi devs!

Given a datatype RankingResultType and a dataset RankingResult(RankingResultType) that contains only one record, what is the correct way to pass a single RankingResult record as an argument to a Java UDF from within a SQL++ UDF? The SQL++ UDF should return the record produced by the Java UDF, since that record will be stored in the dataset that the feed using the SQL++ UDF is attached to.

CREATE FUNCTION rank(newItem) {
 LET rankingResult = *must select the record here*,
 SELECT testlib#detectRelevance(newItem, *must pass RankingResult record here*)
};

I have tried a few different approaches, for instance:
1. LET rankingResult = (SELECT VALUE r FROM RankingResult r)
   SELECT testlib#detectRelevance(newItem, rankingResult)
2. LET rankingResult = (SELECT VALUE r FROM RankingResult r)[0]
   SELECT testlib#detectRelevance(newItem, rankingResult)

The first approach throws a TypeMismatchException, ASX1002: Type mismatch: function testlib#detectRelevance expects its 2nd input parameter to be of type object, but the actual input type is array

I therefore tried to access the first element of the array in the second approach, but that does not compile:
ASX1079: Compilation error: The input type union(RankingResultType: closed {
  id: bigint, 
  first: RankingType: open { score: double }, 
  second: RankingType: open {score: double}, 
  third: RankingType: open { score: double},
  fourth: RankingType: open {score: double},
  fifth: RankingType: open {score: double}
} , null, missing) is not a valid record type! 

Could you maybe point me in the right direction? 
Thanks in advance!

Best,
Sandra

Re: Add record as argument to Java UDF

Posted by sa...@gmail.com.

Hi Dmitry, thanks for your reply!

I changed the Java function declaration, defined the SQL++ function as you described, and applied it to the TwitterFeed. However, the "start feed TwitterFeed" query then never finishes executing; it just hangs (no job execution time is shown in the web interface). I am currently using an AsterixDB version provided by Xikui, which uses a decoupled ingestion framework.

Maybe Xikui knows whether there is any problem with passing primitive types to the Java UDF in that version? The version with the decoupled ingestion framework uses a different function signature than the master version.

Best,
Sandra
On 2019/04/26 00:02:54, Dmitry Lychagin <dm...@couchbase.com.INVALID> wrote: 
> Sandra, 
> 
> Your approach #2 is on the right track, but it looks like there is a bug in how the external function framework handles optional record types.
> The return type for "(SELECT VALUE r FROM RankingResult r)[0]" is computed as "RankingResultType?" which means that it could either be a record or NULL or MISSING.
> There's a rule in the optimizer that deals with external functions and that rule incorrectly fails on optional record types.
> It'd be great if you could file a bug for this.
> 
> As a workaround try passing record fields as primitive types to your function instead of the whole record.
> LET rankingResult = (SELECT VALUE r FROM RankingResult r)[0] 
> SELECT testlib#detectRelevance(newItem, rankingResult.first.score, rankingResult.second.score, rankingResult.third.score, rankingResult.fourth.score, rankingResult.fifth.score)
> 
> You'll also need to change the function declaration to accept primitive types instead of the record type:
> <argument_type> ..., ADouble, ADouble, ADouble, ADouble, ADouble </argument_type>
> 
> Thanks,
> -- Dmitry
>  

Re: Add record as argument to Java UDF

Posted by Dmitry Lychagin <dm...@couchbase.com.INVALID>.

Sandra, 

Your approach #2 is on the right track, but it looks like there is a bug in how the external function framework handles optional record types.
The return type for "(SELECT VALUE r FROM RankingResult r)[0]" is computed as "RankingResultType?" which means that it could either be a record or NULL or MISSING.
There's a rule in the optimizer that deals with external functions and that rule incorrectly fails on optional record types.
It'd be great if you could file a bug for this.

As a workaround try passing record fields as primitive types to your function instead of the whole record.
LET rankingResult = (SELECT VALUE r FROM RankingResult r)[0] 
SELECT testlib#detectRelevance(newItem, rankingResult.first.score, rankingResult.second.score, rankingResult.third.score, rankingResult.fourth.score, rankingResult.fifth.score)

You'll also need to change the function declaration to accept primitive types instead of the record type:
<argument_type> ..., ADouble, ADouble, ADouble, ADouble, ADouble </argument_type>

Thanks,
-- Dmitry
 

On 4/25/19, 3:57 PM, "sandraskarshaug@" <gmail.com sandraskarshaug@gmail.com> wrote:

    I forgot to add a link to the paper about the decoupled ingestion framework [1].
    
    [1] https://arxiv.org/pdf/1902.08271.pdf
    

Re: Add record as argument to Java UDF

Posted by sa...@gmail.com.

I forgot to add a link to the paper about the decoupled ingestion framework [1].

[1] https://arxiv.org/pdf/1902.08271.pdf

On 2019/04/25 22:54:15, sandraskarshaug@gmail.com <sa...@gmail.com> wrote: 
> Hi, thanks for your reply! I will try to be a bit more precise :-) 
> 
> I am currently testing the decoupled framework, and I would like to use data from another dataset when enriching tweets, here being data from the RankingResult dataset. Additionally, I would like to send the incoming tweet, as well as a record from RankingResult (say with id = 1) to a Java UDF (from within the SQL++ UDF) for more complex processing, like clustering the tweets, and scoring them based on how relevant they are for a given topic. The scoring within the Java UDF requires information about the record stored in RankingResult.
> 
> Applying the SQL++ UDF to a TwitterFeed, I aim to check whether a tweet is scored higher than the tweets found in the RankingList record (containing the top ranked tweets for the given topic). I see now that I could select the record I wish to use by "SELECT VALUE r FROM RankingResult r where id=1". One can think of the RankingResult dataset to hold one record per topic/user query which I want to find the top k most relevant tweets for. 
> 
> The overall goal of the project is to see if AsterixDB can be used to continuously rank tweets in real-time with respect to a user-defined query, meaning that the RankingResult record for the given user query should be updated continuously. I am however also looking into creating a TreeMap data structure in the Java UDF to hold the top current tweets and their scores, and use this for deciding whether the incoming tweet should switch place with any of the top ranked tweets. However, I would like to update the RankingResult record in order to make the data queryable.
> 
> 
> Thanks in advance,
> Sandra

Re: Add record as argument to Java UDF

Posted by sa...@gmail.com.

Hi, thanks for your reply! I will try to be a bit more precise :-) 

I am currently testing the decoupled framework, and I would like to use data from another dataset when enriching tweets, here being data from the RankingResult dataset. Additionally, I would like to send the incoming tweet, as well as a record from RankingResult (say with id = 1) to a Java UDF (from within the SQL++ UDF) for more complex processing, like clustering the tweets, and scoring them based on how relevant they are for a given topic. The scoring within the Java UDF requires information about the record stored in RankingResult.

Applying the SQL++ UDF to a TwitterFeed, I aim to check whether a tweet is scored higher than the tweets found in the RankingList record (containing the top ranked tweets for the given topic). I see now that I could select the record I wish to use with "SELECT VALUE r FROM RankingResult r where id=1". One can think of the RankingResult dataset as holding one record per topic/user query for which I want to find the top k most relevant tweets.

The overall goal of the project is to see if AsterixDB can be used to continuously rank tweets in real time with respect to a user-defined query, meaning that the RankingResult record for the given user query should be updated continuously. I am, however, also looking into creating a TreeMap data structure in the Java UDF to hold the current top tweets and their scores, and using it to decide whether the incoming tweet should switch places with any of the top ranked tweets. However, I would still like to update the RankingResult record in order to make the data queryable.
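
To make the TreeMap idea concrete, here is a minimal sketch in plain Java. It is only an illustration under assumptions: the class name TopKTweets, the methods offer and ranked, the use of a tweet id string, and the tie handling are all hypothetical, and none of it is AsterixDB API. How such a structure would be created and kept alive inside a Java UDF is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical helper (not part of AsterixDB): keeps the k highest-scoring tweets. */
final class TopKTweets {
    private final int k;
    // Sorted ascending by score, so firstKey() is the lowest score currently kept.
    private final TreeMap<Double, String> byScore = new TreeMap<>();

    TopKTweets(int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        this.k = k;
    }

    /** Offers a scored tweet; returns true if it entered the top k. */
    boolean offer(String tweetId, double score) {
        // Simplifying assumption: on an exact score tie the earlier tweet stays.
        if (byScore.containsKey(score)) {
            return false;
        }
        if (byScore.size() < k) {
            byScore.put(score, tweetId);
            return true;
        }
        if (score < byScore.firstKey()) {
            return false; // lower than the weakest tweet currently kept
        }
        byScore.pollFirstEntry(); // evict the weakest tweet
        byScore.put(score, tweetId);
        return true;
    }

    /** Returns the kept tweet ids, highest score first. */
    List<String> ranked() {
        return new ArrayList<>(byScore.descendingMap().values());
    }
}
```

Keying the map by score keeps the weakest entry at firstKey(), so deciding whether an incoming tweet displaces a ranked one is a single comparison. A real implementation would need a proper policy for equal scores, for example a composite key of score and tweet id.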

Thanks in advance,
Sandra

On 2019/04/25 22:10:56, Mike Carey <dt...@gmail.com> wrote: 
> I will let someone else chime in on what the compilation error might be 
> about, but approach 1 has the problem that you rightly tried to correct 
> in approach 2 (because SELECT always returns an array of results).  But 
> - could you say a bit more - up 5000 feet - about the use case you are 
> trying to address...?  It's not clear (to me) why one might want to have 
> a single-item dataset - perhaps that's just a part of your 
> trying-to-make-this-work debugging - but it might help if the group 
> could see what you are trying to do overall.  (E.g., if you just want to 
> process incoming records on a feed, you wouldn't need another dataset 
> for that.  What's the more general picture/desire?)
> 
> Cheers,
> 
> Mike

Re: Add record as argument to Java UDF

Posted by Mike Carey <dt...@gmail.com>.

I will let someone else chime in on what the compilation error might be 
about, but approach 1 has the problem that you rightly tried to correct 
in approach 2 (because SELECT always returns an array of results).  But 
- could you say a bit more - up 5000 feet - about the use case you are 
trying to address...?  It's not clear (to me) why one might want to have 
a single-item dataset - perhaps that's just a part of your 
trying-to-make-this-work debugging - but it might help if the group 
could see what you are trying to do overall.  (E.g., if you just want to 
process incoming records on a feed, you wouldn't need another dataset 
for that.  What's the more general picture/desire?)

Cheers,

Mike
