Posted to notifications@asterixdb.apache.org by AsterixDB Code Review <do...@asterix-gerrit.ics.uci.edu> on 2021/04/29 18:08:24 UTC

Change in asterixdb[master]: [ASTERIXDB-2894] Update UDF docs

From Ian Maxon <im...@uci.edu>:

Ian Maxon has uploaded this change for review. ( https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/11225 )


Change subject: [ASTERIXDB-2894] Update UDF docs
......................................................................

[ASTERIXDB-2894] Update UDF docs

Change-Id: Id9780d72960f9094c29f7f5766185782069fe7cf
---
M asterixdb/asterix-doc/src/main/user-defined_function/udf.md
1 file changed, 44 insertions(+), 9 deletions(-)



  git pull ssh://asterix-gerrit.ics.uci.edu:29418/asterixdb refs/changes/25/11225/1

diff --git a/asterixdb/asterix-doc/src/main/user-defined_function/udf.md b/asterixdb/asterix-doc/src/main/user-defined_function/udf.md
index fe72789..50721a9 100644
--- a/asterixdb/asterix-doc/src/main/user-defined_function/udf.md
+++ b/asterixdb/asterix-doc/src/main/user-defined_function/udf.md
@@ -27,7 +27,8 @@
 
 ## <a name="authentication">Endpoints and Authentication</a>
 
-The UDF endpoint is not enabled by default until authentication has been configured properly. To enable it, we
+The UDF endpoint is not enabled by default until authentication has been configured properly. Even when enabled, it will
+only be available on the loopback interface on each NC for security purposes. To enable it, we
 will need to set the path to the credential file and populate it with our username and password.
 
 The credential file is a simple `/etc/passwd` style text file with usernames and corresponding `bcrypt` hashed and salted
@@ -50,9 +51,7 @@
 ## <a name="installingUDF">Installing a Java UDF Library</a>
 
 To install a UDF package to the cluster, we need to send a Multipart Form-data HTTP request to the `/admin/udf` endpoint
-of the CC at the normal API port (`19004` by default). The request should use HTTP Basic authentication. This means your
-credentials will *not* be obfuscated or encrypted *in any way*, so submit to this endpoint over localhost or a network
-where you know your traffic is safe from eavesdropping. Any suitable tool will do, but for the example here I will use
+of the CC at the normal API port (`19004` by default). Any suitable tool will do, but for the example here I will use
 `curl` which is widely available.
 
 For example, to install a library with the following criteria:
@@ -65,7 +64,7 @@
 
 we would execute
 
-    curl -v -u admin:admin -X POST -F 'data=@./lib.zip' localhost:19004/admin/udf/udfs/testlib
+    curl -v -u admin:admin -X POST -F 'data=@./lib.zip' -F 'type=java' localhost:19004/admin/udf/udfs/testlib
 
 Any response other than `200` indicates an error in deployment.
 
@@ -125,7 +124,7 @@
 
 Then, deploy it the same as the Java UDF was, with the library name `pylib` in `udfs` dataverse
 
-    curl -v -u admin:admin -X POST -F 'data=@./lib.pyz' localhost:19002/admin/udf/udfs/pylib
+    curl -v -u admin:admin -X POST -F 'data=@./lib.pyz' -F 'type=python' localhost:19002/admin/udf/udfs/pylib
 
 With the library deployed, we can define a function within it for use. For example, to expose the Python function
 `sentiment` in the module `sentiment_mod` in the class `sent_model`, the `CREATE FUNCTION` would be as follows
@@ -140,11 +139,11 @@
 result for the same input, irrespective of when or how many times the function is called on that input. 
 This particular function behaves the same on each input, so it satisfies the deterministic property. 
 This enables better optimization of queries including this function.
-If a function is not deterministic then it should be declared as such by using `WITH` sub-clause:
+If a function is not deterministic then it should be declared as such by using a `WITH` sub-clause:
 
     USE udfs;
 
-    CREATE FUNCTION sentiment(a)
+    CREATE FUNCTION sentiment(text)
       AS "sentiment_mod", "sent_model.sentiment" AT pylib
       WITH { "deterministic": false }
 
@@ -161,6 +160,42 @@
     SELECT t.msg as msg, sentiment(t.msg) as sentiment
     FROM Tweets t;
 
+## <a name="pytpes">Python Type Mappings</a>
+
+Currently only a subset of AsterixDB types are supported in Python UDFs. The supported types are as follows:
+
+- Integer types (int8,16,32,64)
+- Floating point types (float, double)
+- String
+- Boolean
+- Arrays, Sets (casted to lists)
+- Objects (casted to dict)
+
+Unsupported types can be casted to these in SQL++ first in order to be passed to a Python UDF
+
+## <a name="execution">Execution Model For UDFs</a>
+
+AsterixDB queries are deployed across the cluster as Hyracks jobs. A Hyracks job has a lifecycle that can be simplified
+for the purposes of UDFs to 
+ - A pre-run phase which allocates resources, `open` 
+ - The time during which the job has data flowing through it, `nextFrame`
+ - Cleanup and shutdown in `close`. 
+
+If a SQL++ function is defined as a member of a class in the library, the class will be instantiated first 
+during `open`. The class will exist in memory for the lifetime of the query. Therefore if your function needs to reference
+files or other data that would be costly to load per-call, making it a member variable that is initialized in the constructor
+of the object will greatly increase the performance of the SQL++ function.
+
+For each function invoked during a query, there will be an independent instance of the function per data partition. This
+means that the function must not assume there is any sort of global state or that it can assume things about the layout
+of the data. The execution will be parallel across the entire cluster at the level of data parallelism.
+
+After initialization, the function bound in the SQL++ function definition is called once per tuple during the query 
+execution (i.e. `nextFrame`). Unless the function specifies `null-call` in the `WITH` clause, `NULL` values will be
+skipped. 
+
+At the close of the query, the function is torn down and not re-used in any way. All functions should assume that 
+nothing will persist in-memory outside of the lifetime of a query, and any behavior contrary to this is undefined.
 
 ## <a id="UDFOnFeeds">Attaching a UDF on Data Feeds</a>
 
@@ -245,7 +280,7 @@
 functions declared with the library are removed. First we'll drop the function we declared earlier:
 
     USE udfs;
-    DROP FUNCTION mysum@2;
+    DROP FUNCTION mysum(a,b);
 
 Then issue the proper `DELETE` request
 

-- 
To view, visit https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/11225
To unsubscribe, or for help writing mail filters, visit https://asterix-gerrit.ics.uci.edu/settings

Gerrit-Project: asterixdb
Gerrit-Branch: master
Gerrit-Change-Id: Id9780d72960f9094c29f7f5766185782069fe7cf
Gerrit-Change-Number: 11225
Gerrit-PatchSet: 1
Gerrit-Owner: Ian Maxon <im...@uci.edu>
Gerrit-MessageType: newchange
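
As an illustration of the "Execution Model For UDFs" text added in this change, here is a minimal Python sketch of the pattern it recommends: build costly state once in the constructor, then reuse it on every per-tuple call. The names `sentiment_mod`, `sent_model`, and `sentiment` are taken from the `CREATE FUNCTION` example in the diff. The word lists and the scoring logic are hypothetical stand-ins, and nothing here is meant to represent AsterixDB's actual library code.

```python
# sentiment_mod.py -- hypothetical contents; the scoring logic is a stand-in.


class sent_model:
    def __init__(self):
        # Runs once when the class is instantiated (during `open`, per the
        # doc text). A real library might load a model file here; this
        # sketch only builds two small in-memory word sets.
        self.positive = {"good", "great", "love"}
        self.negative = {"bad", "awful", "hate"}

    def sentiment(self, text):
        # Called once per tuple. It only reads state built in the
        # constructor and keeps no global state, since each data
        # partition gets its own independent instance.
        words = text.lower().split()
        score = sum(w in self.positive for w in words) - sum(
            w in self.negative for w in words
        )
        if score > 0:
            return 1
        if score < 0:
            return -1
        return 0


if __name__ == "__main__":
    # Local smoke test only; inside AsterixDB the engine would do the calling.
    model = sent_model()
    print(model.sentiment("I love this great day"))
```

Because the instance is torn down when the query closes, anything cached on `self` should be treated as rebuilt for every query.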
