You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@nifi.apache.org by "Joe Witt (Jira)" <ji...@apache.org> on 2020/05/04 14:49:00 UTC

[jira] [Updated] (MINIFICPP-1199) Integrates MiNiFi C++ with H2O Driverless AI Python Scoring Pipeline To Do ML Inference on Edge

     [ https://issues.apache.org/jira/browse/MINIFICPP-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joe Witt updated MINIFICPP-1199:
--------------------------------
    Priority: Blocker  (was: Major)

> Integrates MiNiFi C++ with H2O Driverless AI Python Scoring Pipeline To Do ML Inference on Edge
> -----------------------------------------------------------------------------------------------
>
>                 Key: MINIFICPP-1199
>                 URL: https://issues.apache.org/jira/browse/MINIFICPP-1199
>             Project: Apache NiFi MiNiFi C++
>          Issue Type: New Feature
>    Affects Versions: master
>         Environment: Ubuntu 18.04 in AWS EC2
> MiNiFi C++ 0.7.0
>            Reporter: James Medel
>            Priority: Blocker
>             Fix For: master
>
>
> *MiNiFi C++ and H2O Driverless AI Integration* via Custom Python Processors:
> Integrates MiNiFi C++ with H2O's Driverless AI by Using Driverless AI's Python Scoring Pipeline and MiNiFi's Custom Python Processors. Uses the Python Processors to execute the Python Scoring Pipeline scorer to do batch scoring and real-time scoring for one or more predicted labels on test data in the incoming flow file content. I would like to contribute my processors to MiNiFi C++ as a new feature.
>  
> *3 custom python processors* created for MiNiFi:
> *H2oPspScoreRealTime* - Executes H2O Driverless AI's Python Scoring Pipeline to do interactive scoring (real-time) scoring on an individual row or list of test data within each incoming flow file. Uses H2O's open-source Datatable library to load test data into a frame, then converts it to pandas dataframe. Pandas is used to convert the pandas dataframe rows to a list of lists, but since each flow file passing through this processor should have only 1 row, we extract the 1st list. Then that list is passed into the Driverless AI's Python scorer.score() function to predict one or more predicted labels. The prediction is returned to a list. The number of predicted labels is specified when the user built the Python Scoring Pipeline in Driverless AI. With that knowledge, there is a property for the user to pass in one or more predicted label names that will be used as the predicted header. I create a comma separated string using the predicted header and predicted value. The predicted header(s) is on one line followed by a newline and the predicted value(s) is on the next line followed by a newline. The string is written to the flow file content. Flow File attributes are added to the flow file for the number of lists scored and the predicted label name and its associated score. Finally, the flow file is transferred on a success relationship.
>  
> *H2oPspScoreBatches* - Executes H2O Driverless AI's Python Scoring Pipeline to do batch scoring on a frame of data within each incoming flow file. Uses H2O's open-source Datatable library to load test data into a frame. Each frame from the flow file passing through this processor should have multiple rows. That frame is passed into the Driverless AI's Python scorer.score_batch() function to predict one or more predicted labels. The prediction is returned to a pandas dataframe, then that dataframe is converted to a string, so it can be written to the flow file content. Flow File attributes are added to the flow file for the number of rows scored. There are also flow file attributes added for the predicted label name and its associated score for the first row in the frame. Finally, the flow file is transferred on a success relationship.
>  
> *ConvertDsToCsv* - Converts data source of incoming flow file to csv. Uses H2O's open-source Datatable library to load data into a frame, then converts it to pandas dataframe. Pandas is used to convert the pandas dataframe to a csv and store it into in-memory text stream StringIO without pandas dataframe index. The csv string data is grabbed using file read() function on the StringIO object, so it can be written to the flow file content. The flow file is transferred on a success relationship.
>  
> *Hydraulic System Condition Monitoring* Data used in MiNiFi Flow:
> The sensor test data I used in this integration comes from [Kaggle: Condition Monitoring of Hydraulic Systems|[https://www.kaggle.com/jjacostupa/condition-monitoring-of-hydraulic-systems#description.txt]]. I was able to predict hydraulic system cooling efficiency through MiNiFi and H2O integration described above. This use case here is hydraulic system predictive maintenance.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)