Posted to dev@nifi.apache.org by 刘岩 <li...@richinfo.cn> on 2016/03/12 17:12:27 UTC

Re:Re: Multiple dataflow jobs management(lots of jobs)

Hi Aldrin

Currently we need to extract 60K tables per day, and the time window is limited to 8 hours. This means that we need to run jobs concurrently, and we need a general overview of what's going on with all those 60K job flows so that we can take further actions.

We have tried Kettle and Talend. Talend is IDE-based, so not what we are looking for, and Kettle crashed because MySQL could not handle Kettle's metadata with 10K jobs.

So we want to use NiFi; this is really the product that we are looking for, but the missing piece here is a dataflow jobs admin page, so we can have multiple NiFi instances running on different nodes but monitor the jobs in one page. If it can integrate with the Ambari metrics API, then we can develop an Ambari View for NiFi jobs monitoring, just like the HDFS View and Hive View.

Thank you very much

Yan Liu

Hortonworks Service Division 

Richinfo, Shenzhen, China (PR)

06/03/2016

---- Original Message ----
From: Aldrin Piri <al...@gmail.com>
To: users <us...@nifi.apache.org>
Cc: dev <de...@nifi.apache.org>
Sent: 2016-03-11 02:27:11
Subject: Re: Mutiple dataflow jobs management(lots of jobs)

Hi Yan,
We can get more into details and particulars if needed, but have you experimented with expression language [1]?  I could see a Cron-driven approach covering your periodic efforts that feeds some number of ExecuteSQL processors (perhaps one for each database you are communicating with), each receiving a table name.  This would certainly cut down on the need for 30k processors mapped one-to-one to tables.
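A minimal sketch of the pattern Aldrin describes, assuming the table list is kept in a plain text file; the processor chain and the attribute name db.table.name are illustrative only, and property names should be checked against the NiFi version in use:

    GetFile (cron-scheduled, picks up tables.txt)
      -> SplitText (one table name per FlowFile)
      -> ExtractText (capture the line into an attribute, e.g. db.table.name)
      -> ExecuteSQL (SQL select query: SELECT * FROM ${db.table.name})

With that shape, one ExecuteSQL per database (rather than per table) covers the 30k tables, since expression language substitutes the table name at run time.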

In terms of monitoring the dataflows, could you describe what else you are searching for beyond the graph view?  NiFi tries to provide context for the flow of data but is not trying to be a sole monitoring solution; we can give information on a per-processor basis, but do not delve into specifics.  There is a summary view for the overall flow where you can monitor stats about the components and connections in the system. We support interoperation with monitoring systems via push (ReportingTask [2]) and pull (REST API [3]) semantics.
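As a rough illustration of the pull side, a short script can poll the REST API [3] for flow-level statistics. This is a sketch only: the host/port are placeholders, a secured instance needs authentication, and the endpoint path shown is from later NiFi releases (older 0.x versions expose similar data under /nifi-api/controller).

    import json
    import urllib.request

    # Placeholder NiFi address; adjust for your instance.
    NIFI = "http://localhost:8080"

    # Overall flow status: active threads, queued FlowFiles/bytes, etc.
    with urllib.request.urlopen(NIFI + "/nifi-api/flow/status") as resp:
        status = json.load(resp)

    print(json.dumps(status, indent=2))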

Any other details beyond your list of how this all interoperates might shed some more light on what you are trying to accomplish.  It seems like NiFi should be able to help with this.  With some additional information we may be able to provide further guidance or at least get some insights on use cases we could look to improve upon and extend NiFi to support.

Thanks!


[1] http://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html
[2] http://nifi.apache.org/docs/nifi-docs/html/developer-guide.html#reporting-tasks
[3] http://nifi.apache.org/docs/nifi-docs/rest-api/index.html



On Sat, Mar 5, 2016 at 9:25 PM, 刘岩 <li...@richinfo.cn> wrote:

Hi All



    I'm trying to adopt NiFi in production but cannot find an admin console that monitors the dataflows.



   The scenario is simple:



   1.  We gather data from Oracle databases to HDFS and then to Hive.

   2.  Residuals/incrementals are updated daily or monthly via NiFi.

   3.  Full dumps of some tables are executed daily or monthly via NiFi.



    It is really simple; however, we have 7 Oracle databases with over 30K tables that need to implement the above scenario.



Which means that I would drag ExecuteSQL elements about 30K times, and also need to lay them out in a nice-looking way on my little 21-inch screen.



Just wondering if there is a table-list-like, groupable and searchable task control and monitoring feature for NiFi.





Thank you very much  in advance





Yan Liu 

Hortonworks Service Division 

Richinfo, Shenzhen, China (PR)

06/03/2016












Re: Re: Re: Re: Multiple dataflow jobs management(lots of jobs)

Posted by Joe Witt <jo...@gmail.com>.
Hello

Do we plan to support Ambari metrics?

- We already do support a reporting task to send data to Ambari.
Additional work is being considered for more full support of Ambari
such that Ambari can be used to deploy/upgrade NiFi.  Great time to
share input/ideas if you have them.
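As a hedged sketch of what pushing NiFi stats into Ambari Metrics could look like from the outside (the built-in Ambari reporting task already does this inside NiFi), the collector URL, port, and payload field names below are assumptions to be checked against the Ambari Metrics Collector documentation rather than a confirmed API contract:

    import json
    import time
    import urllib.request

    # Assumed Ambari Metrics Collector endpoint -- verify host and port for your cluster.
    AMS_URL = "http://ams-collector.example.com:6188/ws/v1/timeline/metrics"

    now_ms = int(time.time() * 1000)
    # Payload shape is an assumption based on the AMS timeline API; check the Ambari docs.
    payload = {"metrics": [{
        "metricname": "nifi.flowfiles.queued",
        "appid": "nifi",
        "hostname": "nifi-node-1.example.com",
        "starttime": now_ms,
        "metrics": {str(now_ms): 1234},   # value would come from NiFi's own status API
    }]}

    req = urllib.request.Request(AMS_URL,
                                 data=json.dumps(payload).encode("utf-8"),
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)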

Are there plans to separate the designer from the executor?

- Well we've described a child project called MiNiFi and you might
have seen recent e-mails that this work is underway.  It will
inherently have the execution and flow design disconnected.  For NiFi
itself we have discussed some really powerful roadmap efforts for
making templates and extensions more powerfully used through a
registry mechanism.  That will in many ways lead to better support for
a design and deploy model.  That said, NiFi was built originally with
a very intentional view against design and deploy so I think it is
likely that there will continue to be preference toward providing an
unprecedented and powerful interactive command and control model.
That interactive command and control model is important not only for
providing a nice intuitive user experience but also to enable
automated system interactions that can alter the dataflow structure
and behavior.

Where can I find examples for processors?

- Here is a pretty good set
https://cwiki.apache.org/confluence/display/NIFI/Example+Dataflow+Templates.
There are also many more made available here
https://github.com/hortonworks-gallery/nifi-templates.

Thanks
Joe

On Sun, Mar 13, 2016 at 2:16 PM, 刘岩 <li...@richinfo.cn> wrote:
> Hi  Joe
>
>
> Thanks for that clarification; we have 3 more questions:
>
>
>     Is NiFi going to have an Ambari metrics API to collect monitoring data?
>
>
>
>     Will the designer and executor be separated from each other?
>
>
>
>     Where can I find demos/examples for each processor?
>
>
> Thank you very much
>
>
> Yan Liu
> Hortonworks Service Division
>
> Richinfo, Shenzhen, China (PR)
>
> 14/03/20

Re:Re: Re: Re: Multiple dataflow jobs management(lots of jobs)

Posted by 刘岩 <li...@richinfo.cn>.
Hi  Joe 



Thanks for that clarification; we have 3 more questions:



    Is NiFi going to have an Ambari metrics API to collect monitoring data?

  

    Will the designer and executor be separated from each other?

  

    Where can I find demos/examples for each processor?



Thank you very much



Yan Liu
Hortonworks Service Division

Richinfo, Shenzhen, China (PR)
14/03/20










Re: Re: Re: Multiple dataflow jobs management(lots of jobs)

Posted by Joe Witt <jo...@gmail.com>.
To clarify about 'HA and master node' - that is for the control plane
itself.  The data continues to flow on all nodes even if the NCM is
down.  That said, we are working to solve it now with zero-master
clustering.

Thanks
Joe

On Sun, Mar 13, 2016 at 12:20 PM, 刘岩 <li...@richinfo.cn> wrote:
> Hi  Thad
>
> Thank you very much for your advice. Kettle can do the job for sure, but
> the metadata I was talking about is the metadata of the job descriptions
> used by Kettle itself. The only option left for Kettle is multiple
> instances, but that also means that we need to develop a master application
> to gather all the instances' metadata.
>
> Moreover, Kettle does not have a web-based GUI for designing and testing
> the jobs; that's why we want NiFi. But again, multiple instances of NiFi
> also lead to an HA problem for the master node, so we turned to Ambari
> metrics for that issue.
>
> Talend has a cloud server doing a similar thing, but it's running on a
> public cloud, which is not accepted by our client.
>
> Kettle is a great ETL tool, but a web-based designer is really the key
> point for the future.
>
>
> Thank you very much
>
> Yan Liu
>
>
> Yan Liu
>
> Hortonworks Service Division
>
> Richinfo, Shenzhen, China (PR)
>
> 14/03/2016
>
>
>
>
> ---- Original Message ----
> From: Thad Guidry <th...@gmail.com>
> To: users <us...@nifi.apache.org>
> Cc: dev <de...@nifi.apache.org>
> Sent: 2016-03-13 23:04:39
>
> Subject: Re: Re: Multiple dataflow jobs management(lots of jobs)
>
> Yan,
>
> Pentaho Kettle (PDI) can also certainly handle your needs. But using 10K
> jobs to accomplish this is not the proper way to set up Pentaho.  Also, using
> MySQL to store the metadata is where you made a wrong choice.  PostgreSQL
> with data silos on SSD drives would be a better choice, while properly doing
> async config [1] and other necessary steps for high writes.  Don't keep
> Pentaho's Table output commit levels at their default of 10k rows when you're
> processing millions of rows! For Oracle 11g or PostgreSQL, where I need
> 30-second time-slice windows for the metadata logging and where I typically
> have less than 1k of data on average per row, I typically will choose 200k
> rows or more in Pentaho's Table output commit option.
>
> I would suggest you contact Pentaho for some ad hoc support or hire some
> consultants to help you learn more, or set up properly for your use case.
> For free, you can also just do a web search on "Pentaho best practices".
> There's a lot to learn from industry experts who already have used these
> tools and know their quirks.
>
> [1]
> http://www.postgresql.org/docs/9.5/interactive/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-ASYNC-BEHAVIOR
>
>
> Thad
> +ThadGuidry
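To make the tuning advice above concrete, here is an illustrative postgresql.conf fragment for a write-heavy metadata repository on SSD storage; the values are placeholders to benchmark, not recommendations from this thread, and [1] above covers the asynchronous-behavior settings in detail:

    # Illustrative starting points for a write-heavy Kettle metadata store (PostgreSQL 9.5).
    effective_io_concurrency = 8       # async I/O depth for SSD-backed storage (see [1])
    synchronous_commit = off           # faster commits; small window of loss on crash
    wal_buffers = 16MB
    max_wal_size = 4GB                 # fewer, larger checkpoints under sustained writes
    checkpoint_completion_target = 0.9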
>
> On Sat, Mar 12, 2016 at 11:00 AM, 刘岩 <li...@richinfo.cn> wrote:
>>
>> Hi Aldrin
>>
>> some additional information.
>>
>> It's a typical ETL offloading use case.
>>
>> Each extraction job should focus on 1 table and 1 table only.  Data will
>> be written to HDFS; this is similar to database staging.
>>
>> The reason why we need to focus on 1 table for each job is that a
>> database error or disconnection might occur during the extraction; if
>> it's running as a script-like extraction job with expression language,
>> then it's hard to re-run or skip that table or those tables.
>>
>> Once the extraction is done, a trigger-like action will do the data
>> cleansing.  This is similar to the ODS layer of data warehousing.
>>
>> If the data quality has passed the quality check, then the table will be
>> marked as cleaned. Otherwise, it will return to the previous step and redo
>> the data extraction, or send an alert/email to the system administrator.
>>
>> If a certain number of tables are all cleaned and checked, then it will
>> call some transforming processor to do the transforming, then push the
>> data into a data warehouse (Hive in our case).
>>
>>
>> Thank you very much
>>
>> Yan Liu
>>
>> Hortonworks Service Division
>>
>> Richinfo, Shenzhen, China (PR)
>>
>> 13/03/2016

Re:Re: Re: Multiple dataflow jobs management(lots of jobs)

Posted by 刘岩 <li...@richinfo.cn>.
Hi Thad

Thank you very much for your advice. Kettle can do the job for sure, but the metadata I was talking about is the metadata of the job descriptions used by Kettle itself. The only option left for Kettle is multiple instances, but that also means that we need to develop a master application to gather all the instances' metadata.

Moreover, Kettle does not have a web-based GUI for designing and testing the jobs; that's why we want NiFi. But again, multiple instances of NiFi also lead to an HA problem for the master node, so we turned to Ambari metrics for that issue.

Talend has a cloud server doing a similar thing, but it's running on a public cloud, which is not accepted by our client.

Kettle is a great ETL tool, but a web-based designer is really the key point for the future.

Thank you very much

Yan Liu

Hortonworks Service Division 

Richinfo, Shenzhen, China (PR)

14/03/2016
Re:Re: Re: Multiple dataflow jobs management(lots of jobs)

Posted by 刘岩 <li...@richinfo.cn>.
Hi  ThadThank you very much for your advice. Kettle can do the job for sure , but the metadata i was talking about is the metadata of the job descriptions used for kettle itself. The only option left for kettle is multiple instances , but that also means that we need to develop a master application to gather all the instances metadata. Moreover , Kettle does not have a Web Based GUI for designing and testing the job , that39s why we want NIFI , but again , multiple instances of nifi also leads to a HA problem for master node, so we turn to ambari metrics for that issue.Talend has a cloud server doing the similar thing, but it39s running on public cloud which is not accepted by our client.Kettle is a great ETL tool, but Web Based designer is really the master point for future.Thank you very muchYan LiuYan Liu 

Hortonworks Service Division 

Richinfo, Shenzhen, China (PR)

14/03/2016----邮件原文----发件人:Thad Guidry  <th...@gmail.com>收件人:users <us...@nifi.apache.org>抄 送: dev  <de...@nifi.apache.org>发送时间:2016-03-13 23:04:39主题:Re: Re: Multiple dataflow jobs management(lots of jobs)
Yan,



Pentaho Kettle (PDI) can also certainly handle your needs. But using 10K jobs to accomplish this is not the proper way to setup Pentaho.  Also, using MySQL to store the metadata is where you made a wrong choice.  PostgreSQL with data silos on SSD drives would be a better choice, while properly doing Async config [1] and other necessary steps for high writes.  Don39t keep Pentaho39s Table output commit levels at their default of 10k rows when your processing millions of rows!) For Oracle 11g or PostgreSQL, where I need 30 sec time slice windows for the metadata logging and where I typically have less than 1k of data on average per row, I typically will choose 200k rows or more in Pentaho39s table output commit option.



I would suggest you contact Pentaho for some adhoc support or hire some consultants to help you learn more, or setup properly for your use case.  For free, you can also just do a web search on "Pentaho best practices".  There39s a lot to learn from industry experts who already have used these tools and know their quirks.



[1] http://www.postgresql.org/docs/9.5/interactive/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-ASYNC-BEHAVIOR





Thad
+ThadGuidry






On Sat, Mar 12, 2016 at 11:00 AM, 刘岩 <li...@richinfo.cn> wrote:Hi Aldrinsome additional information.it39s a typical ETL offloading user case each extraction job should foucs on 1 table and 1 table only.  data will be written on HDFS , this is similar to Database Staging. The reason why we need to foucs on 1 table for each job is because there might be database error or disconnection occur during the extraction , if it39s running as  a script like extraction job with expression langurage, then it39s hard to do the re-running or excape thing on that table or tables.once the extraction is done, a triger like action will do the data cleansing.  this is similar to ODS layer of Datawarehousingif the data quality has passed the quality check , then it will be marked as cleaned. otherwise , it will return to previous step and redo the data extraction, or send alert/email to the  system administrator.if certain numbers of tables were all cleaned and checked , then it will  call some Transforming  processor to do the transforming , then push the data into a datawarehouse (Hive in our case)Thank you very much Yan Liu 

Hortonworks Service Division 

Richinfo, Shenzhen, China (PR)
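
(A rough sketch of the per-table control flow described in the message above, added purely for illustration: one job per table, a quality check after extraction, a retry on failure, and an alert once retries are exhausted. The helper functions are hypothetical placeholders, not NiFi or HDFS APIs.)

    # Sketch only: drive one extraction job per table with retry and alerting.
    MAX_ATTEMPTS = 3

    def process_table(table, extract_to_hdfs, quality_ok, alert):
        for attempt in range(1, MAX_ATTEMPTS + 1):
            try:
                staged = extract_to_hdfs(table)    # staging extract for this one table
                if quality_ok(table, staged):      # cleansing / quality check (ODS-like step)
                    return True                    # table is marked as cleaned
            except Exception:                      # DB error or disconnection: retry this table only
                pass
        alert("table %s failed after %d attempts" % (table, MAX_ATTEMPTS))
        return False

    def run_batch(tables, extract_to_hdfs, quality_ok, alert, transform_to_hive):
        cleaned = [t for t in tables
                   if process_table(t, extract_to_hdfs, quality_ok, alert)]
        if len(cleaned) == len(tables):
            transform_to_hive(cleaned)             # transform/load into Hive only once all are clean
        return cleaned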

13/03/2016

----Original Message----
From: "刘岩" <li...@richinfo.cn>
To: users <us...@nifi.apache.org>
Cc: dev <de...@nifi.apache.org>
Sent: 2016-03-13 00:12:27
Subject: Re:Re: Multiple dataflow jobs management(lots of jobs)
Hi Aldrin,

Currently we need to extract 60K tables per day, and the time window is limited to 8 hours. This means that we need to run jobs concurrently, and we need a general overview of what's going on with all those 60K job flows so we can take further action.

We have tried Kettle and Talend. Talend is IDE-based, so not what we are looking for, and Kettle crashed because MySQL could not handle Kettle's metadata with 10K jobs.

So we want to use NiFi; this is really the product that we are looking for, but the missing piece here is a dataflow jobs admin page, so we can have multiple NiFi instances running on different nodes but monitor the jobs on one page. If it can integrate with the Ambari Metrics API, then we can develop an Ambari View for NiFi jobs monitoring, just like the HDFS View and Hive View.

Thank you very much

Yan Liu

Hortonworks Service Division 

Richinfo, Shenzhen, China (PR)

06/03/2016

----Original Message----
From: Aldrin Piri <al...@gmail.com>
To: users <us...@nifi.apache.org>
Cc: dev <de...@nifi.apache.org>
Sent: 2016-03-11 02:27:11
Subject: Re: Mutiple dataflow jobs management(lots of jobs)

Hi Yan,
We can get more into details and particulars if needed, but have you experimented with expression language?  I could see a Cron driven approach which covers your periodic efforts that feeds some number of ExecuteSQL processors (perhaps one for each database you are communicating with) each having a table.  This would certainly cut down on the need for the 30k processors on a one-to-one basis with a given processor.

In terms of monitoring the dataflows, could you describe what else you are searching for beyond the graph view?  NiFi tries to provide context for the flow of data but is not trying to be a sole monitoring, we can give information on a processor basis, but do not delve into specifics.  There is a summary view for the overall flow where you can monitor stats about the components and connections in the system. We support interoperation with monitoring systems via push (ReportingTask) and pull (REST API [2]) semantics. 
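
(A rough illustration of the pull-based option above, added for context rather than taken from this thread: the script below simply polls a status endpoint on each NiFi instance and prints a one-line summary per node, which could then be forwarded to Ambari Metrics or any other dashboard. The host names and the /nifi-api/flow/status path are assumptions; the exact endpoint depends on the NiFi version.)

    # Sketch only: poll several NiFi instances and print a one-line summary
    # per node.  Host names and the status endpoint are assumptions.
    import json
    import urllib.request

    NIFI_NODES = [
        "http://nifi-node1:8080",   # hypothetical hosts
        "http://nifi-node2:8080",
    ]

    for node in NIFI_NODES:
        try:
            # Assumed endpoint; check the REST API docs for your NiFi version.
            with urllib.request.urlopen(node + "/nifi-api/flow/status", timeout=10) as resp:
                status = json.loads(resp.read().decode("utf-8")).get("controllerStatus", {})
            print("%s queued=%s activeThreads=%s" % (
                node, status.get("queued"), status.get("activeThreadCount")))
        except Exception as exc:
            print("%s unreachable: %s" % (node, exc))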

Any other details beyond your list of how this all interoperates might shed some more light on what you are trying to accomplish.  It seems like NiFi should be able to help with this.  With some additional information we may be able to provide further guidance or at least get some insights on use cases we could look to improve upon and extend NiFi to support.

Thanks!


[1] http://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html
[2] http://nifi.apache.org/docs/nifi-docs/html/developer-guide.html#reporting-tasks
[3] http://nifi.apache.org/docs/nifi-docs/rest-api/index.html



On Sat, Mar 5, 2016 at 9:25 PM, 刘岩 <li...@richinfo.cn> wrote:

Hi All,



    I'm trying to adopt NiFi for production use but cannot find an admin console for monitoring the dataflows.



   The scenario is simple:



   1.  We gather data from Oracle databases to HDFS and then to Hive.

   2.  Residuals/incrementals are updated daily or monthly via NiFi.

   3.  Full dumps of some tables are executed daily or monthly via NiFi.



    It is really simple; however, we have 7 Oracle databases with over 30K tables that need to implement the above scenario.



This means that I would have to drag an ExecuteSQL element onto the canvas 30K times or so, and also arrange them in a nice-looking way on my little 21-inch screen.



Just wondering if there is a table-list-like, groupable and searchable task control and monitoring feature for NiFi.





Thank you very much  in advance





Yan Liu 

Hortonworks Service Division 

Richinfo, Shenzhen, China (PR)

06/03/2016


















Re: Re: Multiple dataflow jobs management(lots of jobs)

Posted by Thad Guidry <th...@gmail.com>.
Yan,

Pentaho Kettle (PDI) can also certainly handle your needs. But using 10K
jobs to accomplish this is not the proper way to set up Pentaho.  Also,
using MySQL to store the metadata is where you made a wrong choice.
PostgreSQL with data silos on SSD drives would be a better choice, while
properly doing Async config [1] and other necessary steps for high writes.
Don't keep Pentaho's Table output commit levels at their default of 10k
rows when you're processing millions of rows! For Oracle 11g or PostgreSQL,
where I need 30 sec time slice windows for the metadata logging and where I
typically have less than 1k of data on average per row, I typically will
choose 200k rows or more in Pentaho's table output commit option.
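
(A rough sketch of the batching idea above, not part of the original message: the connection string, table and column names are hypothetical. It commits every 200k rows, and it relaxes synchronous_commit for the session, which is a related write-throughput knob but a separate one from the async-behavior parameters referenced in [1].)

    # Sketch only: bulk-load rows into PostgreSQL, committing in large batches.
    import psycopg2

    COMMIT_EVERY = 200000  # roughly the commit size suggested above

    def bulk_load(rows, dsn="dbname=etl_meta user=etl"):
        """rows: iterable of (job_id, table_name, status) tuples (hypothetical schema)."""
        conn = psycopg2.connect(dsn)
        cur = conn.cursor()
        # Per-session setting that trades durability of the last few commits
        # for write throughput; separate from the settings in [1].
        cur.execute("SET synchronous_commit TO OFF")
        sql = "INSERT INTO job_log (job_id, table_name, status) VALUES (%s, %s, %s)"
        batch = []
        for row in rows:
            batch.append(row)
            if len(batch) >= COMMIT_EVERY:
                cur.executemany(sql, batch)
                conn.commit()
                batch = []
        if batch:
            cur.executemany(sql, batch)
            conn.commit()
        conn.close()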

I would suggest you contact Pentaho for some ad hoc support or hire some
consultants to help you learn more, or set up properly for your use case.
For free, you can also just do a web search on "Pentaho best practices".
There's a lot to learn from industry experts who already have used these
tools and know their quirks.

[1]
http://www.postgresql.org/docs/9.5/interactive/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-ASYNC-BEHAVIOR


Thad
+ThadGuidry <https://www.google.com/+ThadGuidry>

On Sat, Mar 12, 2016 at 11:00 AM, 刘岩 <li...@richinfo.cn> wrote:

> Hi Aldrin
>
> some additional information.
>
> it's a typical ETL offloading use case
>
> each extraction job should focus on 1 table and 1 table only.  data will
> be written on HDFS , this is similar to Database Staging.
>
> The reason why we need to focus on 1 table for each job is because there
> might be database error or disconnection occur during the extraction , if
> it's running as  a script like extraction job with expression language,
> then it's hard to do the re-running or escape thing on that table or tables.
>
> once the extraction is done, a trigger like action will do the data
> cleansing.  this is similar to ODS layer of Datawarehousing
>
> if the data quality has passed the quality check , then it will be marked
> as cleaned. otherwise , it will return to previous step and redo the data
> extraction, or send alert/email to the  system administrator.
>
> if certain numbers of tables were all cleaned and checked , then it will
> call some Transforming  processor to do the transforming , then push the
> data into a datawarehouse (Hive in our case)
>
>
> Thank you very much
>
> Yan Liu
>
> Hortonworks Service Division
>
> Richinfo, Shenzhen, China (PR)
> 13/03/2016
>
> ----邮件原文----
> *发件人:*"刘岩" <li...@richinfo.cn>
> *收件人:*users  <us...@nifi.apache.org>
> *抄 送: *dev  <de...@nifi.apache.org>
> *发送时间:*2016-03-13 00:12:27
> *主题:*Re:Re: Multiple dataflow jobs management(lots of jobs)
>
>
> Hi Aldrin
>
> Currently  we need to extract 60K tables per day , and the time window is
> limited to 8 Hours.  Which means that we need to run jobs concurrently ,
> and we need a general description of what's going on with all those 60K job
> flows and take further actions.
>
> We have tried Kettle and Talend.  Talend is IDE-based, so not what we
> are looking for, and Kettle crashed because MySQL could not handle
> Kettle's metadata with 10K jobs.
>
> So we want to use Nifi ,  this is really the product that we are looking
> for , but  the missing piece here is a DataFlow jobs Admin Page.  so we can
> have multiple Nifi instances running on different nodes, but monitoring the
> jobs in one page.  If it can integrate with the Ambari Metrics API,  then we
> can develop an Ambari View for Nifi Jobs Monitoring just like HDFS View and
> Hive View.
>
>
> Thank you very much
>
> Yan Liu
>
> Hortonworks Service Division
>
> Richinfo, Shenzhen, China (PR)
> 06/03/2016
>
>
> ----邮件原文----
> *发件人:*Aldrin Piri  <al...@gmail.com>
> *收件人:*users <us...@nifi.apache.org>
> *抄 送: *dev  <de...@nifi.apache.org>
> *发送时间:*2016-03-11 02:27:11
> *主题:*Re: Mutiple dataflow jobs management(lots of jobs)
>
> Hi Yan,
>
> We can get more into details and particulars if needed, but have you
> experimented with expression language?  I could see a Cron driven approach
> which covers your periodic efforts that feeds some number of ExecuteSQL
> processors (perhaps one for each database you are communicating with) each
> having a table.  This would certainly cut down on the need for the 30k
> processors on a one-to-one basis with a given processor.
>
> In terms of monitoring the dataflows, could you describe what else you are
> searching for beyond the graph view?  NiFi tries to provide context for the
> flow of data but is not trying to be a sole monitoring, we can give
> information on a processor basis, but do not delve into specifics.  There
> is a summary view for the overall flow where you can monitor stats about
> the components and connections in the system. We support interoperation
> with monitoring systems via push (ReportingTask) and pull (REST API [2])
> semantics.
>
> Any other details beyond your list of how this all interoperates might
> shed some more light on what you are trying to accomplish.  It seems like
> NiFi should be able to help with this.  With some additional information we
> may be able to provide further guidance or at least get some insights on
> use cases we could look to improve upon and extend NiFi to support.
>
> Thanks!
>
>
> [1]
> http://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html
> [2]
> http://nifi.apache.org/docs/nifi-docs/html/developer-guide.html#reporting-tasks
> [3] http://nifi.apache.org/docs/nifi-docs/rest-api/index.html
>
> On Sat, Mar 5, 2016 at 9:25 PM, 刘岩 <li...@richinfo.cn> wrote:
>
>> Hi All
>>
>>
>>     I'm trying to adopt NiFi for production use but cannot find an admin
>> console for monitoring the dataflows
>>
>>
>>    The scenario is simple:
>>
>>
>>    1.  We gather data from Oracle databases to HDFS and then to Hive.
>>
>>    2.  Residuals/incrementals are updated daily or monthly via NiFi.
>>
>>    3.  Full dumps of some tables are executed daily or monthly via NiFi.
>>
>>
>>     It is really simple; however, we have 7 Oracle databases with
>> over 30K tables that need to implement the above scenario.
>>
>>
>> which means that I would have to drag an ExecuteSQL element onto the
>> canvas 30K times or so, and also arrange them in a nice-looking way on
>> my little 21-inch screen.
>>
>>
>> Just wondering if there is a table-list-like, groupable and searchable
>> task control and monitoring feature for NiFi
>>
>>
>>
>> Thank you very much  in advance
>>
>>
>>
>> Yan Liu
>>
>> Hortonworks Service Division
>>
>> Richinfo, Shenzhen, China (PR)
>>
>> 06/03/2016
>>
>>
>>
>>
>>
>>
>
>
>
>
>
>
>

Re:Re: Multiple dataflow jobs management(lots of jobs)

Posted by 刘岩 <li...@richinfo.cn>.
Hi Aldrin,

Some additional information. It's a typical ETL offloading use case:

Each extraction job should focus on 1 table and 1 table only. Data will be written to HDFS; this is similar to database staging.

The reason we need to focus on 1 table per job is that a database error or disconnection might occur during the extraction; if it runs as a script-like extraction job driven by expression language, it is hard to re-run or skip just that table or those tables.

Once the extraction is done, a trigger-like action will do the data cleansing. This is similar to the ODS layer of data warehousing.

If the data has passed the quality check, it will be marked as cleaned; otherwise, it will return to the previous step and redo the data extraction, or send an alert/email to the system administrator.

If a certain number of tables are all cleaned and checked, it will call some transforming processor to do the transformation, then push the data into a data warehouse (Hive in our case).

Thank you very much

Yan Liu

Hortonworks Service Division 

Richinfo, Shenzhen, China (PR)

13/03/2016

----Original Message----
From: "刘岩" <li...@richinfo.cn>
To: users <us...@nifi.apache.org>
Cc: dev <de...@nifi.apache.org>
Sent: 2016-03-13 00:12:27
Subject: Re:Re: Multiple dataflow jobs management(lots of jobs)

Hi Aldrin,

Currently we need to extract 60K tables per day, and the time window is limited to 8 hours. This means that we need to run jobs concurrently, and we need a general overview of what's going on with all those 60K job flows so we can take further action.

We have tried Kettle and Talend. Talend is IDE-based, so not what we are looking for, and Kettle crashed because MySQL could not handle Kettle's metadata with 10K jobs.

So we want to use NiFi; this is really the product that we are looking for, but the missing piece here is a dataflow jobs admin page, so we can have multiple NiFi instances running on different nodes but monitor the jobs on one page. If it can integrate with the Ambari Metrics API, then we can develop an Ambari View for NiFi jobs monitoring, just like the HDFS View and Hive View.

Thank you very much

Yan Liu

Hortonworks Service Division 

Richinfo, Shenzhen, China (PR)

06/03/2016

----Original Message----
From: Aldrin Piri <al...@gmail.com>
To: users <us...@nifi.apache.org>
Cc: dev <de...@nifi.apache.org>
Sent: 2016-03-11 02:27:11
Subject: Re: Mutiple dataflow jobs management(lots of jobs)

Hi Yan,
We can get more into details and particulars if needed, but have you experimented with expression language?  I could see a Cron driven approach which covers your periodic efforts that feeds some number of ExecuteSQL processors (perhaps one for each database you are communicating with) each having a table.  This would certainly cut down on the need for the 30k processors on a one-to-one basis with a given processor.

In terms of monitoring the dataflows, could you describe what else you are searching for beyond the graph view?  NiFi tries to provide context for the flow of data but is not trying to be a sole monitoring, we can give information on a processor basis, but do not delve into specifics.  There is a summary view for the overall flow where you can monitor stats about the components and connections in the system. We support interoperation with monitoring systems via push (ReportingTask) and pull (REST API [2]) semantics. 

Any other details beyond your list of how this all interoperates might shed some more light on what you are trying to accomplish.  It seems like NiFi should be able to help with this.  With some additional information we may be able to provide further guidance or at least get some insights on use cases we could look to improve upon and extend NiFi to support.

Thanks!


[1] http://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html
[2] http://nifi.apache.org/docs/nifi-docs/html/developer-guide.html#reporting-tasks
[3] http://nifi.apache.org/docs/nifi-docs/rest-api/index.html



On Sat, Mar 5, 2016 at 9:25 PM, 刘岩 <li...@richinfo.cn> wrote:

Hi All,



    I'm trying to adopt NiFi for production use but cannot find an admin console for monitoring the dataflows.



   The scenario is simple:



   1.  We gather data from Oracle databases to HDFS and then to Hive.

   2.  Residuals/incrementals are updated daily or monthly via NiFi.

   3.  Full dumps of some tables are executed daily or monthly via NiFi.



    It is really simple; however, we have 7 Oracle databases with over 30K tables that need to implement the above scenario.



This means that I would have to drag an ExecuteSQL element onto the canvas 30K times or so, and also arrange them in a nice-looking way on my little 21-inch screen.



Just wondering if there is a table-list-like, groupable and searchable task control and monitoring feature for NiFi.





Thank you very much  in advance





Yan Liu 

Hortonworks Service Division 

Richinfo, Shenzhen, China (PR)

06/03/2016












