Posted to users@nifi.apache.org by Joe Trite <jo...@gmail.com> on 2018/07/02 14:05:43 UTC

Clustered Flow Execution Help

I have a question (or rather, need confirmation) about cluster execution. I
have a 3-node NiFi 1.6 cluster. My use case is extracting data from Hive and
depositing it into an RDBMS. Here is my flow:

1. SelectHiveQL - executes a "show partitions" command.
2. SplitText - splits the returned partitions (7) into individual flowFiles.
3. ExtractText - populates a 'partition_info' attribute.
4. UpdateAttribute - reformats 'partition_info' into SQL syntax.
5. SelectHiveQL - executes the "SELECT" against Hive with the provided
'partition_info' as the WHERE clause.
6. SplitAvro - chunks the data into bite-size pieces.
7. PutDatabaseRecord - INSERTs into the db.
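
As an illustration of step 4, here is a minimal Python sketch of the kind of reformatting described. It assumes 'partition_info' holds one line of "show partitions" output in the usual Hive form (key=value pairs separated by '/'). The function name and the choice to quote every value as a string are assumptions for illustration, not the actual UpdateAttribute expression used in this flow.

```python
def partition_to_where(partition_info: str) -> str:
    """Turn 'year=2018/month=07' into "year='2018' AND month='07'".

    Hypothetical helper: assumes values contain no quotes or escaped
    characters and that comparing them as strings is acceptable.
    """
    clauses = []
    for part in partition_info.strip().split("/"):
        # Split each 'column=value' pair at the first '=' only.
        column, _, value = part.partition("=")
        clauses.append(f"{column}='{value}'")
    return " AND ".join(clauses)


print(partition_to_where("year=2018/month=07"))
```

The resulting string is what would be appended after WHERE in the step 5 query.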

Processors 1-4 are set to 'Primary Node' only. 5-7 are set to 'All
Nodes'. All processors are set to 1 concurrent task.

The question is about what happens in step 5. I see the 7
'partition_info' flowFiles in the queue after step 4 completes, and they
seem to get executed one at a time in step 5, at least from watching the
queue drain. I would expect step 5 to execute on each of the 3 nodes, so
that I would see the queue drain in 3's. Is this assumption correct, and
do I perhaps have something misconfigured?

I do see in the provenance data that all 3 nodes did process a flowFile; I
am just expecting it to happen in parallel.

I did see this article about distribution but don't think it is required
for this use case to work:
https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html

Thanks
Joe

Re: Clustered Flow Execution Help

Posted by Joe Trite <jo...@gmail.com>.

Thanks, the flow is executing as expected now.

On Mon, Jul 2, 2018 at 10:09 AM, Matt Burgess <ma...@apache.org> wrote:

> Joe,
>
> Only the first (source) processor needs to be set to Primary Node
> Only. Once that happens, the flow files will only proceed down the
> flow on the primary node, so step 5 will also only run on the primary
> node. In order to redistribute the flow files among the cluster,
> you'll want a Remote Process Group to point back to an Input Port on
> your cluster, between steps 4 & 5. From that point on, the flow files
> will be distributed among the nodes and the downstream flow (steps
> 5-7) will run on all the nodes.
>
> Regards,
> Matt
>

Re: Clustered Flow Execution Help

Posted by "Ravi Papisetti (rpapiset)" <rp...@cisco.com>.

In my experience, if you set the first processor to run on "Primary Node", then the remaining flow that directly connects to it will run on that node, regardless of how the subsequent processors are configured to run.

If you really want to distribute the flow, insert an RPG (Remote Process Group) after the first step. It will distribute the load across the cluster, provided the subsequent processors are set to run in "All Nodes" mode.

Thanks,
Ravi Papisetti


Re: Clustered Flow Execution Help

Posted by Matt Burgess <ma...@apache.org>.

Joe,

Only the first (source) processor needs to be set to Primary Node
Only. Once that happens, the flow files will only proceed down the
flow on the primary node, so step 5 will also only run on the primary
node. In order to redistribute the flow files among the cluster,
you'll want a Remote Process Group to point back to an Input Port on
your cluster, between steps 4 & 5. From that point on, the flow files
will be distributed among the nodes and the downstream flow (steps
5-7) will run on all the nodes.

Regards,
Matt
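
To make the behavior described above concrete, here is a toy Python model of where the 7 flow files end up queued, with and without a redistribution step between steps 4 and 5. It is only a sketch: the node names are hypothetical, and the round-robin assignment is an assumption standing in for whatever distribution the Remote Process Group actually performs.

```python
from itertools import cycle

NODES = ["node1", "node2", "node3"]  # hypothetical node names
PRIMARY = "node1"                    # assume node1 is the primary node


def queue_without_rpg(flowfiles):
    """Flow files created on the primary node stay queued on that node."""
    queues = {node: [] for node in NODES}
    queues[PRIMARY].extend(flowfiles)
    return queues


def queue_with_rpg(flowfiles):
    """Spread flow files over all nodes (round-robin is an assumption)."""
    queues = {node: [] for node in NODES}
    for node, flowfile in zip(cycle(NODES), flowfiles):
        queues[node].append(flowfile)
    return queues


partitions = [f"partition_{i}" for i in range(1, 8)]  # the 7 flow files
before = queue_without_rpg(partitions)
after = queue_with_rpg(partitions)

print({node: len(q) for node, q in before.items()})
print({node: len(q) for node, q in after.items()})
```

In the first case, step 5 on the other two nodes has nothing in its local queue, so setting it to 'All Nodes' does not add parallelism. In the second case, each node's copy of step 5 has its own flow files to work on.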
