Posted to users@nifi.apache.org by Arnaud G <gr...@gmail.com> on 2017/04/05 12:45:07 UTC

GetHDFS and triggering

Hi,

I'm currently building a flow in NiFi and I'm trying to find the best way to
do it in a reliable manner:

The setup is the following:

1) Some files are copied into a folder in HDFS
2) A Hive external table points to this directory
3) The data from this table are then copied into an ORC table
4) The files in the folder are archived and compressed into another folder

My first issue is that I cannot easily trigger an INSERT SQL query from
NiFi. The ExecuteSQL processor only executes SELECT queries, not INSERT
queries. I could of course SELECT all the data, bring it back into NiFi and
then use PutSQL, but as the data are copied as-is, that doesn't add any
value.
My current solution is to rely on an external Python script (using JDBC
from there) and use ExecuteStreamCommand to trigger the insert from the
external table. It is not very elegant, but it seems to work.
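
For reference, the script boils down to something like the rough sketch
below; the host, credentials, jar path and table names are placeholders,
and JayDeBeApi is only one of several ways to reach Hive over JDBC from
Python:

#!/usr/bin/env python
# Rough sketch of the insert script invoked by ExecuteStreamCommand.
# Assumes the JayDeBeApi package and the Hive JDBC standalone jar; the
# host, credentials and table names below are placeholders.
import sys
import jaydebeapi

def main():
    conn = jaydebeapi.connect(
        "org.apache.hive.jdbc.HiveDriver",
        "jdbc:hive2://hive-server.example.com:10000/default",
        ["nifi", "secret"],                    # user / password (placeholders)
        "/opt/jdbc/hive-jdbc-standalone.jar",  # placeholder driver jar path
    )
    try:
        cur = conn.cursor()
        # Copy everything from the external (landing) table into the ORC table.
        cur.execute("INSERT INTO TABLE events_orc SELECT * FROM events_external")
    except Exception as exc:
        # Exit non-zero so the failure shows up in the execution.status
        # attribute that ExecuteStreamCommand puts on the flowfile.
        sys.stderr.write(str(exc) + "\n")
        sys.exit(1)
    finally:
        conn.close()

if __name__ == "__main__":
    main()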

Now I have to ensure that the SQL query succeeds before moving the files
to another folder, otherwise I will end up with inconsistent data.
I'm currently using GetHDFS/PutHDFS to move files around; however, it is
not possible to trigger the GetHDFS processor.

What would be the best strategy to move the HDFS files only if a previous
event is successful? Any recommendations?

Thanks for your help!

Regards,

Re: GetHDFS and triggering

Posted by Arnaud G <gr...@gmail.com>.
Hi Matt,

Thank you very much for your time and detailed answer.

I will try to explain the use case a little bit more, as I suppose it
should be close to standard patterns.

My current use case involves extracting data from a Web API. This data is
not as clean as we would like, but we have to cope with that. The data are
then stored as CSV files on HDFS, and an external Hive table points to the
directory. As the data contain many duplicates/errors that need to be
cleaned up before ingestion, we have a PrestoDB view on this table that
provides a clean data set.

The goal now is to select the data from this table and insert them into
another Hive/Presto table. For this I still need to join the data from the
landing table with the data in the destination table to ensure that I'm not
inserting duplicate records. I do this with Presto. Once done, I need to
move the files from the landing folder to another folder to empty/truncate
the external table. This is where my problem resides, as I'm struggling to
find an elegant way to trigger this file move only if the SQL process is
successful.
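
To make the ordering concrete, the whole step amounts to something like the
rough sketch below. The presto-python-client package, host, table names and
the join key are placeholders, and the real query is more involved; the
point is only that the HDFS move must happen after the INSERT has succeeded:

# Sketch: run the deduplicating INSERT through Presto, then move the landing
# files only if it succeeded. Host, catalog, schema, table names and the join
# key are placeholders; uses the presto-python-client package.
import subprocess
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com", port=8080,
    user="nifi", catalog="hive", schema="default",
)
cur = conn.cursor()

# Anti-join against the destination table so already-loaded rows are skipped.
cur.execute("""
    INSERT INTO events_clean
    SELECT l.*
    FROM events_landing l
    LEFT JOIN events_clean d ON l.event_id = d.event_id
    WHERE d.event_id IS NULL
""")
cur.fetchall()  # drive the query to completion; raises if the INSERT failed

# Only reached when the INSERT succeeded: archive the landing files, which
# effectively empties the external table.
subprocess.run(
    ["hdfs", "dfs", "-mv", "/data/landing/*", "/data/archive/"],
    check=True,
)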

I had a look at ListHDFS, but this processor, like GetHDFS, cannot be
triggered (except by cron, of course). As I need to accumulate enough data
in this table to correlate the records, I cannot have a flow that processes
each file individually, which rules out FetchHDFS. The direct conversion to
ORC does not work either, as I have to filter/deduplicate a lot of
information from the raw API data.

The last idea we had was to insert line by line into a landing table that
would act like the current folder, but with more flexibility, as we wouldn't
have to take care of the CSV files. Unfortunately this doesn't work with
PutHiveQL, as we have around 10,000-100,000 rows to insert per minute and
the PutHiveQL processor cannot keep up (we get 1-2 inserts per second). We
tried the PutHiveStreaming processor, but despite our efforts we were
unable to make it work with Kerberos on HDP 2.5 (the NiFi processor seems
to require a "hive-metastore" principal that we don't have, and when we
created it, we still ran into Kerberos issues). Our last test was to use the
Presto Teradata JDBC driver with the PutSQL processor, but it doesn't work,
as this driver runs in auto-commit mode, which is incompatible with the
processor.

I'm currently trying to imagine a better flow that works around these
limitations, and will maybe try to use a SQL database as a buffer instead.
This should give me a better way to control when to truncate the table. Of
course, any thoughts/recommendations are appreciated.

Thanks!

On Wed, Apr 5, 2017 at 10:16 PM, Matt Burgess <ma...@apache.org> wrote:

> Arnaud,
>
> Can you explain more about what you'd like to do via an INSERT query?
> Are you trying to accomplish #3 using Hive via JDBC?  If so you should
> be able to use PutHiveQL rather than PutSQL. If you already have an
> external table in Hive and don't yet have the ORC table, you should be
> able to use a CREATE TABLE AS (CTAS) statement [1] in PutHiveQL.  If
> the ORC table exists and you want to insert from the external table,
> you can use INSERT INTO/OVERWRITE [2].  Apologies if I misunderstood
> what you are trying to do, if that's the case can you please
> elaborate?
>
> Per your comment that you can't trigger GetHDFS, consider using
> ListHDFS [3] and/or FetchHDFS [4] instead. If you know which files you
> want (from the flow), you don't need ListHDFS, rather you'd just set
> the filename attribute on the flow and route it to FetchHDFS.  Having
> said that, if you are already pulling the content of the HDFS files
> into NiFi, perhaps consider the ConvertAvroToORC [5] processor (if you
> can easily get your incoming data into Avro). This would allow you to
> convert to ORC within NiFi, then you can use PutHDFS to land the files
> on Hadoop, then PutHiveQL to create a table on top of the directory
> containing the ORC files.  If that is overkill, hopefully the
> PutHiveQL with the CTAS or INSERT statements will suffice.
>
> Regards,
> Matt
>
> [1] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#
> LanguageManualDDL-CreateTableAsSelect(CTAS)
> [2] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#
> LanguageManualDML-InsertingdataintoHiveTablesfromqueries
> [3] https://nifi.apache.org/docs/nifi-docs/components/org.
> apache.nifi.processors.hadoop.ListHDFS/index.html
> [4] https://nifi.apache.org/docs/nifi-docs/components/org.
> apache.nifi.processors.hadoop.FetchHDFS/index.html
> [5] https://nifi.apache.org/docs/nifi-docs/components/org.
> apache.nifi.processors.hive.ConvertAvroToORC/index.html
>
>
> On Wed, Apr 5, 2017 at 8:45 AM, Arnaud G <gr...@gmail.com> wrote:
> > Hi,
> >
> > I'm currently building a flow in Nifi and I'm trying to get the best way
> to
> > do it in a reliable manner:
> >
> > The setup is the following:
> >
> > 1) Some files are copied in a folder in HDFS
> > 2) An Hive external table point to this directory
> > 3) The data of this table are then copied in an ORC table
> > 4) The data from the folder are archived and compress in another folder
> >
> > My first issue is that I cannot easily trigger an Insert SQL query from
> > Nifi. ExecuteSQL processor only execute SELECT query and not INSERT
> query. I
> > can of course Select all the data and bring them back in Nifi and then
> use a
> > PutSQL but as the data are going to be copied as is, it doesn't bring any
> > value.
> > My current solution is to rely on an external python script (using JDBC
> from
> > there) and use the ExecuteStreamCommand to trigger the insert from the
> > external table. It is not very elegant but it seems to work.
> >
> > Now I have to ensure that the SQL query is successful before moving the
> file
> > to an other folder, otherwise I will end up with inconsistent data. I'm
> > currently using the GetHDFS/PutHDFS to move file around however it is not
> > possible to trigger the GetHDFS processor.
> >
> > What will be the best strategy to move the HDFS file only if a previous
> > event is successful? Any recommendation?
> >
> > Thanks for your help!
> >
> > Regards,

Re: GetHDFS and triggering

Posted by Matt Burgess <ma...@apache.org>.
Arnaud,

Can you explain more about what you'd like to do via an INSERT query?
Are you trying to accomplish #3 using Hive via JDBC?  If so you should
be able to use PutHiveQL rather than PutSQL. If you already have an
external table in Hive and don't yet have the ORC table, you should be
able to use a CREATE TABLE AS (CTAS) statement [1] in PutHiveQL.  If
the ORC table exists and you want to insert from the external table,
you can use INSERT INTO/OVERWRITE [2].  Apologies if I misunderstood
what you are trying to do; if that's the case, could you please
elaborate?
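
For illustration, the two statement shapes would look roughly like this
(table names and the STORED AS clause are made up; in NiFi, each statement
is simply the text content of the flowfile routed to PutHiveQL):

# Sketch of the two HiveQL statement shapes, held here as Python string
# constants; each string is what a flowfile sent to PutHiveQL would contain.
# Table and column names are placeholders.

# CTAS: create the ORC table directly from the external (CSV-backed) table.
CTAS = """
CREATE TABLE events_orc
STORED AS ORC
AS SELECT * FROM events_external
"""

# If the ORC table already exists, append (INTO) or replace (OVERWRITE) its
# contents from the external table instead.
INSERT_FROM_EXTERNAL = """
INSERT OVERWRITE TABLE events_orc
SELECT * FROM events_external
"""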

Per your comment that you can't trigger GetHDFS, consider using
ListHDFS [3] and/or FetchHDFS [4] instead. If you know which files you
want (from the flow), you don't need ListHDFS, rather you'd just set
the filename attribute on the flow and route it to FetchHDFS.  Having
said that, if you are already pulling the content of the HDFS files
into NiFi, perhaps consider the ConvertAvroToORC [5] processor (if you
can easily get your incoming data into Avro). This would allow you to
convert to ORC within NiFi, then you can use PutHDFS to land the files
on Hadoop, then PutHiveQL to create a table on top of the directory
containing the ORC files.  If that is overkill, hopefully the
PutHiveQL with the CTAS or INSERT statements will suffice.

Regards,
Matt

[1] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableAsSelect(CTAS)
[2] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries
[3] https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.hadoop.ListHDFS/index.html
[4] https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.hadoop.FetchHDFS/index.html
[5] https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.hive.ConvertAvroToORC/index.html


On Wed, Apr 5, 2017 at 8:45 AM, Arnaud G <gr...@gmail.com> wrote:
> Hi,
>
> I'm currently building a flow in Nifi and I'm trying to get the best way to
> do it in a reliable manner:
>
> The setup is the following:
>
> 1) Some files are copied in a folder in HDFS
> 2) An Hive external table point to this directory
> 3) The data of this table are then copied in an ORC table
> 4) The data from the folder are archived and compress in another folder
>
> My first issue is that I cannot easily trigger an Insert SQL query from
> Nifi. ExecuteSQL processor only execute SELECT query and not INSERT query. I
> can of course Select all the data and bring them back in Nifi and then use a
> PutSQL but as the data are going to be copied as is, it doesn't bring any
> value.
> My current solution is to rely on an external python script (using JDBC from
> there) and use the ExecuteStreamCommand to trigger the insert from the
> external table. It is not very elegant but it seems to work.
>
> Now I have to ensure that the SQL query is successful before moving the file
> to an other folder, otherwise I will end up with inconsistent data. I'm
> currently using the GetHDFS/PutHDFS to move file around however it is not
> possible to trigger the GetHDFS processor.
>
> What will be the best strategy to move the HDFS file only if a previous
> event is successful? Any recommendation?
>
> Thanks for your help!
>
> Regards,