Posted to dev@metron.apache.org by Dima Kovalyov <Di...@sstech.us> on 2016/12/21 23:28:28 UTC

Long-term storage for enriched data

Hello,

We are currently researching a fast and resource-efficient way to save
enriched data in Hive for further analytics.

There are two scenarios we are considering:
a) Use an Oozie Java job that uses the Metron enrichment classes to
"manually" enrich each line of the source data picked up from the source
directory (this is something we have already developed and are using).
Downside: custom code built on top of the Metron source code.

b) Use NiFi to listen to the indexing Kafka topic -> split the stream by
source type -> put every source type into a corresponding Hive table.

I wonder if anyone has gone in either of these directions and whether
there are best practices for this? Please advise.
Thank you.

- Dima

Re: Long-term storage for enriched data

Posted by "Zeolla@GMail.com" <ze...@gmail.com>.
I don't recall a conversation on that product specifically, but I've
definitely brought up the need to search HDFS from time to time.  Things
like Spark SQL, Hive, and Oozie have been discussed, but Avro is new to me;
I'll have to look into it.  Are you able to summarize its benefits?

Jon

Re: Long-term storage for enriched data

Posted by Kyle Richardson <ky...@gmail.com>.
This thread got me thinking... there are likely a fair number of use cases
for searching and analyzing the output stored in HDFS. Dima's use case is
certainly one. Has there been any discussion on the use of Avro to store
the output in HDFS? This would likely require an expansion of the current
JSON schema.

-Kyle
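
For reference, Avro's usual selling points are a compact, splittable binary
encoding, the schema traveling with the data, and well-defined schema
evolution rules. Below is a minimal sketch of what an Avro record for
Metron's enriched output might look like, parsed from Scala; the field list
is hypothetical, and a real schema would need to cover each sensor's full
enriched output.

    import org.apache.avro.Schema

    object EnrichedAvroSchema {
      // Hypothetical field list. Avro names may not contain dots, so
      // Metron's "source.type" field would have to become e.g. "source_type".
      val schemaJson: String =
        """{
          |  "type": "record",
          |  "name": "EnrichedTelemetry",
          |  "namespace": "org.apache.metron",
          |  "fields": [
          |    {"name": "source_type", "type": "string"},
          |    {"name": "timestamp",   "type": "long"},
          |    {"name": "ip_src_addr", "type": ["null", "string"], "default": null},
          |    {"name": "ip_dst_addr", "type": ["null", "string"], "default": null}
          |  ]
          |}""".stripMargin

      // Parse and validate the schema at load time.
      val schema: Schema = new Schema.Parser().parse(schemaJson)
    }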

Re: Long-term storage for enriched data

Posted by Casey Stella <ce...@gmail.com>.
Oozie (or something like it) would appear to me to be the correct tool
here.  You are likely moving files around and pinning up Hive tables:

   - Moving the data written in HDFS from /apps/metron/enrichment/${sensor}
   to another directory in HDFS
   - Running a job in Hive, Pig, or Spark to take the JSON blobs, map them
   to rows, and pin them up as an ORC table for downstream analytics

NiFi is mostly about getting data into the cluster, not really for scheduling
large-scale batch ETL, I think.

Casey
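
For illustration, a minimal Spark sketch of the second step above, assuming
Spark 2.x with Hive support; the "bro" sensor and the target table name are
placeholders, and the input path follows the example above.

    import org.apache.spark.sql.SparkSession

    object EnrichmentJsonToOrc {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("metron-enrichment-to-orc")
          .enableHiveSupport()
          .getOrCreate()

        // Metron's HDFS writer emits one enriched JSON message per line;
        // Spark infers the columns from the JSON.
        val df = spark.read.json("hdfs:///apps/metron/enrichment/bro")

        // Pin the data up as an ORC-backed Hive table for analytics.
        df.write
          .format("orc")
          .mode("append")
          .saveAsTable("metron.bro_enriched")
      }
    }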

Re: Long-term storage for enriched data

Posted by Dima Kovalyov <Di...@sstech.us>.
Thank you for the reply, Carolyn.

Currently, for test purposes, we enrich flows with GeoIP and ThreatIntel
malware-IP data, but we plan to expand this further.

Our dev team is working on an Oozie job to process this. In the meantime,
I wonder if I could use NiFi for this purpose (because we are already
using it for data ingest and streaming).

Could you elaborate on why it may be overkill? The idea is to have
everything in one place instead of hacking into the Metron libraries and code.

- Dima

RE: Long-term storage for enriched data

Posted by Carolyn Duby <cd...@hortonworks.com>.
Hi Dima -

What type of analytics are you looking to do?  Is the normalized format not working?  You could use an Oozie or Spark job to create derivative tables.

NiFi may be overkill for breaking up the Kafka stream.  Spark Streaming may be easier.

Thanks
Carolyn
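
For illustration, a rough sketch of the Spark Streaming alternative (which
also matches Dima's scenario b): consume the indexing topic, group messages
by source type, and append each group to its own Hive table. This assumes
Spark 2.x with the spark-streaming-kafka-0-10 integration; the broker
address, group id, batch interval, and table naming are all placeholders.

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

    object IndexingTopicToHive {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("metron-indexing-to-hive")
        val ssc = new StreamingContext(conf, Seconds(60)) // placeholder interval

        val kafkaParams = Map[String, Object](
          "bootstrap.servers"  -> "kafka-broker:6667",   // placeholder broker
          "key.deserializer"   -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id"           -> "metron-hive-writer",  // placeholder group id
          "auto.offset.reset"  -> "latest"
        )

        // Metron publishes fully enriched messages to the "indexing" topic.
        val stream = KafkaUtils.createDirectStream[String, String](
          ssc, PreferConsistent,
          Subscribe[String, String](Seq("indexing"), kafkaParams))

        stream.map(_.value()).foreachRDD { rdd =>
          if (!rdd.isEmpty()) {
            val spark = SparkSession.builder()
              .config(rdd.sparkContext.getConf)
              .enableHiveSupport()
              .getOrCreate()

            // One JSON message per record; infer columns from the JSON.
            val df = spark.read.json(rdd)

            // Metron messages carry the sensor name in "source.type".
            val sensors = df.select(df("`source.type`")).distinct()
              .collect().map(_.getString(0))
            for (sensor <- sensors) {
              df.filter(df("`source.type`") === sensor)
                .write.format("orc").mode("append")
                .saveAsTable(s"metron.${sensor}_enriched") // placeholder naming
            }
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }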


