Posted to dev@metron.apache.org by Justin Leet <ju...@gmail.com> on 2017/01/16 14:17:11 UTC

[DISCUSS] Moving GeoIP management away from MySQL

Hi all,

As a bit of background, right now GeoIP data is loaded into and managed by
MySQL (the connectors are LGPL licensed, and we need to sever our Maven
dependency on them before the next release). We currently depend on and
install an instance of MySQL (in each of the Management Pack, Ansible, and
Docker installs). In the topology, we use the JDBCAdapter to connect to
MySQL and query for a given IP.  Additionally, MySQL is a single point of
failure for that particular enrichment right now: if MySQL is down, geo
enrichment can't occur.

I'm proposing that we eliminate the use of MySQL entirely, through all
installation paths (which, unless I missed some, includes Ansible, the
Ambari Management Pack, and Docker).  We'd do this by dropping all the
various MySQL setup and management through the code, along with all the
DDL, etc.  The JDBCAdapter would stay, so that anybody who wants to set up
their own databases for enrichments and install connectors is able to do so.

In its place, I've looked at using MapDB, a really easy-to-use library for
creating Java collections backed by a file (this is NOT a separate
installation of anything; it's just a jar that manages interaction with the
file system).  Given the slow churn of the GeoIP files (I believe they get
updated once a week), we can have a script that can be run when needed: it
downloads the MaxMind tar file, builds the MapDB file that will be used by
the bolts, and places it into HDFS.  Finally, we update a config to point to
the new file; the bolts get the updated config callback and can update their
db files.  Inside the code, we wrap the MapDB portions to make them
transparent to downstream code.
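As a rough illustration of the "wrap it so it's transparent downstream" and "swap on config callback" parts, here is a minimal sketch. The class and method names are made up, and a plain Map stands in for the MapDB-backed collection; the point is just the atomic-swap pattern:

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Downstream enrichment code only sees this wrapper; whether the map behind
// it is MapDB-backed or anything else stays hidden.
public class GeoDatabaseHolder {
    private final AtomicReference<Map<String, String>> current =
            new AtomicReference<>(Collections.emptyMap());

    // Called from the config-update callback: after the worker pulls the new
    // file from HDFS and opens it, swap it in atomically. Lookups in flight
    // keep reading the old map; new lookups see the new one.
    public void update(Map<String, String> freshlyLoaded) {
        current.set(freshlyLoaded);
    }

    public String lookup(String ip) {
        return current.get().get(ip);
    }
}
```

In a real implementation the value type would be the full geo record rather than a String, and update() would be driven by the config callback described above.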

The particularly nice parts about using MapDB are its ease of use and that
it offers, out of the box, the utilities we need to support the operations
required here (keep in mind the GeoIP files use IP ranges, and we need to be
able to easily grab the appropriate range).
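To make the range requirement concrete: a sorted map keyed by range start gives a cheap "which range contains this IP" lookup. Here is a minimal sketch using java.util.TreeMap as a stand-in for MapDB's BTree-backed map (the data and names are invented for illustration):

```java
import java.util.Map;
import java.util.TreeMap;

public class RangeLookupSketch {
    // Each entry maps a range's start IP (as a long) to its end IP and a label.
    static final class GeoRange {
        final long endIp;
        final String location;
        GeoRange(long endIp, String location) { this.endIp = endIp; this.location = location; }
    }

    // floorEntry finds the last range starting at or below the IP; we then
    // check the range's end to confirm the IP actually falls inside it.
    static String lookup(TreeMap<Long, GeoRange> ranges, long ip) {
        Map.Entry<Long, GeoRange> e = ranges.floorEntry(ip);
        return (e != null && ip <= e.getValue().endIp) ? e.getValue().location : null;
    }

    static long ipToLong(String ip) {
        long v = 0;
        for (String octet : ip.split("\\.")) v = (v << 8) | Integer.parseInt(octet);
        return v;
    }

    public static void main(String[] args) {
        TreeMap<Long, GeoRange> ranges = new TreeMap<>();
        ranges.put(ipToLong("10.0.0.0"), new GeoRange(ipToLong("10.0.0.255"), "SiteA"));
        ranges.put(ipToLong("192.168.1.0"), new GeoRange(ipToLong("192.168.1.255"), "SiteB"));

        System.out.println(lookup(ranges, ipToLong("192.168.1.42"))); // SiteB
        System.out.println(lookup(ranges, ipToLong("172.16.0.1")));   // null (no covering range)
    }
}
```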

The main point of concern I have about this is that when we grab the HDFS
file during an update, given that multiple JVMs can be running, we don't
want them to clobber each other.  I believe this can be avoided by simply
using each worker's working directory to store the file (and appropriately
ensuring that threads on the same JVM coordinate their access).  This should
keep the JVMs (and the underlying DB files) entirely independent.
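One simple way to guarantee that per-JVM independence, as a hypothetical sketch (the file name and layout are invented):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class WorkerLocalDbPath {
    // Combine the worker's working directory with a process-specific id so
    // two worker JVMs on the same host never share (and never clobber) a db
    // file; threads inside one JVM then coordinate on this single path.
    public static Path localDbFile(String workingDir, long processId) {
        return Paths.get(workingDir, "geoip-" + processId + ".db");
    }

    public static void main(String[] args) {
        long pid = ProcessHandle.current().pid(); // Java 9+
        Path db = localDbFile(System.getProperty("user.dir"), pid);
        // A real worker would copy the new HDFS file to this path, then
        // reopen its MapDB instance against it.
        System.out.println("This worker's private db file: " + db);
    }
}
```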

This script would get called by the various installations during startup to
do the initial setup.  After install, it can then be called on demand
whenever the data needs to be refreshed.

At this point, we should be all set, with everything running and updatable.

Justin

Re: [DISCUSS] Moving GeoIP management away from MySQL

Posted by "Zeolla@GMail.com" <ze...@gmail.com>.
Just to make sure it's obvious, there is already a JIRA for this (METRON-283
<https://issues.apache.org/jira/browse/METRON-283>), however it lacks the
detail being discussed here, so this is good.  I'm also personally
interested in this migration away from MySQL and would support using MapDB
(or similar) for reasons previously outlined.

Jon

On Mon, Jan 16, 2017 at 11:02 AM Casey Stella <ce...@gmail.com> wrote:

> I think that it's a sensible thing to use MapDB for the geo enrichment.
> Let me state my reasoning:
>
>    - An HBase implementation would necessitate an HBase scan, possibly
>    hitting HDFS, which is expensive per-message.
>    - An HBase implementation would necessitate a network hop and MapDB
>    would not.
>
> I also think this might be the beginning of a more general purpose support
> in Stellar for locally shipped, read-only MapDB lookups, which might be
> interesting.
>
> In short, all quotes about premature optimization are sure to apply to my
> reasoning, but I can't help but have my spidey senses tingle when we
> introduce a scan-per-message architecture.
>
> Casey
>
> On Mon, Jan 16, 2017 at 10:53 AM, Dima Kovalyov <Di...@sstech.us>
> wrote:
>
> > Hello Justin,
> >
> > Considering that Metron uses HBase tables for storing enrichment and
> > threatintel feeds, can we use HBase for geo enrichment as well?
> > Or could MapDB be used for enrichment and threatintel feeds instead of
> > HBase?
> >
> > - Dima
-- 

Jon

Sent from my mobile device

Re: [DISCUSS] Moving GeoIP management away from MySQL

Posted by "Zeolla@GMail.com" <ze...@gmail.com>.
Re: extensibility - I am one of those enterprise users who plan to do
enrichment using their IPAM data in the next couple of months.  However,
since the information I have is in a much different format compared to
MaxMind's, my approach was going to be to build a completely separate HBase
enricher.  That also makes it easier for me to upgrade my Metron cluster in
the future, as I would not be customizing a built-in.

That said, I'm game for a follow-on enhancement, but for now this should
probably just be a replacement of what currently exists.

Jon

On Mon, Jan 16, 2017 at 12:15 PM Justin Leet <ju...@gmail.com> wrote:

> I definitely agree on checking out the MaxMind API.  I'll take a look at
> it, but at first glance it looks like it does include everything we use.
> Great find, JJ.
>
> More details on various people's points:
>
> As a note to anyone hopping in, Simon's point on the range lookup vs a key
> lookup is why it becomes a Scan in HBase vs a Get.  As an addendum to what
> Simon mentioned, denormalizing is easy enough and turns it into an easy
> range lookup.
>
> To David's point, MapDB does require a network hop, but it's once per
> refresh of the data (Got a relevant callback? Grab new data, load it, swap
> out) instead of (up to) once per message.  I would expect the same to be
> true of the MaxMind db files.
>
> I'd also argue MapDB is not really more complex than refreshing the HBase
> table, because we potentially have to start worrying about things like
> hashing and/or indices and even just general data representation. It's
> definitely correct that the file processing has to occur on either path, so
> it really boils down to handling the callback and reloading the file vs
> handling some of the standard HBasey things.  I don't think either is an
> enormous amount of work (and both are almost certainly more work than
> MaxMind's API)
>
> Regarding extensibility, I'd argue for parity with what we have first, then
> build what we need from there.  Does anybody have any disagreement with
> that approach for right now?
>
> Justin
>
> On Mon, Jan 16, 2017 at 12:04 PM, David Lyle <dl...@gmail.com> wrote:
>
> > It is interesting- save us a ton of effort, and has the right license. I
> > think it's worth at least checking out.
> >
> > -D...
> >
> >
> > On Mon, Jan 16, 2017 at 12:00 PM, Simon Elliston Ball <
> > simon@simonellistonball.com> wrote:
> >
> > > I like that approach even more. That way we would only have to worry
> > > about distributing the database file in binary format to all the
> > > supervisor nodes on update.
> > >
> > > It would also make it easier for people to switch to the enterprise DB
> > > potentially if they had the license.
> > >
> > > One slight issue with this might be for people who wanted to extend the
> > > database. For example, organisations may want to add geo-enrichment to
> > > their own private network addresses based on modified versions of the
> > > geo database. Currently we don’t really allow this, since we hard-code
> > > ignoring private network classes into the geo enrichment adapter, but I
> > > can see a case where a global org might want to add their own ranges and
> > > locations to the data set. Does that make sense to anyone else?
> > >
> > > Simon
> > >
> > >
> > > > On 16 Jan 2017, at 16:50, JJ Meyer <jj...@gmail.com> wrote:
> > > >
> > > > Hello all,
> > > >
> > > > Can we leverage MaxMind's Java client (
> > > > https://github.com/maxmind/GeoIP2-java/tree/master/src/main/java/com/maxmind/geoip2)
> > > > in this case? I believe it can read the MaxMind file directly. Plus I
> > > > think it also has some support for caching as well.
> > > >
> > > > Thanks,
> > > > JJ
> > > >
> > > > On Mon, Jan 16, 2017 at 10:32 AM, Simon Elliston Ball <
> > > > simon@simonellistonball.com> wrote:
> > > >
> > > >> I like the idea of MapDB, since we can essentially pull an instance
> > > >> into each supervisor, so it makes a lot of sense for relatively small
> > > >> scale, relatively static enrichments in general.
> > > >>
> > > >> Generally this feels like a caching problem, and would be for a simple
> > > >> key-value lookup. In that case I would agree with David Lyle on using
> > > >> HBase as a source of truth and relying on caching.
> > > >>
> > > >> That said, GeoIP is a different lookup pattern, since it’s a range
> > > >> lookup then a key lookup (or, if we denormalize the MaxMind data,
> > > >> just a range lookup). For that kind of thing, MapDB with something
> > > >> like the BTree seems a good fit.
> > > >>
> > > >> Simon
> > > >>
> > > >>
> > > >>> On 16 Jan 2017, at 16:28, David Lyle <dl...@gmail.com> wrote:
> > > >>>
> > > >>> I'm +1 on removing the MySQL dependency, BUT - I'd prefer to see it
> > > >>> as an HBase enrichment. If our current caching isn't enough to
> > > >>> mitigate the above issues, we have a problem, don't we? Or do we not
> > > >>> recommend HBase enrichment for per message enrichment in general?
> > > >>>
> > > >>> Also- can you elaborate on how MapDB would not require a network hop?
> > > >>> Doesn't this mean we would have to sync the enrichment data to each
> > > >>> Storm supervisor? HDFS could (probably would) have a network hop too,
> > > >>> no?
> > > >>>
> > > >>> Fwiw -
> > > >>> "In its place, I've looked at using MapDB, which is a really easy
> to
> > > use
> > > >>> library for creating Java collections backed by a file (This is
> NOT a
> > > >>> separate installation of anything, it's just a jar that manages
> > > >> interaction
> > > >>> with the file system).  Given the slow churn of the GeoIP files (I
> > > >> believe
> > > >>> they get updated once a week), we can have a script that can be run
> > > when
> > > >>> needed, downloads the MaxMind tar file, builds the MapDB file that
> > will
> > > >> be
> > > >>> used by the bolts, and places it into HDFS.  Finally, we update a
> > > config
> > > >> to
> > > >>> point to the new file, the bolts get the updated config callback
> and
> > > can
> > > >>> update their db files.  Inside the code, we wrap the MapDB portions
> > to
> > > >> make
> > > >>> it transparent to downstream code."
> > > >>>
> > > >>> Seems a bit more complex than "refresh the HBase table". Afaik,
> > > >>> either approach would require some sort of translation between GeoIP
> > > >>> source format and target format, so that part is a wash imo.
> > > >>>
> > > >>> So, I'd really like to see, at least, an attempt to leverage HBase
> > > >>> enrichment.
> > > >>>
> > > >>> -D...

Re: [DISCUSS] Moving GeoIP management away from MySQL

Posted by Kyle Richardson <ky...@gmail.com>.
+1 Agree with David's order

-Kyle

On Mon, Jan 16, 2017 at 12:41 PM, David Lyle <dl...@gmail.com> wrote:

> Def agree on the parity point.
>
> I'm a little worried about Supervisor relocations for non-HBase solutions,
> but having much of the work done for us by MaxMind changes my preference to
> (in order)
>
> 1) MM API
> 2) HBase Enrichment
> 3) MapDB should the others prove not feasible
>
>
> -D...
>

Re: [DISCUSS] Moving GeoIP management away from MySQL

Posted by Casey Stella <ce...@gmail.com>.
Ok, catching back up to this thread.  Yes, MM sounds like a really good
choice!

On Mon, Jan 16, 2017 at 12:41 PM, David Lyle <dl...@gmail.com> wrote:

> Def agree on the parity point.
>
> I'm a little worried about Supervisor relocations for non-HBase solutions,
> but having much of the work done for us by MaxMind changes my preference to
> (in order)
>
> 1) MM API
> 2) HBase Enrichment
> 3) MapDB should the others prove not feasible
>
>
> -D...
>
>
> On Mon, Jan 16, 2017 at 12:15 PM, Justin Leet <ju...@gmail.com>
> wrote:
>
> > I definitely agree on checking out the MaxMind API.  I'll take a look at
> > it, but at first glance it looks like it does include everything we use.
> > Great find, JJ.
> >
> > More details on various people's points:
> >
> > As a note to anyone hopping in, Simon's point on the range lookup vs a key
> > lookup is why it becomes a Scan in HBase vs a Get.  As an addendum to what
> > Simon mentioned, denormalizing is easy enough and turns it into an easy
> > range lookup.
> >
> > To David's point, the MapDB does require a network hop, but it's once per
> > refresh of the data (Got a relevant callback? Grab new data, load it, swap
> > out) instead of (up to) once per message.  I would expect the same to be
> > true of the MaxMind db files.
> >
> > I'd also argue MapDB is not really more complex than refreshing the HBase
> > table, because we potentially have to start worrying about things like
> > hashing and/or indices and even just general data representation. It's
> > definitely correct that the file processing has to occur on either path, so
> > it really boils down to handling the callback and reloading the file vs
> > handling some of the standard HBasey things.  I don't think either is an
> > enormous amount of work (and both are almost certainly more work than
> > MaxMind's API.)
> >
> > Regarding extensibility, I'd argue for parity with what we have first, then
> > build what we need from there.  Does anybody have any disagreement with
> > that approach for right now?
> >
> > Justin
> >
> > On Mon, Jan 16, 2017 at 12:04 PM, David Lyle <dl...@gmail.com>
> wrote:
> >
> > > It is interesting- it would save us a ton of effort, and it has the right
> > > license. I think it's worth at least checking out.
> > >
> > > -D...
> > >
> > >
> > > On Mon, Jan 16, 2017 at 12:00 PM, Simon Elliston Ball <
> > > simon@simonellistonball.com> wrote:
> > >
> > > > I like that approach even more. That way we would only have to worry
> > > > about distributing the database file in binary format to all the
> > > > supervisor nodes on update.
> > > >
> > > > It would also make it easier for people to switch to the enterprise DB
> > > > potentially if they had the license.
> > > >
> > > > One slight issue with this might be for people who wanted to extend the
> > > > database. For example, organisations may want to add geo-enrichment to
> > > > their own private network addresses based on modified versions of the geo
> > > > database. Currently we don’t really allow this, since we hard-code
> > > > ignoring private network classes into the geo enrichment adapter, but I
> > > > can see a case where a global org might want to add their own ranges and
> > > > locations to the data set. Does that make sense to anyone else?
> > > >
> > > > Simon
> > > >
> > > >
> > > > > On 16 Jan 2017, at 16:50, JJ Meyer <jj...@gmail.com> wrote:
> > > > >
> > > > > Hello all,
> > > > >
> > > > > Can we leverage maxmind's Java client (
> > > > > https://github.com/maxmind/GeoIP2-java/tree/master/src/main/java/com/maxmind/geoip2)
> > > > > in this case? I believe it can directly read the maxmind file. Plus I
> > > > > think it also has some support for caching as well.
> > > > >
> > > > > Thanks,
> > > > > JJ
> > > > >
> > > > > On Mon, Jan 16, 2017 at 10:32 AM, Simon Elliston Ball <
> > > > > simon@simonellistonball.com> wrote:
> > > > >
> > > > >> I like the idea of MapDB, since we can essentially pull an instance
> > > > >> into each supervisor, so it makes a lot of sense for relatively small
> > > > >> scale, relatively static enrichments in general.
> > > > >>
> > > > >> Generally this feels like a caching problem, and would be for a simple
> > > > >> key-value lookup. In that case I would agree with David Lyle on using
> > > > >> HBase as a source of truth and relying on caching.
> > > > >>
> > > > >> That said, GeoIP is a different lookup pattern, since it’s a range
> > > > >> lookup then a key lookup (or if we denormalize the MaxMind data, just a
> > > > >> range lookup); for that kind of thing MapDB with something like the
> > > > >> BTree seems a good fit.
> > > > >>
> > > > >> Simon
> > > > >>
> > > > >>
> > > > >>> On 16 Jan 2017, at 16:28, David Lyle <dl...@gmail.com>
> wrote:
> > > > >>>
> > > > >>> I'm +1 on removing the MySQL dependency, BUT - I'd prefer to see it
> > > > >>> as an HBase enrichment. If our current caching isn't enough to
> > > > >>> mitigate the above issues, we have a problem, don't we? Or do we not
> > > > >>> recommend HBase enrichment for per message enrichment in general?
> > > > >>>
> > > > >>> Also- can you elaborate on how MapDB would not require a network hop?
> > > > >>> Doesn't this mean we would have to sync the enrichment data to each
> > > > >>> Storm supervisor? HDFS could (probably would) have a network hop too,
> > > > >>> no?
> > > > >>>
> > > > >>> Fwiw -
> > > > >>> "In its place, I've looked at using MapDB, which is a really easy to
> > > > >>> use library for creating Java collections backed by a file (This is
> > > > >>> NOT a separate installation of anything, it's just a jar that manages
> > > > >>> interaction with the file system).  Given the slow churn of the GeoIP
> > > > >>> files (I believe they get updated once a week), we can have a script
> > > > >>> that can be run when needed, downloads the MaxMind tar file, builds
> > > > >>> the MapDB file that will be used by the bolts, and places it into
> > > > >>> HDFS.  Finally, we update a config to point to the new file, the
> > > > >>> bolts get the updated config callback and can update their db files.
> > > > >>> Inside the code, we wrap the MapDB portions to make it transparent to
> > > > >>> downstream code."
> > > > >>>
> > > > >>> Seems a bit more complex than "refresh the hbase table". Afaik,
> > > > >>> either approach would require some sort of translation between GeoIP
> > > > >>> source format and target format, so that part is a wash imo.
> > > > >>>
> > > > >>> So, I'd really like to see, at least, an attempt to leverage HBase
> > > > >>> enrichment.
> > > > >>>
> > > > >>> -D...
> > > > >>>
> > > > >>>
> > > > >>> On Mon, Jan 16, 2017 at 11:02 AM, Casey Stella <
> cestella@gmail.com
> > >
> > > > >> wrote:
> > > > >>>
> > > > >>>> I think that it's a sensible thing to use MapDB for the geo
> > > > >>>> enrichment.  Let me state my reasoning:
> > > > >>>>
> > > > >>>>  - An HBase implementation would necessitate an HBase scan possibly
> > > > >>>>  hitting HDFS, which is expensive per-message.
> > > > >>>>  - An HBase implementation would necessitate a network hop and MapDB
> > > > >>>>  would not.
> > > > >>>>
> > > > >>>> I also think this might be the beginning of a more general purpose
> > > > >>>> support in Stellar for locally shipped, read-only MapDB lookups,
> > > > >>>> which might be interesting.
> > > > >>>>
> > > > >>>> In short, all quotes about premature optimization are sure to apply
> > > > >>>> to my reasoning, but I can't help but have my spidey senses tingle
> > > > >>>> when we introduce a scan-per-message architecture.
> > > > >>>>
> > > > >>>> Casey
> > > > >>>>
> > > > >>>> On Mon, Jan 16, 2017 at 10:53 AM, Dima Kovalyov <
> > > > >> Dima.Kovalyov@sstech.us>
> > > > >>>> wrote:
> > > > >>>>
> > > > >>>>> Hello Justin,
> > > > >>>>>
> > > > >>>>> Considering that Metron uses HBase tables for storing enrichment
> > > > >>>>> and threatintel feeds, can we use HBase for geo enrichment as well?
> > > > >>>>> Or can MapDB be used for enrichment and threatintel feeds instead
> > > > >>>>> of HBase?
> > > > >>>>>
> > > > >>>>> - Dima

Re: [DISCUSS] Moving GeoIP management away from MySQL

Posted by David Lyle <dl...@gmail.com>.
Def agree on the parity point.

I'm a little worried about Supervisor relocations for non-HBase solutions,
but having much of the work done for us by MaxMind changes my preference to
(in order)

1) MM API
2) HBase Enrichment
3) MapDB should the others prove not feasible


-D...



Re: [DISCUSS] Moving GeoIP management away from MySQL

Posted by Justin Leet <ju...@gmail.com>.
I definitely agree on checking out the MaxMind API.  I'll take a look at
it, but at first glance it looks like it does include everything we use.
Great find, JJ.
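
For anyone else evaluating it, here's a rough, untested sketch of what the
lookup side could look like with GeoIP2-java. The database path and sample IP
are placeholders, and the exact builder options should be checked against the
library's docs; this is just a sketch of the API, not Metron code:

```java
import com.maxmind.db.CHMCache;
import com.maxmind.geoip2.DatabaseReader;
import com.maxmind.geoip2.model.CityResponse;

import java.io.File;
import java.net.InetAddress;

public class GeoIp2Sketch {
    public static void main(String[] args) throws Exception {
        // Open the local MaxMind binary database; CHMCache caches decoded
        // nodes so repeated lookups avoid re-reading the file.
        try (DatabaseReader reader =
                new DatabaseReader.Builder(new File("/path/to/GeoLite2-City.mmdb")) // placeholder
                        .withCache(new CHMCache())
                        .build()) {
            CityResponse response = reader.city(InetAddress.getByName("8.8.8.8"));
            System.out.println(response.getCountry().getIsoCode());
            System.out.println(response.getCity().getName());
            System.out.println(response.getLocation().getLatitude()
                    + "," + response.getLocation().getLongitude());
        }
    }
}
```

The nice part is that the range-matching logic lives entirely inside the
library's binary-format reader, so we'd never touch it ourselves.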

More details on various people's points:

As a note to anyone hopping in, Simon's point on the range lookup vs a key
lookup is why it becomes a Scan in HBase vs a Get.  As an addendum to what
Simon mentioned, denormalizing is easy enough and turns it into an easy
range lookup.
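
To make the denormalized range lookup concrete, here's a minimal,
self-contained illustration (not Metron code) using a plain java.util.TreeMap
standing in for a MapDB BTreeMap; the ranges and location names are made up.
Key each range by its start, floorEntry the address, then check the range end:

```java
import java.util.AbstractMap;
import java.util.Map;
import java.util.TreeMap;

public class RangeLookup {

    // Denormalized ranges: key = range start, value = (range end, location).
    static final TreeMap<Long, Map.Entry<Long, String>> RANGES = new TreeMap<>();

    // Convert a dotted-quad IPv4 address to an unsigned 32-bit value so
    // ranges sort numerically.
    static long ipToLong(String ip) {
        long value = 0;
        for (String octet : ip.split("\\.")) {
            value = (value << 8) | Integer.parseInt(octet);
        }
        return value;
    }

    static String lookup(String ip) {
        long addr = ipToLong(ip);
        // floorEntry finds the range whose start is closest below the address;
        // then confirm the address actually falls inside that range.
        Map.Entry<Long, Map.Entry<Long, String>> candidate = RANGES.floorEntry(addr);
        if (candidate != null && addr <= candidate.getValue().getKey()) {
            return candidate.getValue().getValue();
        }
        return null;
    }

    public static void main(String[] args) {
        RANGES.put(ipToLong("10.0.0.0"),
                new AbstractMap.SimpleEntry<>(ipToLong("10.255.255.255"), "private"));
        RANGES.put(ipToLong("128.101.0.0"),
                new AbstractMap.SimpleEntry<>(ipToLong("128.101.255.255"), "Minneapolis"));

        System.out.println(lookup("128.101.101.101")); // prints "Minneapolis"
        System.out.println(lookup("11.0.0.1"));        // prints "null" (gap between ranges)
    }
}
```

A MapDB BTreeMap (or an HBase row-key scan, for that matter) supports the same
floor-style access pattern; only the backing store changes.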

To David's point, the MapDB does require a network hop, but it's once per
refresh of the data (Got a relevant callback? Grab new data, load it, swap
out) instead of (up to) once per message.  I would expect the same to be
true of the MaxMind db files.

I'd also argue MapDB is not really more complex than refreshing the HBase
table, because we potentially have to start worrying about things like
hashing and/or indices and even just general data representation. It's
definitely correct that the file processing has to occur on either path, so
it really boils down to handling the callback and reloading the file vs
handling some of the standard HBasey things.  I don't think either is an
enormous amount of work (and both are almost certainly more work than
MaxMind's API.)

Regarding extensibility, I'd argue for parity with what we have first, then
build what we need from there.  Does anybody have any disagreement with
that approach for right now?

Justin

On Mon, Jan 16, 2017 at 12:04 PM, David Lyle <dl...@gmail.com> wrote:

> It is interesting- save us a ton of effort, and has the right license. I
> think it's worth at least checking out.
>
> -D...
>
>
> On Mon, Jan 16, 2017 at 12:00 PM, Simon Elliston Ball <
> simon@simonellistonball.com> wrote:
>
> > I like that approach even more. That way we would only have to worry
> about
> > distributing the database file in binary format to all the supervisor
> nodes
> > on update.
> >
> > It would also make it easier for people to switch to the enterprise DB
> > potentially if they had the license.
> >
> > One slight issue with this might be for people who wanted to extend the
> > database. For example, organisations may want to add geo-enrichment to
> > their own private network addresses based modified versions of the geo
> > database. Currently we don’t really allow this, since we hard-code
> ignoring
> > private network classes into the geo enrichment adapter, but I can see a
> > case where a global org might want to add their own ranges and locations
> to
> > the data set. Does that make sense to anyone else?
> >
> > Simon
> >
> >
> > > On 16 Jan 2017, at 16:50, JJ Meyer <jj...@gmail.com> wrote:
> > >
> > > Hello all,
> > >
> > > Can we leverage maxmind's Java client (
> > > https://github.com/maxmind/GeoIP2-java/tree/master/src/
> > main/java/com/maxmind/geoip2)
> > > in this case? I believe it can directly read maxmind file. Plus I think
> > it
> > > also has some support for caching as well.
> > >
> > > Thanks,
> > > JJ
> > >
> > > On Mon, Jan 16, 2017 at 10:32 AM, Simon Elliston Ball <
> > > simon@simonellistonball.com> wrote:
> > >
> > >> I like the idea of MapDB, since we can essentially pull an instance
> into
> > >> each supervisor, so it makes a lot of sense for relatively small
> scale,
> > >> relatively static enrichments in general.
> > >>
> > >> Generally this feels like a caching problem, and would be for a simple
> > >> key-value lookup. In that case I would agree with David Lyle on using
> > HBase
> > >> as a source or truth and relying on caching.
> > >>
> > >> That said, GeoIP is a different lookup pattern, since it’s a range
> > lookup
> > >> then a key lookup (or if we denormalize the MaxMind data, just a range
> > >> lookup) for that kind of thing MapDB with something like the BTree
> > seems a
> > >> good fit.
> > >>
> > >> Simon
> > >>
> > >>
> > >>> On 16 Jan 2017, at 16:28, David Lyle <dl...@gmail.com> wrote:
> > >>>
> > >>> I'm +1 on removing the MySQL dependency, BUT - I'd prefer to see it
> as
> > an
> > >>> HBase enrichment. If our current caching isn't enough to mitigate the
> > >> above
> > >>> issues, we have a problem, don't we? Or do we not recommend HBase
> > >>> enrichment for per message enrichment in general?
> > >>>
> > >>> Also- can you elaborate on how MapDB would not require a network hop?
> > >>> Doesn't this mean we would have to sync the enrichment data to each
> > Storm
> > >>> supervisor? HDFS could (probably would) have a network hop too, no?
> > >>>
> > >>> Fwiw -
> > >>> "In its place, I've looked at using MapDB, which is a really easy to
> > use
> > >>> library for creating Java collections backed by a file (This is NOT a
> > >>> separate installation of anything, it's just a jar that manages
> > >> interaction
> > >>> with the file system).  Given the slow churn of the GeoIP files (I
> > >> believe
> > >>> they get updated once a week), we can have a script that can be run
> > when
> > >>> needed, downloads the MaxMind tar file, builds the MapDB file that
> will
> > >> be
> > >>> used by the bolts, and places it into HDFS.  Finally, we update a
> > config
> > >> to
> > >>> point to the new file, the bolts get the updated config callback and
> > can
> > >>> update their db files.  Inside the code, we wrap the MapDB portions
> to
> > >> make
> > >>> it transparent to downstream code."
> > >>>
> > >>> Seems a bit more complex than "refresh the hbase table". Afaik,
> either
> > >>> approach would require some sort of translation between GeoIP source
> > >> format
> > >>> and target format, so that part is a wash imo.
> > >>>
> > >>> So, I'd really like to see, at least, an attempt to leverage HBase
> > >>> enrichment.
> > >>>
> > >>> -D...
> > >>>
> > >>>
> > >>> On Mon, Jan 16, 2017 at 11:02 AM, Casey Stella <ce...@gmail.com>
> > >> wrote:
> > >>>
> > >>>> I think that it's a sensible thing to use MapDB for the geo
> > enrichment.
> > >>>> Let me state my reasoning:
> > >>>>
> > >>>>  - An HBase implementation  would necessitate a HBase scan possibly
> > >>>>  hitting HDFS, which is expensive per-message.
> > >>>>  - An HBase implementation would necessitate a network hop and MapDB
> > >>>>  would not.
> > >>>>
> > >>>> I also think this might be the beginning of a more general purpose
> > >> support
> > >>>> in Stellar for locally shipped, read-only MapDB lookups, which might
> > be
> > >>>> interesting.
> > >>>>
> > >>>> In short, all quotes about premature optimization are sure to apply
> to
> > >> my
> > >>>> reasoning, but I can't help but have my spidey senses tingle when we
> > >>>> introduce a scan-per-message architecture.
> > >>>>
> > >>>> Casey
> > >>>>
> > >>>> On Mon, Jan 16, 2017 at 10:53 AM, Dima Kovalyov <
> > >> Dima.Kovalyov@sstech.us>
> > >>>> wrote:
> > >>>>
> > >>>>> Hello Justin,
> > >>>>>
> > >>>>> Considering that Metron uses hbase tables for storing enrichment
> and
> > >>>>> threatintel feeds, can we use Hbase for geo enrichment as well?
> > >>>>> Or MapDB can be used for enrichment and threatintel feeds instead
> of
> > >>>> hbase?
> > >>>>>
> > >>>>> - Dima
> > >>>>>
> > >>>>
> > >>
> > >>
> >
> >
>

Re: [DISCUSS] Moving GeoIP management away from MySQL

Posted by David Lyle <dl...@gmail.com>.
It is interesting - saves us a ton of effort, and has the right license. I
think it's worth at least checking out.

-D...


On Mon, Jan 16, 2017 at 12:00 PM, Simon Elliston Ball <
simon@simonellistonball.com> wrote:

> I like that approach even more. That way we would only have to worry about
> distributing the database file in binary format to all the supervisor nodes
> on update.
>
> It would also make it easier for people to switch to the enterprise DB
> potentially if they had the license.
>
> One slight issue with this might be for people who wanted to extend the
> database. For example, organisations may want to add geo-enrichment to
> their own private network addresses based on modified versions of the geo
> database. Currently we don’t really allow this, since we hard-code ignoring
> private network classes into the geo enrichment adapter, but I can see a
> case where a global org might want to add their own ranges and locations to
> the data set. Does that make sense to anyone else?
>
> Simon
>
>
> > On 16 Jan 2017, at 16:50, JJ Meyer <jj...@gmail.com> wrote:
> >
> > Hello all,
> >
> > Can we leverage maxmind's Java client (
> > https://github.com/maxmind/GeoIP2-java/tree/master/src/
> main/java/com/maxmind/geoip2)
> > in this case? I believe it can directly read the MaxMind file. Plus I think
> it
> > also has some support for caching.
> >
> > Thanks,
> > JJ

Re: [DISCUSS] Moving GeoIP management away from MySQL

Posted by Simon Elliston Ball <si...@simonellistonball.com>.
I like that approach even more. That way we would only have to worry about distributing the database file in binary format to all the supervisor nodes on update.

It would also make it easier for people to switch to the enterprise DB potentially if they had the license. 

One slight issue with this might be for people who wanted to extend the database. For example, organisations may want to add geo-enrichment to their own private network addresses based on modified versions of the geo database. Currently we don’t really allow this, since we hard-code ignoring private network classes into the geo enrichment adapter, but I can see a case where a global org might want to add their own ranges and locations to the data set. Does that make sense to anyone else?

Simon
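
One way that kind of extension could work without modifying the MaxMind data itself (a sketch under assumptions, with illustrative names, not a proposal for the actual adapter): keep the site-specific ranges in a small overlay map that is consulted before falling back to the stock lookup.

```java
import java.util.AbstractMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.LongFunction;

public class OverlayGeoLookup {
    // start-of-range -> (end-of-range, label); holds site-specific additions only.
    private final TreeMap<Long, Map.Entry<Long, String>> overlay = new TreeMap<>();
    // Fallback resolver, e.g. the stock GeoIP lookup, keyed by the IP as a long.
    private final LongFunction<String> fallback;

    public OverlayGeoLookup(LongFunction<String> fallback) {
        this.fallback = fallback;
    }

    public void addPrivateRange(long start, long end, String label) {
        overlay.put(start, new AbstractMap.SimpleEntry<>(end, label));
    }

    /** Site-specific ranges win; anything else goes to the stock database. */
    public String lookup(long ip) {
        Map.Entry<Long, Map.Entry<Long, String>> e = overlay.floorEntry(ip);
        if (e != null && ip <= e.getValue().getKey()) {
            return e.getValue().getValue();
        }
        return fallback.apply(ip);
    }
}
```

The point of the composition is that the overlay stays small and site-owned, so refreshing the public database never clobbers local additions.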


> On 16 Jan 2017, at 16:50, JJ Meyer <jj...@gmail.com> wrote:
> 
> Hello all,
> 
> Can we leverage maxmind's Java client (
> https://github.com/maxmind/GeoIP2-java/tree/master/src/main/java/com/maxmind/geoip2)
> in this case? I believe it can directly read the MaxMind file. Plus I think it
> also has some support for caching.
> 
> Thanks,
> JJ


Re: [DISCUSS] Moving GeoIP management away from MySQL

Posted by JJ Meyer <jj...@gmail.com>.
Hello all,

Can we leverage maxmind's Java client (
https://github.com/maxmind/GeoIP2-java/tree/master/src/main/java/com/maxmind/geoip2)
in this case? I believe it can directly read the MaxMind file. Plus I think it
also has some support for caching.

Thanks,
JJ
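
For reference, the GeoIP2-java reader can be pointed at a local copy of the binary database. A minimal sketch, assuming the geoip2 and maxmind-db jars are on the classpath and the .mmdb file has already been pulled down locally (the path is a placeholder):

```java
import java.io.File;
import java.net.InetAddress;

import com.maxmind.db.CHMCache;
import com.maxmind.geoip2.DatabaseReader;
import com.maxmind.geoip2.model.CityResponse;

public class GeoIp2Sketch {
    public static void main(String[] args) throws Exception {
        // Placeholder path: wherever the worker has pulled the .mmdb file locally.
        File db = new File("/tmp/GeoLite2-City.mmdb");

        // CHMCache memoizes decoded nodes, which helps with repeated IPs.
        try (DatabaseReader reader = new DatabaseReader.Builder(db)
                .withCache(new CHMCache())
                .build()) {
            CityResponse resp = reader.city(InetAddress.getByName("203.0.113.7"));
            System.out.println(resp.getCity().getName() + " / "
                + resp.getLocation().getLatitude() + ","
                + resp.getLocation().getLongitude());
        }
    }
}
```

That would sidestep building our own MapDB file entirely, at the cost of being tied to the MaxMind binary format.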

On Mon, Jan 16, 2017 at 10:32 AM, Simon Elliston Ball <
simon@simonellistonball.com> wrote:

> I like the idea of MapDB, since we can essentially pull an instance into
> each supervisor, so it makes a lot of sense for relatively small scale,
> relatively static enrichments in general.
>
> Generally this feels like a caching problem, and would be for a simple
> key-value lookup. In that case I would agree with David Lyle on using HBase
> as a source of truth and relying on caching.
>
> That said, GeoIP is a different lookup pattern, since it’s a range lookup
> then a key lookup (or if we denormalize the MaxMind data, just a range
> lookup). For that kind of thing, MapDB with something like the BTree seems a
> good fit.
>
> Simon

Re: [DISCUSS] Moving GeoIP management away from MySQL

Posted by Casey Stella <ce...@gmail.com>.
+1 to the point about this being a different lookup pattern.  If this were
able to be done in a multi-get, I'd be all for HBase, but I worry about
scan performance, a historical sore point for HBase architectures.
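
To make the lookup pattern concrete: if the MaxMind ranges are flattened to (start, end, location) triples keyed by the start address as a long, the whole lookup is one floor lookup plus a bounds check, no scan. A stdlib sketch with illustrative names (MapDB's BTreeMap exposes the same NavigableMap operations, so the same shape would apply there):

```java
import java.util.Map;
import java.util.TreeMap;

public class GeoRangeLookup {
    /** End of a range plus the enrichment payload attached to it. */
    private static final class Range {
        final long end;
        final String location;
        Range(long end, String location) { this.end = end; this.location = location; }
    }

    // Keyed by start-of-range; floorEntry finds the candidate range in O(log n).
    private final TreeMap<Long, Range> ranges = new TreeMap<>();

    public void addRange(String startIp, String endIp, String location) {
        ranges.put(toLong(startIp), new Range(toLong(endIp), location));
    }

    /** Returns the location for ip, or null when no range covers it. */
    public String lookup(String ip) {
        long key = toLong(ip);
        Map.Entry<Long, Range> e = ranges.floorEntry(key);
        return (e != null && key <= e.getValue().end) ? e.getValue().location : null;
    }

    /** Packs a dotted-quad IPv4 address into an unsigned 32-bit value. */
    static long toLong(String ipv4) {
        long v = 0;
        for (String octet : ipv4.split("\\.")) {
            v = (v << 8) | Integer.parseInt(octet);
        }
        return v;
    }
}
```

An HBase row-key layout can emulate floor lookups with a reversed scan of one row, but that is exactly the per-message scan being questioned above.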

On Mon, Jan 16, 2017 at 11:32 AM, Simon Elliston Ball <
simon@simonellistonball.com> wrote:

> I like the idea of MapDB, since we can essentially pull an instance into
> each supervisor, so it makes a lot of sense for relatively small scale,
> relatively static enrichments in general.
>
> Generally this feels like a caching problem, and would be for a simple
> key-value lookup. In that case I would agree with David Lyle on using HBase
> as a source of truth and relying on caching.
>
> That said, GeoIP is a different lookup pattern, since it’s a range lookup
> then a key lookup (or if we denormalize the MaxMind data, just a range
> lookup). For that kind of thing, MapDB with something like the BTree seems a
> good fit.
>
> Simon
>
>
> > On 16 Jan 2017, at 16:28, David Lyle <dl...@gmail.com> wrote:
> >
> > I'm +1 on removing the MySQL dependency, BUT - I'd prefer to see it as an
> > HBase enrichment. If our current caching isn't enough to mitigate the
> above
> > issues, we have a problem, don't we? Or do we not recommend HBase
> > enrichment for per message enrichment in general?
> >
> > Also- can you elaborate on how MapDB would not require a network hop?
> > Doesn't this mean we would have to sync the enrichment data to each Storm
> > supervisor? HDFS could (probably would) have a network hop too, no?
> >
> > Fwiw -
> > "In its place, I've looked at using MapDB, which is a really easy to use
> > library for creating Java collections backed by a file (This is NOT a
> > separate installation of anything, it's just a jar that manages
> interaction
> > with the file system).  Given the slow churn of the GeoIP files (I
> believe
> > they get updated once a week), we can have a script that can be run when
> > needed, downloads the MaxMind tar file, builds the MapDB file that will
> be
> > used by the bolts, and places it into HDFS.  Finally, we update a config
> to
> > point to the new file, the bolts get the updated config callback and can
> > update their db files.  Inside the code, we wrap the MapDB portions to
> make
> > it transparent to downstream code."
> >
> > Seems a bit more complex than "refresh the hbase table". Afaik, either
> > approach would require some sort of translation between GeoIP source
> format
> > and target format, so that part is a wash imo.
> >
> > So, I'd really like to see, at least, an attempt to leverage HBase
> > enrichment.
> >
> > -D...
> >
> >
> > On Mon, Jan 16, 2017 at 11:02 AM, Casey Stella <ce...@gmail.com>
> wrote:
> >
> >> I think that it's a sensible thing to use MapDB for the geo enrichment.
> >> Let me state my reasoning:
> >>
> >>   - An HBase implementation would necessitate an HBase scan possibly
> >>   hitting HDFS, which is expensive per-message.
> >>   - An HBase implementation would necessitate a network hop and MapDB
> >>   would not.
> >>
> >> I also think this might be the beginning of a more general purpose
> support
> >> in Stellar for locally shipped, read-only MapDB lookups, which might be
> >> interesting.
> >>
> >> In short, all quotes about premature optimization are sure to apply to
> my
> >> reasoning, but I can't help but have my spidey senses tingle when we
> >> introduce a scan-per-message architecture.
> >>
> >> Casey
> >>
> >> On Mon, Jan 16, 2017 at 10:53 AM, Dima Kovalyov <
> Dima.Kovalyov@sstech.us>
> >> wrote:
> >>
> >>> Hello Justin,
> >>>
> >>> Considering that Metron uses hbase tables for storing enrichment and
> >>> threatintel feeds, can we use Hbase for geo enrichment as well?
> >>> Or MapDB can be used for enrichment and threatintel feeds instead of
> >> hbase?
> >>>
> >>> - Dima
> >>>
> >>> On 01/16/2017 04:17 PM, Justin Leet wrote:
> >>>> Hi all,
> >>>>
> >>>> As a bit of background, right now, GeoIP data is loaded into and
> >> managed
> >>> by
> >>>> MySQL (the connectors are LGPL licensed and we need to sever our Maven
> >>>> dependency on it before next release). We currently depend on and
> >> install
> >>>> an instance of MySQL (in each of the Management Pack, Ansible, and
> >> Docker
> >>>> installs). In the topology, we use the JDBCAdapter to connect to MySQL
> >>> and
> >>>> query for a given IP.  Additionally, it's a single point of failure
> for
> >>>> that particular enrichment right now.  If MySQL is down, geo
> enrichment
> >>>> can't occur.
> >>>>
> >>>> I'm proposing that we eliminate the use of MySQL entirely, through all
> >>>> installation paths (which, unless I missed some, includes Ansible, the
> >>>> Ambari Management Pack, and Docker).  We'd do this by dropping all the
> >>>> various MySQL setup and management through the code, along with all
> the
> >>>> DDL, etc.  The JDBCAdapter would stay, so that anybody who wants to
> >> setup
> >>>> their own databases for enrichments and install connectors is able to
> >> do
> >>> so.
> >>>>
> >>>> In its place, I've looked at using MapDB, which is a really easy to
> use
> >>>> library for creating Java collections backed by a file (This is NOT a
> >>>> separate installation of anything, it's just a jar that manages
> >>> interaction
> >>>> with the file system).  Given the slow churn of the GeoIP files (I
> >>> believe
> >>>> they get updated once a week), we can have a script that can be run
> >> when
> >>>> needed, downloads the MaxMind tar file, builds the MapDB file that
> will
> >>> be
> >>>> used by the bolts, and places it into HDFS.  Finally, we update a
> >> config
> >>> to
> >>>> point to the new file, the bolts get the updated config callback and
> >> can
> >>>> update their db files.  Inside the code, we wrap the MapDB portions to
> >>> make
> >>>> it transparent to downstream code.
> >>>>
> >>>> The particularly nice parts about using MapDB are that its ease of use
> >>> plus
> >>>> it offers the utilities we need out of the box to be able to support
> >> the
> >>>> operations we need on this (Keep in mind the GeoIP files use IP ranges
> >>> and
> >>>> we need to be able to easily grab the appropriate range).
> >>>>
> >>>> The main point of concern I have about this is that when we grab the
> >> HDFS
> >>>> file during an update, given that multiple JVMs can be running, we
> >> don't
> >>>> want them to clobber each other. I believe this can be avoided by
> >> simply
> >>>> using each worker's working directory to store the file (and
> >>> appropriately
> >>>> ensure threads on the same JVM manage multithreading).  This should
> >> keep
> >>>> the JVMs (and the underlying DB files) entirely independent.
> >>>>
> >>>> This script would get called by the various installations during
> >> startup
> >>> to
> >>>> do the initial setup.  After install, it can then be called on demand
> >> in
> >>>> order.
> >>>>
> >>>> At this point, we should be all set, with everything running and
> >>> updatable.
> >>>>
> >>>> Justin
> >>>>
> >>>
> >>>
> >>
>
>

Re: [DISCUSS] Moving GeoIP management away from MySQL

Posted by Simon Elliston Ball <si...@simonellistonball.com>.
I like the idea of MapDB, since we can essentially pull an instance into each supervisor, so it makes a lot of sense for relatively small scale, relatively static enrichments in general. 

Generally this feels like a caching problem, and it would be for a simple key-value lookup. In that case I would agree with David Lyle on using HBase as a source of truth and relying on caching.

That said, GeoIP is a different lookup pattern, since it’s a range lookup then a key lookup (or, if we denormalize the MaxMind data, just a range lookup). For that kind of thing, MapDB with something like its BTreeMap seems a good fit.
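The range-then-key lookup described above maps naturally onto a sorted map's floorEntry. Below is a minimal sketch using plain java.util.TreeMap as a stand-in (MapDB's BTreeMap implements the same NavigableMap interface, so the lookup code would be identical); the IPv4-to-long conversion and the sample ranges are illustrative assumptions, not the actual MaxMind schema:

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class GeoRangeLookup {
    // One entry per range: key = first address, value = {last address, geo label}.
    static final class Range {
        final long end;
        final String info;
        Range(long end, String info) { this.end = end; this.info = info; }
    }

    private final NavigableMap<Long, Range> ranges = new TreeMap<>();

    void put(String startIp, String endIp, String info) {
        ranges.put(toLong(startIp), new Range(toLong(endIp), info));
    }

    // Range lookup: find the greatest range start <= ip, then verify the range end.
    String lookup(String ip) {
        long addr = toLong(ip);
        Map.Entry<Long, Range> e = ranges.floorEntry(addr);
        return (e != null && addr <= e.getValue().end) ? e.getValue().info : null;
    }

    // Illustrative dotted-quad IPv4 -> unsigned long conversion.
    static long toLong(String ip) {
        long v = 0;
        for (String octet : ip.split("\\.")) {
            v = (v << 8) | Integer.parseInt(octet);
        }
        return v;
    }
}
```

A plain HashMap can't express "which range contains this address"; the sorted-map floorEntry gives that in one O(log n) probe, which is the property that makes a BTree-style structure a good fit here.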

Simon


> On 16 Jan 2017, at 16:28, David Lyle <dl...@gmail.com> wrote:


Re: [DISCUSS] Moving GeoIP management away from MySQL

Posted by David Lyle <dl...@gmail.com>.
I'm +1 on removing the MySQL dependency, BUT - I'd prefer to see it as an
HBase enrichment. If our current caching isn't enough to mitigate the above
issues, we have a problem, don't we? Or do we not recommend HBase
enrichment for per-message enrichment in general?

Also, can you elaborate on how MapDB would not require a network hop?
Doesn't this mean we would have to sync the enrichment data to each Storm
supervisor? HDFS could (probably would) have a network hop too, no?
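For reference, the kind of caching being debated here - a bounded LRU cache in front of a remote lookup, so only misses pay the network round trip - can be sketched as follows. The `fetchRemote` function is a hypothetical stand-in for the real HBase enrichment read, not Metron's actual adapter:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class CachedLookup<K, V> {
    private final Map<K, V> cache;
    private final Function<K, V> fetchRemote; // stand-in for the HBase read
    long remoteHits = 0;                      // how often we actually went remote

    CachedLookup(int maxEntries, Function<K, V> fetchRemote) {
        this.fetchRemote = fetchRemote;
        // Access-ordered LinkedHashMap that evicts the least-recently-used entry.
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    V get(K key) {
        V v = cache.get(key);
        if (v == null) {             // cache miss: pay the remote cost once
            remoteHits++;
            v = fetchRemote.apply(key);
            cache.put(key, v);
        }
        return v;
    }
}
```

Whether this is "enough" depends entirely on the key distribution: with a long tail of distinct IPs, the miss rate (and therefore per-message remote lookups) stays high no matter how the cache is sized, which is the core of the disagreement above.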

Fwiw -
"In its place, I've looked at using MapDB, which is a really easy to use
library for creating Java collections backed by a file (This is NOT a
separate installation of anything, it's just a jar that manages interaction
with the file system).  Given the slow churn of the GeoIP files (I believe
they get updated once a week), we can have a script that can be run when
needed, downloads the MaxMind tar file, builds the MapDB file that will be
used by the bolts, and places it into HDFS.  Finally, we update a config to
point to the new file, the bolts get the updated config callback and can
update their db files.  Inside the code, we wrap the MapDB portions to make
it transparent to downstream code."

Seems a bit more complex than "refresh the hbase table". Afaik, either
approach would require some sort of translation between GeoIP source format
and target format, so that part is a wash imo.

So, I'd really like to see, at least, an attempt to leverage HBase
enrichment.

-D...


On Mon, Jan 16, 2017 at 11:02 AM, Casey Stella <ce...@gmail.com> wrote:


Re: [DISCUSS] Moving GeoIP management away from MySQL

Posted by Casey Stella <ce...@gmail.com>.
I think that it's a sensible thing to use MapDB for the geo enrichment.
Let me state my reasoning:

   - An HBase implementation would necessitate an HBase scan, possibly
   hitting HDFS, which is expensive per message.
   - An HBase implementation would necessitate a network hop; MapDB
   would not.

I also think this might be the beginning of more general-purpose support
in Stellar for locally shipped, read-only MapDB lookups, which might be
interesting.

In short, all quotes about premature optimization are sure to apply to my
reasoning, but I can't help but have my spidey senses tingle when we
introduce a scan-per-message architecture.
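The "locally shipped, read-only" idea, combined with the config-callback refresh from the original proposal, could look roughly like the sketch below: the loaded database is held behind an AtomicReference and replaced wholesale when a new file arrives, so readers never observe a half-updated snapshot. The plain Map here is a placeholder for actually opening the new MapDB file; the class and method names are illustrative, not Metron code:

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

public class SwappableGeoDb {
    // Readers always see a complete, immutable snapshot; updates swap it atomically.
    private final AtomicReference<Map<String, String>> current =
            new AtomicReference<>(Map.of());

    String lookup(String ip) {
        return current.get().get(ip);
    }

    // Called from the config-update callback after the new file has been
    // pulled down and loaded (loading is a placeholder here): publish the
    // fresh snapshot in a single atomic swap, with no locks on the read path.
    void onConfigUpdate(Map<String, String> freshlyLoaded) {
        current.set(freshlyLoaded);
    }
}
```

The read path stays entirely local and lock-free, which is the property that avoids both the per-message network hop and the scan-per-message concern.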

Casey

On Mon, Jan 16, 2017 at 10:53 AM, Dima Kovalyov <Di...@sstech.us>
wrote:


Re: [DISCUSS] Moving GeoIP management away from MySQL

Posted by Dima Kovalyov <Di...@sstech.us>.
Hello Justin,

Considering that Metron uses HBase tables for storing enrichment and
threat intel feeds, can we use HBase for geo enrichment as well?
Or could MapDB be used for enrichment and threat intel feeds instead of HBase?

- Dima

On 01/16/2017 04:17 PM, Justin Leet wrote: