Posted to dev@metron.apache.org by Bryan Taylor <bt...@rackspace.com> on 2015/12/10 02:47:35 UTC

External Data Refresh for GeoLocation

Hi Taylor,

Good point. This is probably a better title now.

The problem geolocation runs into can be summarized as: "I have a bolt that
depends on external data that can fit in memory and that changes in batch.
How can I update for a new batch without blocking the stream?"



On 12/9/15 7:31 PM, "P. Taylor Goetz" <pt...@gmail.com> wrote:

>Just a suggestion, but this thread has diverged from the original subject
>line. Which is fine and happens all the time.
>
>When it does happen, one thing you can/should do is change the subject
>line (e.g. "FooBar (was re: Hello)"). That makes it easier for subscribers
>to notice the discussion branch, and also makes it easy to follow when
>viewing the email archives.
>
>Just $.02 from a mentor. :)
>
>-Taylor
>
>> On Dec 9, 2015, at 8:14 PM, Bryan Taylor <bt...@rackspace.com> wrote:
>> 
>> Hi James,
>> 
>> It seems the solution to "stop the world while we rebuild the cache" is
>> to rebuild the new cache out of band in a separate thread/spout and then
>> inject a reference to it once it is complete.
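[Editor's note: that hand-off can be as small as an atomic reference swap. A minimal Java sketch, with class and field names invented here rather than taken from Metron: readers always see a complete snapshot, and the background loader publishes the rebuilt map in one pointer update.]

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

public class GeoCacheHolder {
    // Readers always dereference a complete, immutable-in-practice snapshot.
    private final AtomicReference<Map<String, String>> cache =
            new AtomicReference<>(new HashMap<>());

    public String lookup(String ip) {
        return cache.get().get(ip); // hits whichever snapshot is current
    }

    // Called by the out-of-band loader once the new batch is fully built.
    public void swap(Map<String, String> freshlyBuilt) {
        cache.set(freshlyBuilt); // old snapshot is GC'd once readers drain
    }

    public static void main(String[] args) {
        GeoCacheHolder holder = new GeoCacheHolder();
        Map<String, String> batch = new HashMap<>();
        batch.put("10.0.0.1", "US");
        holder.swap(batch);
        System.out.println(holder.lookup("10.0.0.1")); // US
    }
}
```

The key property is that the swap never blocks lookups: in-flight tuples finish against the old snapshot while new ones see the fresh data.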
>> 
>> Actually, I'm curious how updating the data works now with MySQL. The
>> "Setting Up GeoLite Data" page describes the initial load using a LOAD
>> DATA INFILE command in the MySQL shell, which would duplicate records if
>> it was run a second time.
>> 
>> I suppose you could delete and load all the new rows in one big atomic
>> transaction, but there could be subtle issues if you commit as you go
>> during the table data rebuild, since queries that happen along the way
>> would be hitting a blend of the two datasets. I've used partition
>> swapping for this with some databases, but I'm not sure if MySQL
>> supports that feature. A similar idea is to load the new data into a
>> completely new table, but expose it via a view and then recompile the
>> view when the new table is done. This has to get a lock on the view,
>> which will stop the world while the view definition changes, but that's
>> a pretty short duration.
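[Editor's note: for the swap-a-new-table-in idea, MySQL's multi-table RENAME TABLE is atomic, which avoids the view lock entirely. A hedged JDBC sketch follows; the connection URL and table names are assumptions, not anything from the Metron setup.]

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class GeoTableSwap {
    // MySQL executes all renames in one atomic step, so queries never see
    // a half-swapped state. Table names here are hypothetical.
    public static String swapSql(String live, String staging, String retired) {
        return "RENAME TABLE " + live + " TO " + retired + ", "
                + staging + " TO " + live;
    }

    public static void main(String[] args) throws Exception {
        // Assumed workflow:
        //   1. LOAD DATA INFILE into geo_staging (a fresh copy of the schema)
        //   2. swap live and staging in one atomic RENAME TABLE
        //   3. drop the retired table at leisure
        try (Connection c = DriverManager.getConnection(
                "jdbc:mysql://localhost/geo", "user", "pass");
             Statement s = c.createStatement()) {
            s.execute(swapSql("geo_ip", "geo_staging", "geo_retired"));
            s.execute("DROP TABLE geo_retired");
        }
    }
}
```

Because LOAD DATA INFILE targets the empty staging table, rerunning the load can never duplicate rows in the live table.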
>> 
>> Bryan
>> 
>> 
>> 
>>> On 12/9/15 6:20 PM, "James Sirota" <js...@hortonworks.com> wrote:
>>> 
>>> So the nature of the problem was that as we were processing ~1.3
>>> million messages per second, the time it took for the in-memory DB to
>>> update caused the Storm tuples to back up to a point where this would
>>> bring down the topology.  We also had problems during initialization.
>>> I don't know if this feature exists now, but at the time we couldn't
>>> figure out a way to have the topology deploy and wait for all the
>>> instances of the Geo bolt to finish reading their data and signal back
>>> that they were ready.  So at initialization they would get blasted
>>> with tuples and fall over.  We solved that problem at the time by
>>> delaying our ingest 30 seconds to give the topology a chance to fully
>>> come up.  But eventually we decided we needed to simplify things, so
>>> we abandoned the in-memory route.
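[Editor's note: the readiness signal James describes could be built from a countdown gate. A single-JVM sketch, with names invented here; in a real topology the "loaded" signals from bolt instances would have to travel through something like ZooKeeper rather than a shared in-process latch.]

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class GeoBoltReadiness {
    private final CountDownLatch ready;

    public GeoBoltReadiness(int boltInstances) {
        this.ready = new CountDownLatch(boltInstances);
    }

    // Each bolt instance calls this from prepare() once its geo data is loaded.
    public void signalLoaded() {
        ready.countDown();
    }

    // The spout (or an ingest gate) waits here before emitting tuples,
    // replacing the fixed 30-second startup delay with an explicit signal.
    public boolean awaitAllReady(long timeout, TimeUnit unit)
            throws InterruptedException {
        return ready.await(timeout, unit);
    }

    public static void main(String[] args) throws Exception {
        GeoBoltReadiness gate = new GeoBoltReadiness(3);
        for (int i = 0; i < 3; i++) {
            // stand-in for each bolt reading its GeoLite data in prepare()
            new Thread(gate::signalLoaded).start();
        }
        System.out.println(gate.awaitAllReady(5, TimeUnit.SECONDS)); // true
    }
}
```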
>>> 
>>> Thanks,
>>> James    
>>> 
>>> 
>>> 
>>>> On 12/9/15, 5:35 PM, "Bryan Taylor" <bt...@rackspace.com> wrote:
>>>> 
>>>> 
>>>> The GeoLite site says they update once a month, so I assume something
>>>> can check for this and grab the new file. It seems like a fun problem
>>>> to have this also trigger a rebuild of the in-memory cache and swap
>>>> it out live. This seems like it would be a useful streaming
>>>> enrichment pattern, where the configuration data for the enrichment
>>>> changes.
>>>> 
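[Editor's note: that monthly check could be driven by a simple scheduler. A rough Java sketch; the class and method names are invented here, and the actual download/rebuild step is left as a placeholder since the thread doesn't settle on a mechanism.]

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class GeoLiteRefresher {
    // GeoLite publishes roughly monthly; treat anything at or past maxAge as due.
    public static boolean isDue(Instant lastFetched, Instant now, Duration maxAge) {
        return Duration.between(lastFetched, now).compareTo(maxAge) >= 0;
    }

    public static void main(String[] args) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        // Poll daily, but only download and rebuild when the data is stale.
        scheduler.scheduleAtFixedRate(() -> {
            // Placeholder for: check Last-Modified on the GeoLite URL,
            // fetch the new file if isDue(...), rebuild the cache off-thread,
            // then swap the reference as discussed earlier in the thread.
        }, 0, 1, TimeUnit.DAYS);
        scheduler.shutdown(); // sketch only; a real topology would keep this running
    }
}
```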
>>>> This does raise another interesting question about what we expect the
>>>> memory profile of the stream processing to be. 70MB + 40MB isn't that
>>>> big by itself, but when is it worth it? And how do operators take
>>>> advantage of more system memory if they have it?
>>>> 
>>>> 
>>>>> On 12/9/15 4:32 PM, "James Sirota" <js...@hortonworks.com> wrote:
>>>>> 
>>>>> Hi Bryan,
>>>>> 
>>>>> We had HSQLDB at one point, but we were struggling to make these
>>>>> bolts reliable.  Also, the geo data needs to be periodically updated
>>>>> and it's easier to do when it's decoupled.
>>>>> 
>>>>> Thanks,
>>>>> James
>>>>> 
>>>>> 
>>>>> 


Re: External Data Refresh for GeoLocation

Posted by "P. Taylor Goetz" <pt...@gmail.com>.
This might be a good case for the distributed cache/blob store feature coming in Storm 1.0 (hopefully released this month).

It lets you update topology resources without restarting the topology.

I'll try to dig up more info when I'm not on a phone.

-Taylor
