Posted to user@accumulo.apache.org by Jared winick <ja...@gmail.com> on 2012/04/24 15:35:31 UTC

Trendulo - A Twitter Analytics Demo on Accumulo

I gave an Introduction to Apache Accumulo
<http://www.slideshare.net/jaredwinick/introduction-to-apache-accumulo>
presentation last month at the Boulder/Denver Meetup
<http://www.meetup.com/Boulder-Denver-Big-Data/events/55277392/> where I
demoed an application that used Accumulo to provide real-time and
historical access to words/phrases seen in Twitter messages as well as
daily trend analysis. I finally got the demo polished up a bit and running
on Amazon EC2 where it can be found at http://trendulo.com.

Trendulo is still pretty Alpha at this point so please feel free to add to
the existing documented issues at
https://github.com/jaredwinick/trendulo where
you can also obviously find the source.

As an example, the following link will show the launch of Instagram's
Android client, followed by Facebook's purchase and then a small increase
in general "chatter" about the product: http://goo.gl/XcCG8

Let me know if anyone has any questions or comments.  Feel free to tweet
@trendulo any interesting searches and I can retweet them out.

Jared

Re: Trendulo - A Twitter Analytics Demo on Accumulo

Posted by Keith Turner <ke...@deenlo.com>.
Jared

That's awesome!

What happened on Mar 19 and 20?

Keith


Re: Trendulo - A Twitter Analytics Demo on Accumulo

Posted by Jason Trost <ja...@gmail.com>.
This is awesome Jared.  Thanks for sharing.


RE: EXTERNAL: Trendulo - A Twitter Analytics Demo on Accumulo

Posted by "Cardon, Tejay E" <te...@lmco.com>.
Very nice.  This will be fun to play with


Re: Trendulo - A Twitter Analytics Demo on Accumulo

Posted by Eric Newton <er...@gmail.com>.
Aw, man, I'm not going to get anything done today!  This is fun!

-Eric


Re: Trendulo - A Twitter Analytics Demo on Accumulo

Posted by Keith Turner <ke...@deenlo.com>.
Jared,

Searching for the word "school" is neat; you can clearly see the weekends.

The domain name is cool.

Keith


Re: Trendulo - A Twitter Analytics Demo on Accumulo

Posted by Aaron Cordova <aa...@cordovas.org>.
On Apr 25, 2012, at 3:10 PM, Jared winick wrote:

> Aaron, I am using EBS now and I haven't seen any problems, that said my load is obviously not extreme.  When I initially moved things from my home workstation to EC2, I had a few months of tweets to ingest. For that initial ingest I did run with local instance storage as I saw extremely variable performance when I first tried EBS. The instance storage was better, though not as good as what I see on bare metal. 

Thanks for the info. I get the sense that you can scale up a single server more easily using EBS since you can attach like 10 volumes and RAID them up together. More vols might mean less variability too depending on how you configure RAID.

> Jared


Re: Trendulo - A Twitter Analytics Demo on Accumulo

Posted by Jared winick <ja...@gmail.com>.
Here is an up-to-date estimate. I naively reported disk usage as the "Disk
Used" field under the Accumulo Master section of the monitor. Currently it
appears I am only actually using ~26 GB of storage for my Accumulo tables.
This is based on the "% Used" * "Unreplicated Capacity" fields in the
NameNode section of the monitor, which is also corroborated by looking at
the file system usage for the HDFS data directories. I have no other data
in HDFS.

Dec 24 - Apr 30 = 128 days
3.0 billion entries / 128 days = 23.4 million entries/day
23.4 million entries/day / 1.2 million tweets/day ~ 20 entries/tweet
(I am not sure if I misrepresented the number of tweets per day as 3
million before, but it is about 1.2 million.)

26 GB / (128 * 1.2e6) ~ 182 bytes/tweet  (taking 26 GB as 26 * 1024 ** 3
bytes)
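
The arithmetic above can be checked with a short script. This is only a sketch: the function name is made up for illustration, the years in the dates are inferred from when the thread was posted, and "GB" is assumed to mean 1024 ** 3 bytes, which is the interpretation that reproduces the ~182 figure.

```python
from datetime import date

GIB = 1024 ** 3  # assumption: "GB" in the thread means 1024**3 bytes


def per_tweet_estimate(total_entries, tweets_per_day, days, bytes_on_disk):
    """Return (entries per tweet, bytes per tweet) for the given totals."""
    total_tweets = tweets_per_day * days
    return total_entries / total_tweets, bytes_on_disk / total_tweets


# Collection window from the message; the years are inferred from the post date.
days = (date(2012, 4, 30) - date(2011, 12, 24)).days

entries_per_tweet, bytes_per_tweet = per_tweet_estimate(
    total_entries=3.0e9,      # "3.0 billion entries"
    tweets_per_day=1.2e6,     # "about 1.2" million tweets/day
    days=days,
    bytes_on_disk=26 * GIB,   # "~26 GB of storage"
)

print(days)                         # 128
print(round(entries_per_tweet, 1))  # 19.5, i.e. roughly 20 entries/tweet
print(round(bytes_per_tweet))       # 182
```

Note that "entries" here are the key-values that remain in the table after combining, so the ratio is lower than the number of key-values written per tweet at ingest time.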

I am using the VARLEN encoding for the SummingCombiner, which probably
saves a lot of space: I would imagine there are a lot of entries with a
very small count, as the language used on Twitter is far from normal.
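
To illustrate why a variable-length encoding helps when most counters are small, here is a generic varint sketch. It is not Accumulo's actual VARLEN byte format, which is not reproduced here; it is a LEB128-style scheme chosen only to show the size difference against a fixed 8-byte long.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a LEB128-style varint (7 bits per byte)."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative counts")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


FIXED_WIDTH = 8  # bytes needed for a fixed-length 64-bit long

for count in (1, 42, 1000, 3_000_000):
    # count, varint size in bytes, fixed-width size in bytes
    print(count, len(encode_varint(count)), FIXED_WIDTH)
```

With this scheme a count below 128 takes a single byte instead of eight, so a table dominated by rarely seen n-grams spends far less space on values.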


Re: Trendulo - A Twitter Analytics Demo on Accumulo

Posted by Eric Newton <er...@gmail.com>.
On Wed, Apr 25, 2012 at 3:10 PM, Jared winick <ja...@gmail.com> wrote:

> I am not exactly sure how to answer the question about storage size per
> tweet as I am not actually storing the original tweet and if a counter
> already exists for an n-gram/time period, then incrementing that counter
> doesn't increase the storage size. I can follow up with the current storage
> I am using though.
>

I see I can make some estimates based on the information in your talk. The
slides are awesome, btw.

Using the information you provided: Dec 24 - March 12... that's 88 days.
 2.6e9 entries, 3 million-ish tweets per day:

2.6e9 / (3e6 * 88)

~10 entries per tweet.

Also, you report disk usage of 72G,  which I will interpret as 72 * (1024
** 3) bytes.

So, each tweet, on average occupies: 72G / (88 * 3e6) Or, ~300 bytes.

-Eric

Re: Trendulo - A Twitter Analytics Demo on Accumulo

Posted by Jared winick <ja...@gmail.com>.
The approach is pretty brute force at ingest time so that queries can be
fast and efficient. For each tweet it builds all 1-, 2-, and 3-grams from
the message in the tweet. So an example message of:

"i can has cheezburger"

would be translated into the following n-grams

"i", "can", "has", "cheezburger", "i can", "can has", "has cheezburger", "i
can has", "can has cheezburger"

Then, for each n-gram, it keeps a daily and an hourly counter using a
SummingCombiner. The data model looks like:

rowId: n-gram
cf: DAY or HOUR
cq: date value (ex. 20120425)
value: counter

So a single tweet turns into many key-values, one for each n-gram/time
period. I would have to verify, but on average I think it works out to
about 60 key-values per tweet. I see anywhere from a few hundred
entries/sec inserted in the middle of the night to about 2000 entries/sec
during peak evening times.
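
The ingest scheme described above can be sketched in a few lines. This is an illustration, not the Trendulo source: the function names are hypothetical, whitespace tokenization is an assumption, the HOUR qualifier format is a guess (only the DAY example 20120425 appears in this thread), and a plain Counter stands in for the Accumulo table with its SummingCombiner.

```python
from collections import Counter
from datetime import datetime


def ngrams(message, max_n=3):
    """Return all 1..max_n word n-grams, shortest first, in message order."""
    words = message.split()  # assumption: simple whitespace tokenization
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]


def mutations(message, when):
    """Yield ((row, cf, cq), 1) for each n-gram and each time bucket."""
    day = when.strftime("%Y%m%d")     # matches the example cq 20120425
    hour = when.strftime("%Y%m%d%H")  # assumed format; not given in the thread
    for gram in ngrams(message):
        yield (gram, "DAY", day), 1
        yield (gram, "HOUR", hour), 1


# Adding values under the same key mimics what a summing combiner does.
table = Counter()
when = datetime(2012, 4, 25, 15, 10)
for message in ("i can has cheezburger", "can has cheezburger plz"):
    for key, value in mutations(message, when):
        table[key] += value

print(ngrams("i can has cheezburger"))
print(len(list(mutations("i can has cheezburger", when))))  # 18 key-values
print(table[("can has", "DAY", "20120425")])                # 2
```

A four-word message yields 9 n-grams and therefore 18 key-values (one DAY and one HOUR counter each); longer tweets produce proportionally more. Repeated n-grams only increment existing counters, which is why storage grows more slowly than the number of key-values written.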

I am not exactly sure how to answer the question about storage size per
tweet as I am not actually storing the original tweet and if a counter
already exists for an n-gram/time period, then incrementing that counter
doesn't increase the storage size. I can follow up with the current storage
I am using though.

Aaron, I am using EBS now and I haven't seen any problems; that said, my
load is obviously not extreme.  When I initially moved things from my home
workstation to EC2, I had a few months of tweets to ingest. For that
initial ingest I did run with local instance storage, as I saw extremely
variable performance when I first tried EBS. The instance storage was
better, though not as good as what I see on bare metal.

Jared

On Wed, Apr 25, 2012 at 7:43 AM, Aaron Cordova <aa...@cordovas.org> wrote:

> Speaking of storage - are you using EBS or local instance storage?
>
> On Apr 25, 2012, at 8:52 AM, Eric Newton wrote:
>
> How many key-values does a single tweet become, on average?  What's the
> storage size per tweet?
>

Re: Trendulo - A Twitter Analytics Demo on Accumulo

Posted by Aaron Cordova <aa...@cordovas.org>.
Speaking of storage - are you using EBS or local instance storage? 

On Apr 25, 2012, at 8:52 AM, Eric Newton wrote:

> How many key-values does a single tweet become, on average?  What's the storage size per tweet?
> 


Re: Trendulo - A Twitter Analytics Demo on Accumulo

Posted by Eric Newton <er...@gmail.com>.
How many key-values does a single tweet become, on average?  What's the
storage size per tweet?

On Wed, Apr 25, 2012 at 12:17 AM, Jared winick <ja...@gmail.com>wrote:

> Thanks for the kind words, I appreciate it. Keith, my ingest process
> was down on Mar 19-20, so that is why I am missing data for that
> period.
>
> For those who are curious, I am receiving about 1.2 million tweets a
> day and have about 3 billion entries in my main table.  I am actually
> getting by with everything running on an EC2 medium instance, which is
> obviously very far from ideal but I am trying to stay on a budget.
>
> I hope to add new features as time allows, things like near real-time
> trending and geospatial analytics.  If anyone has any ideas for
> features they think would be interesting, just let me know or add them
> as issues on the github page.
>

Re: Trendulo - A Twitter Analytics Demo on Accumulo

Posted by Jared winick <ja...@gmail.com>.
Thanks for the kind words, I appreciate it. Keith, my ingest process
was down on Mar 19-20, so that is why I am missing data for that
period.

For those who are curious, I am receiving about 1.2 million tweets a
day and have about 3 billion entries in my main table.  I am actually
getting by with everything running on an EC2 medium instance, which is
obviously very far from ideal but I am trying to stay on a budget.

I hope to add new features as time allows, things like near real-time
trending and geospatial analytics.  If anyone has any ideas for
features they think would be interesting, just let me know or add them
as issues on the github page.

On Tue, Apr 24, 2012 at 11:40 AM, Billie J Rinaldi
<bi...@ugov.gov> wrote:
> That's so cool that I'm creating a new section for it on our page of links:
> http://accumulo.apache.org/papers.html
>
> Billie
>

Re: Trendulo - A Twitter Analytics Demo on Accumulo

Posted by Billie J Rinaldi <bi...@ugov.gov>.
That's so cool that I'm creating a new section for it on our page of links:
http://accumulo.apache.org/papers.html

Billie
