
Posted to solr-user@lucene.apache.org by Cool Techi <co...@outlook.com> on 2014/05/09 19:39:17 UTC

Storing tweets For WC2014

Hi,
We have a requirement from one of our customers to provide search and analytics on the upcoming Soccer World Cup. Given the sheer volume of tweets that would be generated at such an event, I cannot imagine what would be required to store this in Solr.
It would be great if there could be some pointers on the scale or hardware required, the number of shards that should be created, etc. Some requirements:
All the tweets should be searchable (approximately 100 million tweets/day * 60 days of event). All fields on tweets should be searchable, with faceting on numeric and date fields. Facets would be run on Twitter IDs (unique users), tweet creation date, location, and sentiment (some of these are fields which we generate).

If anyone has attempted anything like this, it would be helpful to hear about it.
Regards, Rohit
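
For a sense of scale, the figures in the post can be multiplied out. This is only a back-of-the-envelope sketch: the 100 million tweets/day and the 60 days come from the post, while the average indexed size per tweet is a placeholder assumption (not a measured value) that would have to be replaced with a number taken from a sample index.

```python
# Back-of-the-envelope volume estimate for the requirement above.
TWEETS_PER_DAY = 100_000_000      # from the post
EVENT_DAYS = 60                   # from the post
ASSUMED_BYTES_PER_TWEET = 1_000   # hypothetical placeholder; measure on a sample index


def total_tweets(per_day: int = TWEETS_PER_DAY, days: int = EVENT_DAYS) -> int:
    """Total number of documents to index over the whole event."""
    return per_day * days


def index_size_tb(docs: int, bytes_per_doc: int = ASSUMED_BYTES_PER_TWEET) -> float:
    """Rough on-disk index size in terabytes for a given per-document size."""
    return docs * bytes_per_doc / 1e12


if __name__ == "__main__":
    docs = total_tweets()
    print(f"{docs:,} tweets over the event")
    print(f"~{index_size_tb(docs):.1f} TB at an assumed "
          f"{ASSUMED_BYTES_PER_TWEET} bytes per tweet")
```

The document count is the firm number here; the size estimate moves linearly with whatever per-tweet figure is plugged in.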

Re: Storing tweets For WC2014

Posted by Michael Della Bitta <mi...@appinions.com>.

Some of the data providers for Twitter offer a search API. Depending on
what you're doing, you might not even need to host this yourself.

My company does do search and analytics over tweets, but by the time we end
up indexing them, we've winnowed down the initial set to 10% of what we've
initially ingested, which itself is a fraction of the total set of tweets
as our data provider has let us filter for the ones that have the keywords
we want.

Our news index approaches the size of what you're talking about within an
order of magnitude (where 'news' is really an index of sentences taken from
news reports, along with metadata about the document the news came from).
Overall, we're hosting about 310 million records (give or take, depending
on where we are in the sharding cycle) in a cluster of 5 AWS i2.xlarge boxes.

This setup indexes from our feeds in real time, which means there's no mass
loading. Additionally, we generally do bulk data collection across only 3
days of data, so if you're looking to do a mess of reporting against your
full set, take that into consideration.
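
To relate these numbers to the original question, here is a deliberately naive extrapolation. The 310 million records and 5 boxes are from this message, and the 100 million tweets/day for 60 days is from the original post. The linear scaling itself is an assumption: tweets are not the same shape as news sentences, and heavy faceting or full-set reporting may need far more headroom. The `keep_fraction` parameter is a hypothetical knob illustrating the effect of filtering before indexing.

```python
import math

RECORDS_HOSTED = 310_000_000        # from this message
NODES = 5                           # AWS i2.xlarge boxes, from this message
TARGET_DOCS = 100_000_000 * 60      # tweets/day * days, from the original post


def records_per_node(records: int = RECORDS_HOSTED, nodes: int = NODES) -> float:
    """Observed document density per box in the existing cluster."""
    return records / nodes


def nodes_needed(target_docs: int = TARGET_DOCS, keep_fraction: float = 1.0) -> int:
    """Boxes needed if density scaled linearly (an assumption, not a guarantee).

    keep_fraction is the hypothetical share of tweets actually indexed
    after filtering, e.g. 0.1 if only 10% survive winnowing.
    """
    return math.ceil(target_docs * keep_fraction / records_per_node())


if __name__ == "__main__":
    print(f"{records_per_node():,.0f} records per node today")
    print(f"{nodes_needed()} nodes for the full set (naive linear scaling)")
    print(f"{nodes_needed(keep_fraction=0.1)} nodes if only 10% is indexed")
```

Treat the result as a starting point for a load test on real data rather than a sizing answer.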

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


Re: Storing tweets For WC2014

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

That's a lot of tweets. There is an article covering lessons at a smaller
scale that might still be useful:
http://ricston.com/blog/guerrilla-search-solr-run-3-million-documents-search-15month-machine/

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency



Re: Storing tweets For WC2014

Posted by Aman Tandon <am...@gmail.com>.

I haven't tried a situation like this, but as per your requirements you can
define a schema with all the fields you need (date, location, etc.). You can
also configure faceting defaults in solrconfig.xml if you want the same
facets for every request.
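
As a rough illustration of what such a faceting request could look like, the sketch below builds the query string for a Solr select request. The parameter names (`facet`, `facet.field`, `facet.range` and the per-field `f.<field>.facet.range.*` overrides) are standard Solr faceting parameters, but the field names `user_id`, `location`, `sentiment` and `created_at` are hypothetical and would have to match your own schema. The same parameters could instead be placed in the defaults of a request handler in solrconfig.xml.

```python
from urllib.parse import urlencode

# Hypothetical field names; replace with the ones defined in your schema.
FACET_FIELDS = ["user_id", "location", "sentiment"]
DATE_FIELD = "created_at"


def facet_query_params(q: str = "*:*") -> str:
    """Build a URL-encoded query string for a facet-only Solr request."""
    params = [
        ("q", q),
        ("rows", "0"),        # only facet counts are wanted, no documents
        ("facet", "true"),
    ]
    # One facet.field entry per field to count distinct values on.
    params += [("facet.field", name) for name in FACET_FIELDS]
    # Daily buckets over the last 60 days, expressed with Solr date math.
    params += [
        ("facet.range", DATE_FIELD),
        (f"f.{DATE_FIELD}.facet.range.start", "NOW/DAY-60DAYS"),
        (f"f.{DATE_FIELD}.facet.range.end", "NOW/DAY+1DAY"),
        (f"f.{DATE_FIELD}.facet.range.gap", "+1DAY"),
        ("wt", "json"),
    ]
    return urlencode(params)


if __name__ == "__main__":
    # Append the result to your collection's /select URL.
    print(facet_query_params())
```

Using a list of pairs rather than a dict keeps the repeated `facet.field` entries, which a dict would collapse into one.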

You should give it a try by allocating 2-4 GB of heap space; then you can
increase the size after testing under heavy load.
All the hardware parameters are adjustable, so you have to try it
yourself. If problems arise, look at the Solr logs; if there is an issue
related to memory, you can visualize the GC graphs and allocate more
memory accordingly.

I am not an expert, just a newbie in Solr, so some points may not be
well explained, but you should experiment with it. I guess you
have sufficient time before July ;).

With Regards
Aman Tandon
