Posted to solr-user@lucene.apache.org by phoey <ph...@gmail.com> on 2010/11/30 12:44:12 UTC

SOLR for Log analysis feasibility

We are looking into building a reporting feature and investigating solutions
that will allow us to search through our logs for downloads, searches and
view history.

Each log item is relatively small:

download history

<add>
	<doc>
		<field name="uuid">item123-v1</field>
		<field name="market">photography</field>
		<field name="name">item 1</field>
		<field name="userid">1</field>
		<field name="version">1</field>
		<field name="downloadType">hires</field>
		<field name="itemId">123</field>
		<field name="timestamp">2009-11-07T14:50:54Z</field>
	</doc>
</add> 

search history

<add>
	<doc>
		<field name="uuid">1</field>
		<field name="query">brand assets</field>
		<field name="userid">1</field>
		<field name="timestamp">2009-11-07T14:50:54Z</field>
	</doc>
</add>

view history

<add>
	<doc>
		<field name="uuid">1</field>
		<field name="itemId">123</field>
		<field name="userid">1</field>
		<field name="timestamp">2009-11-07T14:50:54Z</field>
	</doc>
</add>


We reckon we could have around 10-30 million log records for each type
(downloads, searches, views), so around 70 million records in total, but the
solution obviously must scale higher.

Concurrent users will be around 10-20 (relatively low).

New logs will be imported as a batch overnight.
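
For reference, the overnight import itself should be nothing exotic; a
minimal sketch of what we have in mind (Python, hypothetical file names,
the default /solr/update handler assumed, one commit at the end of the
whole batch):

import urllib.request

SOLR_UPDATE = "http://localhost:8983/solr/update"  # assumed default URL

def post_xml(payload):
    # POST a raw XML payload (<add>...</add> or <commit/>) to Solr
    req = urllib.request.Request(
        SOLR_UPDATE,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    urllib.request.urlopen(req).read()

# each nightly export file already holds <add><doc>...</doc></add> batches
for path in ("downloads.xml", "searches.xml", "views.xml"):
    with open(path, encoding="utf-8") as f:
        post_xml(f.read())

post_xml("<commit/>")  # commit once at the end, not per file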

Because we have some previous experience with SOLR, and because the
interface needs full-text searching and filtering, we built a prototype
using SOLR 4.0. We used the new field collapsing feature in SOLR 4.0 to
collapse on groups of data. For example, view history needs to collapse on
itemId; each row then shows how many views the item has had, which we
derive from the number of documents in each group.
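
To illustrate, the grouping request the prototype issues looks roughly
like this (parameter names as in the current trunk field collapsing work;
values illustrative):

http://localhost:8983/solr/select?q=*:*&fq=userid:1&group=true&group.field=itemId&group.limit=1&group.ngroups=true

Each group in the response carries a doclist whose numFound is the number
of collapsed documents, i.e. the view count for that itemId.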

The requirements for the solution are to be effectively schemaless, so that
adding new fields to new documents is easy, and to have a powerful search
interface; SOLR can do both.
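
(Strictly speaking SOLR isn't schemaless; what we actually lean on is
dynamicField declarations in schema.xml, which let new documents bring
new fields without a schema change. An illustrative snippet, using the
suffix conventions from the example schema:)

	<dynamicField name="*_s"  type="string" indexed="true" stored="true"/>
	<dynamicField name="*_i"  type="int"    indexed="true" stored="true"/>
	<dynamicField name="*_dt" type="date"   indexed="true" stored="true"/>

A new field such as downloadType_s can then be indexed per document
without a schema edit or redeploy.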

QUESTIONS

Our prototype is working as expected, but I'm unsure about a few things:

1. Has anyone got experience using SOLR for log analysis?
2. SOLR can scale, but at what point should I start considering sharding
the index? I assume it should be fine with 100+ million records.
3. We are using a nightly build of SOLR for the "field collapsing" feature.
Would it be possible to patch SOLR 1.4.1 with the SOLR-236 patch instead?
Has anyone used this in production?

thanks

Re: SOLR for Log analysis feasibility

Posted by phoey <ph...@gmail.com>.
My thoughts exactly: it may seem fairly straightforward, but I fear the day
a client wants a perfectly reasonable new feature added to their report and
SOLR simply cannot support it.

I am hoping we won't run into the scalability issues Loggly did, because we
don't index and store large documents of data within SOLR; most of our
documents will be very small.

Does anyone have any experience with using field collapsing in a production
environment?

Thank you for all your replies.

Joe

Re: SOLR for Log analysis feasibility

Posted by Peter Sturge <pe...@gmail.com>.
We do a lot of precisely this sort of thing. Ours is a commercial
product (Honeycomb Lexicon) that extracts behavioural information from
logs, events and network data (don't worry, I'm not pushing this on
you!) - only to say that there are a lot of considerations beyond base
Solr when it comes to handling log, event and other 'transient' data
streams.
Aside from the obvious issues of horizontal scaling, reliable
delivery/retry/replication etc., there are other important issues,
particularly with regard to data classification, reporting engines and
numerous other items.
It's one of those things that sounds perfectly reasonable at the
outset, but all sorts of things crop up the deeper you get into it.

Peter


On Tue, Nov 30, 2010 at 11:44 AM, phoey <ph...@gmail.com> wrote:
> [quoted original message snipped]

Re: SOLR for Log analysis feasibility

Posted by Stefan Matheis <ma...@googlemail.com>.
I know it's not Solr, but perhaps you should have a look at this:
http://www.cloudera.com/blog/2010/09/using-flume-to-collect-apache-2-web-server-logs/

On Tue, Nov 30, 2010 at 12:58 PM, Peter Karich <pe...@yahoo.de> wrote:

> [quoted message snipped]
Re: SOLR for Log analysis feasibility

Posted by Peter Karich <pe...@yahoo.de>.
Take a look at this:
http://vimeo.com/16102543

For that amount of data it isn't that easy :-)

> [quoted original message snipped]


-- 
http://jetwick.com twitter search prototype