Posted to java-user@lucene.apache.org by crocket <cr...@gmail.com> on 2013/01/24 14:06:45 UTC

How do I best store my IRC log data in lucene indexes?

I have three kinds of records that I want to store, search, and retrieve.
They are for logging IRC messages.

NICK
  time=the number of seconds since the epoch, 1970-01-01 00:00:00 UTC+0
  network=
  me=0 or 1
  old=
  new=

KICKED
  time=the number of seconds since the epoch, 1970-01-01 00:00:00 UTC+0
  network=
  chan=
  msg=
  kicker=
  mynick=

MSG
  time=the number of seconds since the epoch, 1970-01-01 00:00:00 UTC+0
  network=
  chan=
  msg=
  me=0 or 1
  nick=

Below are ideas for the IRC log search web UI.

[] Main UI : network("", freenode, ...) | channel("", ...) | nick | message
  1) network and channel are dropdown boxes; nick and message are text boxes.
  2) duration, network, and nick can be applied to every record type.
  3) channel and message apply only to KICKED and MSG.
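
As a rough illustration of rule 3, the field lists above can be turned into
a lookup that tells the UI which record types a given search field can
match. This is only a sketch: the class IrcFields and the method typesWith
are hypothetical names, and only the field names are taken from the record
layouts above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; none of these names come from Lucene.
final class IrcFields {
    enum RecordType { NICK, KICKED, MSG }

    // Field names per record type, copied from the layouts in the post.
    private static final Map<RecordType, Set<String>> FIELDS =
            new EnumMap<RecordType, Set<String>>(RecordType.class);

    static {
        FIELDS.put(RecordType.NICK,
                fields("time", "network", "me", "old", "new"));
        FIELDS.put(RecordType.KICKED,
                fields("time", "network", "chan", "msg", "kicker", "mynick"));
        FIELDS.put(RecordType.MSG,
                fields("time", "network", "chan", "msg", "me", "nick"));
    }

    private static Set<String> fields(String... names) {
        return Collections.unmodifiableSet(new HashSet<String>(Arrays.asList(names)));
    }

    /** Returns the record types that carry the given field. */
    static Set<RecordType> typesWith(String field) {
        Set<RecordType> result = EnumSet.noneOf(RecordType.class);
        for (Map.Entry<RecordType, Set<String>> e : FIELDS.entrySet()) {
            if (e.getValue().contains(field)) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

With this table, typesWith("chan") yields KICKED and MSG, while
typesWith("network") yields all three types.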

[] Facets
  1) duration (1 day, 1 week, 1 month, 1 year, all) <-- just like Google search tools
  2) ...
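
Since time is stored as seconds since the epoch, each duration choice
reduces to a lower bound on that field. A minimal sketch, assuming a month
means 30 days and a year means 365 days (the post does not say); the names
DurationFacet, Window, and lowerBound are made up for illustration.

```java
// Hypothetical helper for illustration; not part of any library.
final class DurationFacet {
    enum Window {
        DAY(1), WEEK(7), MONTH(30), YEAR(365), ALL(-1);

        final long days; // -1 marks "all", meaning no lower bound

        Window(long days) {
            this.days = days;
        }
    }

    static final long SECONDS_PER_DAY = 24L * 60 * 60;

    /**
     * Smallest value of the time field (epoch seconds) that still falls
     * inside the chosen window, counted back from nowEpochSeconds.
     */
    static long lowerBound(Window window, long nowEpochSeconds) {
        if (window == Window.ALL) {
            return Long.MIN_VALUE; // no restriction on time
        }
        return nowEpochSeconds - window.days * SECONDS_PER_DAY;
    }
}
```

The returned bound, together with the current time as the upper bound,
could then be used as a range restriction on the time field.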

[] Category search (categories registered as facets)
  1) network
  2) channel

Is it better to store NICK, KICKED, and MSG in one index directory or to
store them in separate index directories?

Are there other things that I should know or consider?

Re: How do I best store my IRC log data in lucene indexes?

Posted by Ian Lea <ia...@gmail.com>.
Adding a message type field is the way to do it.

Then you can use QueryWrapperFilter and CachingWrapperFilter, something like

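// Match documents whose messtype field holds exactly this value.
Term t = new Term("messtype", messtype);
TermQuery tq = new TermQuery(t);
// Wrap the query as a filter, then cache its result for reuse across searches.
QueryWrapperFilter qwf = new QueryWrapperFilter(tq);
CachingWrapperFilter cwf = new CachingWrapperFilter(qwf);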

and use cwf as the filter in your searches.  Make sure that the values
for messtype in the query are specified as they are stored in the
index. "NICK" != "nick".
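
To show roughly why a cached filter is fast, here is a toy version in plain
Java with no Lucene involved. It is not how CachingWrapperFilter is
implemented; it only illustrates the idea of computing the set of matching
documents for a message type once and reusing it afterwards. The class
TypeFilterCache and its methods are hypothetical, and the "index" is just
an array of messtype values indexed by document number.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Toy illustration only; not Lucene's implementation.
final class TypeFilterCache {
    private final String[] messtypes; // messtypes[doc] = stored type of that document
    private final Map<String, BitSet> cache = new HashMap<String, BitSet>();
    private int scans = 0; // how many times the whole "index" was scanned

    TypeFilterCache(String[] messtypes) {
        this.messtypes = messtypes.clone();
    }

    /** Documents whose messtype equals the given value exactly (case-sensitive). */
    BitSet docsOfType(String messtype) {
        BitSet bits = cache.get(messtype);
        if (bits == null) {
            // First request for this type: scan once and remember the result.
            bits = new BitSet(messtypes.length);
            for (int doc = 0; doc < messtypes.length; doc++) {
                if (messtypes[doc].equals(messtype)) {
                    bits.set(doc);
                }
            }
            scans++;
            cache.put(messtype, bits);
        }
        return (BitSet) bits.clone(); // callers get a copy they may modify
    }

    int scans() {
        return scans;
    }
}
```

Asking for "MSG" twice scans the data only once, and asking for "nick"
when the stored value is "NICK" matches nothing, which is the
case-sensitivity point made above.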


All beginners should read Lucene In Action.


--
Ian.


On Fri, Jan 25, 2013 at 12:55 PM, crocket <cr...@gmail.com> wrote:
> Do you mean
> http://lucene.apache.org/core/4_1_0/core/org/apache/lucene/analysis/CachingTokenFilter.html
> by a cached filter?
> And how would you restrict searches to particular message types fast with a
> cached filter?
> I'm a beginner.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: How do I best store my IRC log data in lucene indexes?

Posted by crocket <cr...@gmail.com>.
Do you mean
http://lucene.apache.org/core/4_1_0/core/org/apache/lucene/analysis/CachingTokenFilter.html
by a cached filter?
And how would you restrict searches to particular message types fast with a
cached filter?
I'm a beginner.


On Fri, Jan 25, 2013 at 6:51 PM, Ian Lea <ia...@gmail.com> wrote:

> Unless there's good reason not to (massive size?  different systems?
> conflicting update schedules?) I'd store everything in the one index.
>
> Consider a cached filter for fast restriction of searches to
> particular message types.

Re: How do I best store my IRC log data in lucene indexes?

Posted by crocket <cr...@gmail.com>.
How do you propose I distinguish the message types if I put all of them in
one index directory?
I thought of adding a message type field, but that doesn't seem like a good
approach.


On Fri, Jan 25, 2013 at 6:51 PM, Ian Lea <ia...@gmail.com> wrote:

> Unless there's good reason not to (massive size?  different systems?
> conflicting update schedules?) I'd store everything in the one index.
>
> Consider a cached filter for fast restriction of searches to
> particular message types.

Re: How do I best store my IRC log data in lucene indexes?

Posted by Ian Lea <ia...@gmail.com>.
Unless there's good reason not to (massive size?  different systems?
conflicting update schedules?) I'd store everything in the one index.

Consider a cached filter for fast restriction of searches to
particular message types.


--
Ian.

