Posted to java-user@lucene.apache.org by Joe Attardi <ja...@gmail.com> on 2007/08/01 17:31:50 UTC

More IP/MAC indexing questions

Hi again, everyone. First of all, I want to thank everyone for their
extremely helpful replies so far.
Also, I just started reading the book "Lucene in Action" last night. So far
it's an awesome book, so a big thanks to the authors.

Anyhow, on to my question. As I've mentioned in several of my previous
messages, I am indexing different pieces of information about servers - in
particular, my question is about indexing the IP address and MAC address.

Using the StandardAnalyzer, an IP is kept as a single token ("192.168.1.100"),
and a MAC is broken up into one token per octet ("00", "17", "fd", "14",
"d3", "2a"). Many searches will be for partial IPs or MACs ("192.168",
"00:17:fd", etc).

Is either of these methods of indexing the addresses (single token vs.
per-octet token) more efficient than the other when indexing large
numbers of these?

-- 
Joe Attardi
jattardi@gmail.com
http://thinksincode.blogspot.com/

Re: More IP/MAC indexing questions

Posted by Erick Erickson <er...@gmail.com>.

I suspect you're going to have to deal with wildcards if you really want
this functionality.

Erick

On 8/1/07, Joe Attardi <ja...@gmail.com> wrote:
>
> On 8/1/07, Erick Erickson <er...@gmail.com> wrote:
> >
> > Use a SpanNearQuery with a slop of 0 and specify true for ordering.
> > What that will do is require that the segments you specify must appear
> > in order with no gaps. You have to construct this yourself since there's
> > no support for SpanQueries in the QueryParser yet. This'll avoid having
> > to deal with Wildcards, which have their own issues (try searching on
> > a thread "I just don't understand wildcards at all" for an exposition
> > from "the guys" on this.
>
>
> Thanks Erick, I'll try this. My only other question here though, is what
> if they specify an incomplete octet of an address? For example, I want
> '192.168.10' to match 192.168.10.1 and 192.168.100.1. How can I do this
> without wildcards, is there a way to put a PrefixQuery into the Span
> Query?
>
> Sorry if I don't make any sense
>

Re: More IP/MAC indexing questions

Posted by Mike Klaas <mi...@gmail.com>.

On 1-Aug-07, at 11:34 AM, Joe Attardi wrote:

> On 8/1/07, Erick Erickson <er...@gmail.com> wrote:
>>
>> Use a SpanNearQuery with a slop of 0 and specify true for ordering.
>> What that will do is require that the segments you specify must appear
>> in order with no gaps. You have to construct this yourself since there's
>> no support for SpanQueries in the QueryParser yet. This'll avoid having
>> to deal with Wildcards, which have their own issues (try searching on
>> a thread "I just don't understand wildcards at all" for an exposition from
>> "the guys" on this.
>
>
> Thanks Erick, I'll try this. My only other question here though, is what if
> they specify an incomplete octet of an address? For example, I want
> '192.168.10' to match 192.168.10.1 and 192.168.100.1. How can I do this
> without wildcards, is there a way to put a PrefixQuery into the Span Query?

If 192 168 10 1 are separate tokens, then a phrase query on "192 168 10"
will find it. If it is a single token, then a wildcard or regex query is
necessary.

-Mike
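
The difference between a whole-token match and the incomplete-octet case
Joe asks about can be sketched in plain Java. This is only an illustration
of the matching semantics, not Lucene code: the class and method names are
hypothetical, and the match is assumed to be anchored at the start of the
address for brevity. An exact token match on "10" does not match the token
"100", so the last query token has to be treated as a prefix to cover both
addresses.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of Lucene.
public final class PartialOctetMatch {

    /** Splits a dotted IP address into one token per octet. */
    public static List<String> tokenize(String address) {
        return Arrays.asList(address.split("\\."));
    }

    /** Whole-token match: every query token must equal the leading address tokens. */
    public static boolean startsWithTokens(List<String> doc, List<String> query) {
        return query.size() <= doc.size()
                && doc.subList(0, query.size()).equals(query);
    }

    /** Same as above, except the last query token only needs to be a prefix. */
    public static boolean startsWithPartialLast(List<String> doc, List<String> query) {
        if (query.isEmpty()) {
            return true;
        }
        if (query.size() > doc.size()) {
            return false;
        }
        int last = query.size() - 1;
        return doc.subList(0, last).equals(query.subList(0, last))
                && doc.get(last).startsWith(query.get(last));
    }

    public static void main(String[] args) {
        List<String> query = tokenize("192.168.10");
        for (String ip : new String[] {"192.168.10.1", "192.168.100.1"}) {
            System.out.println(ip
                    + " whole-token=" + startsWithTokens(tokenize(ip), query)
                    + " partial-last=" + startsWithPartialLast(tokenize(ip), query));
        }
    }
}
```

With whole tokens only, '192.168.10' finds 192.168.10.1 but not
192.168.100.1; treating the final token as a prefix finds both.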

Re: More IP/MAC indexing questions

Posted by Joe Attardi <ja...@gmail.com>.

On 8/1/07, Erick Erickson <er...@gmail.com> wrote:
>
> Use a SpanNearQuery with a slop of 0 and specify true for ordering.
> What that will do is require that the segments you specify must appear
> in order with no gaps. You have to construct this yourself since there's
> no support for SpanQueries in the QueryParser yet. This'll avoid having
> to deal with Wildcards, which have their own issues (try searching on
> a thread "I just don't understand wildcards at all" for an exposition from
> "the guys" on this.

Thanks Erick, I'll try this. My only other question here, though, is what if
they specify an incomplete octet of an address? For example, I want
'192.168.10' to match 192.168.10.1 and 192.168.100.1. How can I do this
without wildcards? Is there a way to put a PrefixQuery into the Span Query?

Sorry if I don't make any sense

Re: More IP/MAC indexing questions

Posted by Erick Erickson <er...@gmail.com>.

Think of a custom analyzer class rather than a custom query parser. The
QueryParser uses your analyzer, so it all just "comes along".

Here's the approach I'd try first, off the top of my head....

Yes, break the IP addresses, etc., up into octets and index them
tokenized.

Use a SpanNearQuery with a slop of 0 and specify true for ordering.
What that will do is require that the segments you specify must appear
in order with no gaps. You have to construct this yourself since there's
no support for SpanQueries in the QueryParser yet. This'll avoid having
to deal with wildcards, which have their own issues (try searching for
the thread "I just don't understand wildcards at all" for an exposition from
"the guys" on this).
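
What "slop of 0, in order" means for per-octet tokens can be sketched in
plain Java. This is a stand-alone illustration of the matching rule only,
not the Lucene API; the class and method names are hypothetical, and
splitting on '.' and ':' is an assumed tokenization.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of Lucene.
public final class OrderedTokenMatch {

    /** Splits "192.168.1.100" or "00:17:fd:14:d3:2a" into lowercase per-octet tokens. */
    public static List<String> tokenize(String address) {
        return Arrays.asList(address.toLowerCase().split("[.:]"));
    }

    /** True if the query tokens occur in the address tokens in order with no gaps. */
    public static boolean containsInOrder(List<String> doc, List<String> query) {
        if (query.isEmpty()) {
            return true;
        }
        // Try every start position where the whole query could still fit.
        for (int start = 0; start + query.size() <= doc.size(); start++) {
            if (doc.subList(start, start + query.size()).equals(query)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> ip = tokenize("192.168.1.100");
        System.out.println(containsInOrder(ip, tokenize("192.168"))); // adjacent, in order
        System.out.println(containsInOrder(ip, tokenize("168.192"))); // wrong order
        System.out.println(containsInOrder(ip, tokenize("192.1")));   // gap between tokens
    }
}
```

Only the first query matches: the other two fail because the tokens are
either out of order or separated by another token.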

Best
Erick

On 8/1/07, Joe Attardi <ja...@gmail.com> wrote:
>
> Hi Erick,
>
> > First, consider using your own analyzer and/or breaking the IP addresses
> > up by substituting ' ' for '.' upon input.
>
> Do you mean breaking the IP up into one token for each segment, like
> ["192", "168", "1", "100"] ?
>
>
>
> > But on to your question. Please post what you mean by
> > "a large number". 10,000? 1,000,000,000? we have no clue
> > from your posts so far...
>
> I apologize for the lack of details. A large part of the data will be
> wireless MAC addresses detected over the air, so it depends on the site.
> But I suppose, worst case, we're looking at thousands or tens of thousands.
> Comparatively speaking, then, I guess it's not such a large number
> compared to some of the other questions discussed on the list.
>
> > That said, efficiency is hugely overrated at this stage of your
> > design. I'd personally use whatever is easiest and run some
> > tests.
> >
> > Just index them as single (unbroken) tokens to start and search
> > your partial address with PrefixQuery.
>
> This is what I was thinking originally, too. Although there could be times
> where they are searching for a piece at the end of the address, which is
> why my original post had me building a WildcardQuery.
>
> The system will be searching log messages, too, and for that I'll use the
> more normal StandardAnalyzer/QueryParser approach.
>
> So what I am thinking of doing going forward is creating a custom query
> parser class, that basically has special cases (IP addresses, MAC
> addresses) where the query must be more customized, and in the other cases
> fall through to the standard QueryParser class. Does this sound like a good
> idea?
>
> Thanks again for your continued help!
>

Re: More IP/MAC indexing questions

Posted by Joe Attardi <ja...@gmail.com>.

Hi Erick,

> First, consider using your own analyzer and/or breaking the IP addresses
> up by substituting ' ' for '.' upon input.

Do you mean breaking the IP up into one token for each segment, like ["192",
"168", "1", "100"] ?



> But on to your question. Please post what you mean by
> "a large number". 10,000? 1,000,000,000? we have no clue
> from your posts so far...

I apologize for the lack of details. A large part of the data will be
wireless MAC addresses detected over the air, so it depends on the site. But
I suppose, worst case, we're looking at thousands or tens of thousands.
Comparatively speaking, then, I guess it's not such a large number compared
to some of the other questions discussed on the list.

> That said, efficiency is hugely overrated at this stage of your
> design. I'd personally use whatever is easiest and run some
> tests.
>
> Just index them as single (unbroken) tokens to start and search
> your partial address with PrefixQuery.

This is what I was thinking originally, too. Although there could be times
where they are searching for a piece at the end of the address, which is why
my original post had me building a WildcardQuery.

The system will be searching log messages, too, and for that I'll use the
more normal StandardAnalyzer/QueryParser approach.

So what I am thinking of doing going forward is creating a custom query
parser class, that basically has special cases (IP addresses, MAC addresses)
where the query must be more customized, and in the other cases fall through
to the standard QueryParser class. Does this sound like a good idea?

Thanks again for your continued help!

Re: More IP/MAC indexing questions

Posted by Erick Erickson <er...@gmail.com>.

First, consider using your own analyzer and/or breaking the IP addresses
up by substituting ' ' for '.' upon input. Otherwise, you'll have endless
issues as time passes......

But on to your question. Please post what you mean by
"a large number". 10,000? 1,000,000,000? we have no clue
from your posts so far...

That said, efficiency is hugely overrated at this stage of your
design. I'd personally use whatever is easiest and run some
tests.

Just index them as single (unbroken) tokens to start and search
your partial address with PrefixQuery. Or index them as
individual tokens and create a SpanFirstQuery. Or...

And measure <G>.
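
For the single-token option, prefix matching over a sorted set of whole
addresses can be sketched in plain Java. This is only an analogy for what a
prefix search over unbroken tokens does, not Lucene code; the class name,
method name, and sample addresses are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration class; not part of Lucene.
public final class SingleTokenPrefix {

    /** Returns every term in the sorted set that begins with the given prefix. */
    public static List<String> withPrefix(TreeSet<String> terms, String prefix) {
        List<String> hits = new ArrayList<>();
        // Terms sharing a prefix are contiguous in sorted order, so start at
        // the first term >= prefix and stop at the first one that no longer matches.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            hits.add(term);
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>(Arrays.asList(
                "10.0.0.1", "192.168.10.1", "192.168.100.1", "192.168.2.5"));
        System.out.println(withPrefix(terms, "192.168.10"));
    }
}
```

Because the address is a single string here, the prefix "192.168.10" covers
both 192.168.10.1 and 192.168.100.1. A search for a piece at the end of an
address is not a prefix match and would need a different approach.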

Best
Erick

On 8/1/07, Joe Attardi <ja...@gmail.com> wrote:
>
> Hi again, everyone. First of all, I want to thank everyone for their
> extremely helpful replies so far.
> Also, I just started reading the book "Lucene in Action" last night. So
> far it's an awesome book, so a big thanks to the authors.
>
> Anyhow, on to my question. As I've mentioned in several of my previous
> messages, I am indexing different pieces of information about servers - in
> particular, my question is about indexing the IP address and MAC address.
>
> Using the StandardAnalyzer, an IP is kept as a single token
> ("192.168.1.100"),
> and a MAC is broken up into one token per octet ("00", "17", "fd", "14",
> "d3", "2a"). Many searches will be for partial IPs or MACs ("192.168",
> "00:17:fd", etc).
>
> Are either of these methods of indexing the addresses (single token vs
> per-octet token) more or less efficient than the other when indexing large
> numbers of these?
>
> --
> Joe Attardi
> jattardi@gmail.com
> http://thinksincode.blogspot.com/
>