Posted to java-user@lucene.apache.org by Gergely Nagy <fo...@gmail.com> on 2015/02/09 08:53:38 UTC

Indexing and searching a DateTime range

Hi Lucene users,

I am at the beginning of implementing a Lucene application that is supposed
to search through some log files.

One of the requirements is to return results between a time range. Let's
say these are two lines in a series of log files:
2015-02-08 00:02:06.852Z INFO...
...
2015-02-08 18:02:04.012Z INFO...

Now I need to search for these lines and return all the text in-between. I
was using this demo application to build an index:
http://lucene.apache.org/core/4_10_3/demo/src-html/org/apache/lucene/demo/IndexFiles.html

After that my first thought was using a term range query like this:
        TermRangeQuery query = TermRangeQuery.newStringRange("contents",
"2015-02-08 00:02:06.852Z", "2015-02-08 18:02:04.012Z", true, true);

But for some reason this didn't return any results.

I then spent a while Googling for a way to solve this problem, but all the
datetime examples I found search on a much simpler field.
Those examples usually use a field like this:
doc.add(new LongField("modified", file.lastModified(), Field.Store.NO));

So I was wondering, how can I index these log files to make a range query
work on them? Any ideas? Maybe my approach is completely wrong. I am still
new to Lucene so any help is appreciated.

Thank you.

Gergely Nagy

Re: Indexing and searching a DateTime range

Posted by Gergely Nagy <fo...@gmail.com>.
Thank you Uwe!

Your reply is very useful and insightful. Your workflow matches my
requirements exactly.

My confusion came from the fact that I didn't understand what the Analyzers
do. Actually, I am still wondering: wouldn't it be possible to provide an
abstraction on the Lucene side so that this preprocessing becomes part of
Lucene's own indexing mechanism?

It seems quite odd that I have to do something that is strongly coupled to
Lucene (building the index) outside of Lucene.

In any case, I will take your advice. I hope other people will also find
this info useful.

Best regards,
Gergely Nagy

2015-02-10 18:35 GMT+09:00 Uwe Schindler <uw...@thetaphi.de>:

> Hi,
>
> > OK. I found the Alfresco code on GitHub. So it's open source it seems.
> >
> > And I found the DateTimeAnalyser, so I will just take that code as a
> starting
> > point:
> >
> https://github.com/lsbueno/alfresco/tree/master/root/projects/repository/
> > source/java/org/alfresco/repo/search/impl/lucene/analysis
>
> This won't help you:
> a) its outdated code from very early Lucene versions
> b) it would be slow, because it does not use the numeric features of
> Lucene, so your code would be very slow if you search for date ranges
>
> Basically, I don't really understand your problem:
> If you use Lucene directly you are responsible for processing the text
> before it goes into the index. If you want to create a Lucene Document per
> Line, it is your turn to do this. Lucene has no functionality to split
> documents. You have to process your input and bring it into a format that
> Lucene wants: "Documents" consisting of "Key/Value" pairs. Analyzers are
> only there for processing one specific field and tokenize the input (so the
> index contains words and not the whole field as one term). Analyzers have
> nothing to do with Analysis of the structure of Log lines (because they
> would only work on one field, which does not help for structured queries
> like on date).
>
> So basically your indexing workflow is:
>
> - Open Log file
> - Read log file line by line
> - Create a Lucene IndexDocument instance
> - Extract "interesting" key/value pairs from your log file, e.g. by using
> regular expressions (like Logstash does). Basically this would for example
> "detect" the date, class name from Log4J files, or whatever else
> - Put those key/value pairs as fields (numeric, text,...)  to the Lucene
> IndexDocument: One field for the date, one field for message content, one
> field for classname,... (those fields don't need to be stored, unless you
> want to display only them in search results, see below).
> - In addition, it is wise to add an additional Lucene TextField instance
> (that is also STORED=TRUE, INDEXED=TRUE with good Analyzer) that contains
> the whole line (redundant). By STORING it, you are able to return the whole
> log line in your search results
> - Index the document
> - Process next line
>
> If you don't want to write this code on your own, use Logstash and
> Elasticsearch (or write a separate plugin for Logstash that indexes to
> lucene). But your comment is strange: You say: Elasticsearch and Logstah is
> too slow for many log lines. How should then Lucene be faster?
> Elasticsearch also uses Lucene under the hood. The main problem if its slow
> is in most cases incorrect data types while indexing (like using a text
> field for dates and doing ranges). It is the same like indexing a number in
> a relational database as String and then do "like" queries instead of real
> numeric comparisons - just wrong and slow.
>
> Uwe
>
> > Thank you for everybody for the time to respond.
> >
> > 2015-02-10 9:55 GMT+09:00 Gergely Nagy <fo...@gmail.com>:
> >
> > > Thank you Barry, I really appreciate your time to respond,
> > >
> > > Let me clarify this a little bit more. I think it was not clear.
> > >
> > > I know how to parse dates, this is not the question here. (See my
> > > previous
> > > email: "how can I pipe my converter logic into the indexing process?")
> > >
> > > All of your solutions guys would work fine if I wanted to index
> > > per-document. Which I do NOT want to do. What I would like to do to
> > > index per log line.
> > >
> > > I need to do a full text search, but with the additional requirement
> > > to filter those search hits by DateTime range.
> > >
> > > I hope this makes it clearer. So any suggestions how to do that?
> > >
> > > Sidenote: I saw that Alfresco implemented this analyzer, called
> > > DateTimeAnalyzer, but Alfresco is not open source. So I was wondering
> > > how to implement the same. Actually after wondering for 2 days, I
> > > became convinced that writing an Analyzer should be the way to go. I
> > > will post my solution later if I have a working code.
> > >
> > > 2015-02-10 8:50 GMT+09:00 Barry Coughlan <b....@gmail.com>:
> > >
> > >> Hi Gergely,
> > >>
> > >> Writing an analyzer would work but it is unnecessarily complicated.
> > >> You could just parse the date from the string in your input code and
> > >> index it in the LongField like this:
> > >>
> > >> SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd
> > >> HH:mm:ss.S'Z'"); format.setTimeZone(TimeZone.getTimeZone("UTC"));
> > >> long t = format.parse("2015-02-08 00:02:06.123Z INFO...").getTime();
> > >>
> > >> Barry
> > >>
> > >> On Tue, Feb 10, 2015 at 12:21 AM, Gergely Nagy <fo...@gmail.com>
> > wrote:
> > >>
> > >> > Thank you for taking your time to respond Karthik,
> > >> >
> > >> > Can you show me an example how to convert DateTime to milliseconds?
> > >> > I
> > >> mean
> > >> > how can I pipe my converter logic into the indexing process?
> > >> >
> > >> > I suspect I need to write my own Analyzer/Tokenizer to achieve
> > >> > this. Is this correct?
> > >> >
> > >> > 2015-02-09 22:58 GMT+09:00 KARTHIK SHIVAKUMAR
> > <ns...@gmail.com>:
> > >> >
> > >> > > Hi
> > >> > >
> > >> > > Long time ago,.. I used to store datetime in millisecond .
> > >> > >
> > >> > > TermRangequery used to work in perfect condition....
> > >> > >
> > >> > > Convert all datetime to millisecond and index the same.
> > >> > >
> > >> > > On search condition again convert datetime to millisecond and use
> > >> > > TermRangequery.
> > >> > >
> > >> > > With regards
> > >> > > Karthik
> > >> > > On Feb 9, 2015 1:24 PM, "Gergely Nagy" <fo...@gmail.com> wrote:
> > >> > >
> > >> > > > Hi Lucene users,
> > >> > > >
> > >> > > > I am in the beginning of implementing a Lucene application
> > >> > > > which
> > >> would
> > >> > > > supposedly search through some log files.
> > >> > > >
> > >> > > > One of the requirements is to return results between a time
> range.
> > >> > Let's
> > >> > > > say these are two lines in a series of log files:
> > >> > > > 2015-02-08 00:02:06.852Z INFO...
> > >> > > > ...
> > >> > > > 2015-02-08 18:02:04.012Z INFO...
> > >> > > >
> > >> > > > Now I need to search for these lines and return all the text
> > >> > in-between.
> > >> > > I
> > >> > > > was using this demo application to build an index:
> > >> > > >
> > >> > > >
> > >> > >
> > >> >
> > >> http://lucene.apache.org/core/4_10_3/demo/src-
> > html/org/apache/lucene/
> > >> demo/IndexFiles.html
> > >> > > >
> > >> > > > After that my first thought was using a term range query like
> this:
> > >> > > >         TermRangeQuery query =
> > >> > TermRangeQuery.newStringRange("contents",
> > >> > > > "2015-02-08 00:02:06.852Z", "2015-02-08 18:02:04.012Z", true,
> > >> > > > true);
> > >> > > >
> > >> > > > But for some reason this didn't return any results.
> > >> > > >
> > >> > > > Then I was Googling for a while how to solve this problem, but
> > >> > > > all
> > >> the
> > >> > > > datetime examples I found are searching based on a much simpler
> > >> field.
> > >> > > > Those examples usually use a field like this:
> > >> > > > doc.add(new LongField("modified", file.lastModified(),
> > >> Field.Store.NO
> > >> > ));
> > >> > > >
> > >> > > > So I was wondering, how can I index these log files to make a
> > >> > > > range
> > >> > query
> > >> > > > work on them? Any ideas? Maybe my approach is completely
> > wrong.
> > >> > > > I am
> > >> > > still
> > >> > > > new to Lucene so any help is appreciated.
> > >> > > >
> > >> > > > Thank you.
> > >> > > >
> > >> > > > Gergely Nagy
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

RE: Indexing and searching a DateTime range

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

> OK. I found the Alfresco code on GitHub. So it's open source it seems.
> 
> And I found the DateTimeAnalyser, so I will just take that code as a starting
> point:
> https://github.com/lsbueno/alfresco/tree/master/root/projects/repository/
> source/java/org/alfresco/repo/search/impl/lucene/analysis

This won't help you:
a) it's outdated code from very early Lucene versions
b) it does not use the numeric features of Lucene, so your date range searches would be very slow

Basically, I don't really understand your problem:
If you use Lucene directly, you are responsible for processing the text before it goes into the index. If you want to create a Lucene Document per line, it is up to you to do this. Lucene has no functionality to split documents. You have to process your input and bring it into the format that Lucene wants: "Documents" consisting of "Key/Value" pairs. Analyzers are only there for processing one specific field and tokenizing the input (so the index contains words and not the whole field as one term). Analyzers have nothing to do with analyzing the structure of log lines (because they only work on one field, which does not help for structured queries like on dates).

So basically your indexing workflow is:

- Open Log file
- Read log file line by line
- Create a Lucene IndexDocument instance
- Extract "interesting" key/value pairs from your log file, e.g. by using regular expressions (like Logstash does). Basically this would for example "detect" the date, class name from Log4J files, or whatever else
- Put those key/value pairs as fields (numeric, text,...)  to the Lucene IndexDocument: One field for the date, one field for message content, one field for classname,... (those fields don't need to be stored, unless you want to display only them in search results, see below).
- In addition, it is wise to add an additional Lucene TextField instance (that is also STORED=TRUE, INDEXED=TRUE with good Analyzer) that contains the whole line (redundant). By STORING it, you are able to return the whole log line in your search results
- Index the document
- Process next line
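
A minimal sketch of that loop against the Lucene 4.10 API could look like the
following. The regular expression, the field names ("timestamp", "level",
"message", "line") and the file paths are only assumptions for illustration;
the timestamp format matches the log lines quoted earlier.

import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class LogLineIndexer {

    // assumed line layout: "2015-02-08 00:02:06.852Z LEVEL message..."
    private static final Pattern LINE =
            Pattern.compile("^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d{3}Z) (\\w+) (.*)$");

    public static void main(String[] args) throws IOException, ParseException {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS'Z'");
        format.setTimeZone(TimeZone.getTimeZone("UTC"));

        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_3,
                new StandardAnalyzer(Version.LUCENE_4_10_3));
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(new File("index")), config);
             BufferedReader reader = Files.newBufferedReader(Paths.get("app.log"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                Matcher m = LINE.matcher(line);
                if (!m.matches()) {
                    continue; // e.g. stack trace continuation; real code would append it to the previous doc
                }
                Document doc = new Document();
                // numeric timestamp: not stored, but indexed for fast range queries
                doc.add(new LongField("timestamp", format.parse(m.group(1)).getTime(), Field.Store.NO));
                // the extracted key/value pairs as separate fields
                doc.add(new TextField("level", m.group(2), Field.Store.NO));
                doc.add(new TextField("message", m.group(3), Field.Store.NO));
                // the whole line, tokenized AND stored, so it can be shown in search results
                doc.add(new TextField("line", line, Field.Store.YES));
                writer.addDocument(doc);
            }
        }
    }
}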

If you don't want to write this code on your own, use Logstash and Elasticsearch (or write a separate plugin for Logstash that indexes to Lucene). But your comment is strange: you say Elasticsearch and Logstash are too slow for many log lines. How, then, should Lucene be faster? Elasticsearch also uses Lucene under the hood. If it is slow, the main problem is in most cases incorrect data types while indexing (like using a text field for dates and doing ranges on it). It is the same as indexing a number in a relational database as a String and then doing "like" queries instead of real numeric comparisons - just wrong and slow.

Uwe

> Thank you for everybody for the time to respond.
> 
> 2015-02-10 9:55 GMT+09:00 Gergely Nagy <fo...@gmail.com>:
> 
> > Thank you Barry, I really appreciate your time to respond,
> >
> > Let me clarify this a little bit more. I think it was not clear.
> >
> > I know how to parse dates, this is not the question here. (See my
> > previous
> > email: "how can I pipe my converter logic into the indexing process?")
> >
> > All of your solutions guys would work fine if I wanted to index
> > per-document. Which I do NOT want to do. What I would like to do to
> > index per log line.
> >
> > I need to do a full text search, but with the additional requirement
> > to filter those search hits by DateTime range.
> >
> > I hope this makes it clearer. So any suggestions how to do that?
> >
> > Sidenote: I saw that Alfresco implemented this analyzer, called
> > DateTimeAnalyzer, but Alfresco is not open source. So I was wondering
> > how to implement the same. Actually after wondering for 2 days, I
> > became convinced that writing an Analyzer should be the way to go. I
> > will post my solution later if I have a working code.
> >
> > 2015-02-10 8:50 GMT+09:00 Barry Coughlan <b....@gmail.com>:
> >
> >> Hi Gergely,
> >>
> >> Writing an analyzer would work but it is unnecessarily complicated.
> >> You could just parse the date from the string in your input code and
> >> index it in the LongField like this:
> >>
> >> SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd
> >> HH:mm:ss.S'Z'"); format.setTimeZone(TimeZone.getTimeZone("UTC"));
> >> long t = format.parse("2015-02-08 00:02:06.123Z INFO...").getTime();
> >>
> >> Barry
> >>
> >> On Tue, Feb 10, 2015 at 12:21 AM, Gergely Nagy <fo...@gmail.com>
> wrote:
> >>
> >> > Thank you for taking your time to respond Karthik,
> >> >
> >> > Can you show me an example how to convert DateTime to milliseconds?
> >> > I
> >> mean
> >> > how can I pipe my converter logic into the indexing process?
> >> >
> >> > I suspect I need to write my own Analyzer/Tokenizer to achieve
> >> > this. Is this correct?
> >> >
> >> > 2015-02-09 22:58 GMT+09:00 KARTHIK SHIVAKUMAR
> <ns...@gmail.com>:
> >> >
> >> > > Hi
> >> > >
> >> > > Long time ago,.. I used to store datetime in millisecond .
> >> > >
> >> > > TermRangequery used to work in perfect condition....
> >> > >
> >> > > Convert all datetime to millisecond and index the same.
> >> > >
> >> > > On search condition again convert datetime to millisecond and use
> >> > > TermRangequery.
> >> > >
> >> > > With regards
> >> > > Karthik
> >> > > On Feb 9, 2015 1:24 PM, "Gergely Nagy" <fo...@gmail.com> wrote:
> >> > >
> >> > > > Hi Lucene users,
> >> > > >
> >> > > > I am in the beginning of implementing a Lucene application
> >> > > > which
> >> would
> >> > > > supposedly search through some log files.
> >> > > >
> >> > > > One of the requirements is to return results between a time range.
> >> > Let's
> >> > > > say these are two lines in a series of log files:
> >> > > > 2015-02-08 00:02:06.852Z INFO...
> >> > > > ...
> >> > > > 2015-02-08 18:02:04.012Z INFO...
> >> > > >
> >> > > > Now I need to search for these lines and return all the text
> >> > in-between.
> >> > > I
> >> > > > was using this demo application to build an index:
> >> > > >
> >> > > >
> >> > >
> >> >
> >> http://lucene.apache.org/core/4_10_3/demo/src-
> html/org/apache/lucene/
> >> demo/IndexFiles.html
> >> > > >
> >> > > > After that my first thought was using a term range query like this:
> >> > > >         TermRangeQuery query =
> >> > TermRangeQuery.newStringRange("contents",
> >> > > > "2015-02-08 00:02:06.852Z", "2015-02-08 18:02:04.012Z", true,
> >> > > > true);
> >> > > >
> >> > > > But for some reason this didn't return any results.
> >> > > >
> >> > > > Then I was Googling for a while how to solve this problem, but
> >> > > > all
> >> the
> >> > > > datetime examples I found are searching based on a much simpler
> >> field.
> >> > > > Those examples usually use a field like this:
> >> > > > doc.add(new LongField("modified", file.lastModified(),
> >> Field.Store.NO
> >> > ));
> >> > > >
> >> > > > So I was wondering, how can I index these log files to make a
> >> > > > range
> >> > query
> >> > > > work on them? Any ideas? Maybe my approach is completely
> wrong.
> >> > > > I am
> >> > > still
> >> > > > new to Lucene so any help is appreciated.
> >> > > >
> >> > > > Thank you.
> >> > > >
> >> > > > Gergely Nagy
> >> > > >
> >> > >
> >> >
> >>
> >
> >


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Indexing and searching a DateTime range

Posted by Gergely Nagy <fo...@gmail.com>.
OK. I found the Alfresco code on GitHub. So it's open source it seems.

And I found the DateTimeAnalyser, so I will just take that code as a
starting point:
https://github.com/lsbueno/alfresco/tree/master/root/projects/repository/source/java/org/alfresco/repo/search/impl/lucene/analysis

Thank you to everybody for taking the time to respond.

2015-02-10 9:55 GMT+09:00 Gergely Nagy <fo...@gmail.com>:

> Thank you Barry, I really appreciate your time to respond,
>
> Let me clarify this a little bit more. I think it was not clear.
>
> I know how to parse dates, this is not the question here. (See my previous
> email: "how can I pipe my converter logic into the indexing process?")
>
> All of your solutions guys would work fine if I wanted to index
> per-document. Which I do NOT want to do. What I would like to do to index
> per log line.
>
> I need to do a full text search, but with the additional requirement to
> filter those search hits by DateTime range.
>
> I hope this makes it clearer. So any suggestions how to do that?
>
> Sidenote: I saw that Alfresco implemented this analyzer, called
> DateTimeAnalyzer, but Alfresco is not open source. So I was wondering how
> to implement the same. Actually after wondering for 2 days, I became
> convinced that writing an Analyzer should be the way to go. I will post my
> solution later if I have a working code.
>
> 2015-02-10 8:50 GMT+09:00 Barry Coughlan <b....@gmail.com>:
>
>> Hi Gergely,
>>
>> Writing an analyzer would work but it is unnecessarily complicated. You
>> could just parse the date from the string in your input code and index it
>> in the LongField like this:
>>
>> SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd
>> HH:mm:ss.S'Z'");
>> format.setTimeZone(TimeZone.getTimeZone("UTC"));
>> long t = format.parse("2015-02-08 00:02:06.123Z INFO...").getTime();
>>
>> Barry
>>
>> On Tue, Feb 10, 2015 at 12:21 AM, Gergely Nagy <fo...@gmail.com> wrote:
>>
>> > Thank you for taking your time to respond Karthik,
>> >
>> > Can you show me an example how to convert DateTime to milliseconds? I
>> mean
>> > how can I pipe my converter logic into the indexing process?
>> >
>> > I suspect I need to write my own Analyzer/Tokenizer to achieve this. Is
>> > this correct?
>> >
>> > 2015-02-09 22:58 GMT+09:00 KARTHIK SHIVAKUMAR <ns...@gmail.com>:
>> >
>> > > Hi
>> > >
>> > > Long time ago,.. I used to store datetime in millisecond .
>> > >
>> > > TermRangequery used to work in perfect condition....
>> > >
>> > > Convert all datetime to millisecond and index the same.
>> > >
>> > > On search condition again convert datetime to millisecond and use
>> > > TermRangequery.
>> > >
>> > > With regards
>> > > Karthik
>> > > On Feb 9, 2015 1:24 PM, "Gergely Nagy" <fo...@gmail.com> wrote:
>> > >
>> > > > Hi Lucene users,
>> > > >
>> > > > I am in the beginning of implementing a Lucene application which
>> would
>> > > > supposedly search through some log files.
>> > > >
>> > > > One of the requirements is to return results between a time range.
>> > Let's
>> > > > say these are two lines in a series of log files:
>> > > > 2015-02-08 00:02:06.852Z INFO...
>> > > > ...
>> > > > 2015-02-08 18:02:04.012Z INFO...
>> > > >
>> > > > Now I need to search for these lines and return all the text
>> > in-between.
>> > > I
>> > > > was using this demo application to build an index:
>> > > >
>> > > >
>> > >
>> >
>> http://lucene.apache.org/core/4_10_3/demo/src-html/org/apache/lucene/demo/IndexFiles.html
>> > > >
>> > > > After that my first thought was using a term range query like this:
>> > > >         TermRangeQuery query =
>> > TermRangeQuery.newStringRange("contents",
>> > > > "2015-02-08 00:02:06.852Z", "2015-02-08 18:02:04.012Z", true, true);
>> > > >
>> > > > But for some reason this didn't return any results.
>> > > >
>> > > > Then I was Googling for a while how to solve this problem, but all
>> the
>> > > > datetime examples I found are searching based on a much simpler
>> field.
>> > > > Those examples usually use a field like this:
>> > > > doc.add(new LongField("modified", file.lastModified(),
>> Field.Store.NO
>> > ));
>> > > >
>> > > > So I was wondering, how can I index these log files to make a range
>> > query
>> > > > work on them? Any ideas? Maybe my approach is completely wrong. I am
>> > > still
>> > > > new to Lucene so any help is appreciated.
>> > > >
>> > > > Thank you.
>> > > >
>> > > > Gergely Nagy
>> > > >
>> > >
>> >
>>
>
>

Re: Indexing and searching a DateTime range

Posted by Gergely Nagy <fo...@gmail.com>.
Thank you Barry, I really appreciate your time to respond,

Let me clarify this a little bit more. I think it was not clear.

I know how to parse dates, this is not the question here. (See my previous
email: "how can I pipe my converter logic into the indexing process?")

All of your solutions would work fine if I wanted to index per document,
which I do NOT want to do. What I would like to do is index per log line.

I need to do a full text search, but with the additional requirement to
filter those search hits by DateTime range.

I hope this makes it clearer. So any suggestions how to do that?

Sidenote: I saw that Alfresco implemented this analyzer, called
DateTimeAnalyzer, but Alfresco is not open source. So I was wondering how
to implement the same. Actually, after wondering about it for 2 days, I became
convinced that writing an Analyzer should be the way to go. I will post my
solution later once I have working code.

2015-02-10 8:50 GMT+09:00 Barry Coughlan <b....@gmail.com>:

> Hi Gergely,
>
> Writing an analyzer would work but it is unnecessarily complicated. You
> could just parse the date from the string in your input code and index it
> in the LongField like this:
>
> SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.S'Z'");
> format.setTimeZone(TimeZone.getTimeZone("UTC"));
> long t = format.parse("2015-02-08 00:02:06.123Z INFO...").getTime();
>
> Barry
>
> On Tue, Feb 10, 2015 at 12:21 AM, Gergely Nagy <fo...@gmail.com> wrote:
>
> > Thank you for taking your time to respond Karthik,
> >
> > Can you show me an example how to convert DateTime to milliseconds? I
> mean
> > how can I pipe my converter logic into the indexing process?
> >
> > I suspect I need to write my own Analyzer/Tokenizer to achieve this. Is
> > this correct?
> >
> > 2015-02-09 22:58 GMT+09:00 KARTHIK SHIVAKUMAR <ns...@gmail.com>:
> >
> > > Hi
> > >
> > > Long time ago,.. I used to store datetime in millisecond .
> > >
> > > TermRangequery used to work in perfect condition....
> > >
> > > Convert all datetime to millisecond and index the same.
> > >
> > > On search condition again convert datetime to millisecond and use
> > > TermRangequery.
> > >
> > > With regards
> > > Karthik
> > > On Feb 9, 2015 1:24 PM, "Gergely Nagy" <fo...@gmail.com> wrote:
> > >
> > > > Hi Lucene users,
> > > >
> > > > I am in the beginning of implementing a Lucene application which
> would
> > > > supposedly search through some log files.
> > > >
> > > > One of the requirements is to return results between a time range.
> > Let's
> > > > say these are two lines in a series of log files:
> > > > 2015-02-08 00:02:06.852Z INFO...
> > > > ...
> > > > 2015-02-08 18:02:04.012Z INFO...
> > > >
> > > > Now I need to search for these lines and return all the text
> > in-between.
> > > I
> > > > was using this demo application to build an index:
> > > >
> > > >
> > >
> >
> http://lucene.apache.org/core/4_10_3/demo/src-html/org/apache/lucene/demo/IndexFiles.html
> > > >
> > > > After that my first thought was using a term range query like this:
> > > >         TermRangeQuery query =
> > TermRangeQuery.newStringRange("contents",
> > > > "2015-02-08 00:02:06.852Z", "2015-02-08 18:02:04.012Z", true, true);
> > > >
> > > > But for some reason this didn't return any results.
> > > >
> > > > Then I was Googling for a while how to solve this problem, but all
> the
> > > > datetime examples I found are searching based on a much simpler
> field.
> > > > Those examples usually use a field like this:
> > > > doc.add(new LongField("modified", file.lastModified(),
> Field.Store.NO
> > ));
> > > >
> > > > So I was wondering, how can I index these log files to make a range
> > query
> > > > work on them? Any ideas? Maybe my approach is completely wrong. I am
> > > still
> > > > new to Lucene so any help is appreciated.
> > > >
> > > > Thank you.
> > > >
> > > > Gergely Nagy
> > > >
> > >
> >
>

Re: Indexing and searching a DateTime range

Posted by Barry Coughlan <b....@gmail.com>.
Hi Gergely,

Writing an analyzer would work but it is unnecessarily complicated. You
could just parse the date from the string in your input code and index it
in a LongField like this:

SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.S'Z'");
format.setTimeZone(TimeZone.getTimeZone("UTC"));
long t = format.parse("2015-02-08 00:02:06.123Z INFO...").getTime();
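
The parsed value would then go into the LongField mentioned above, for example
(the field name "timestamp" and the Document instance "doc" are assumptions):

doc.add(new LongField("timestamp", t, Field.Store.NO));

At search time that field can be matched with NumericRangeQuery.newLongRange("timestamp",
fromMillis, toMillis, true, true). Note that DateFormat.parse(String) does not need to
consume the whole input, which is why passing the entire log line works here.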

Barry

On Tue, Feb 10, 2015 at 12:21 AM, Gergely Nagy <fo...@gmail.com> wrote:

> Thank you for taking your time to respond Karthik,
>
> Can you show me an example how to convert DateTime to milliseconds? I mean
> how can I pipe my converter logic into the indexing process?
>
> I suspect I need to write my own Analyzer/Tokenizer to achieve this. Is
> this correct?
>
> 2015-02-09 22:58 GMT+09:00 KARTHIK SHIVAKUMAR <ns...@gmail.com>:
>
> > Hi
> >
> > Long time ago,.. I used to store datetime in millisecond .
> >
> > TermRangequery used to work in perfect condition....
> >
> > Convert all datetime to millisecond and index the same.
> >
> > On search condition again convert datetime to millisecond and use
> > TermRangequery.
> >
> > With regards
> > Karthik
> > On Feb 9, 2015 1:24 PM, "Gergely Nagy" <fo...@gmail.com> wrote:
> >
> > > Hi Lucene users,
> > >
> > > I am in the beginning of implementing a Lucene application which would
> > > supposedly search through some log files.
> > >
> > > One of the requirements is to return results between a time range.
> Let's
> > > say these are two lines in a series of log files:
> > > 2015-02-08 00:02:06.852Z INFO...
> > > ...
> > > 2015-02-08 18:02:04.012Z INFO...
> > >
> > > Now I need to search for these lines and return all the text
> in-between.
> > I
> > > was using this demo application to build an index:
> > >
> > >
> >
> http://lucene.apache.org/core/4_10_3/demo/src-html/org/apache/lucene/demo/IndexFiles.html
> > >
> > > After that my first thought was using a term range query like this:
> > >         TermRangeQuery query =
> TermRangeQuery.newStringRange("contents",
> > > "2015-02-08 00:02:06.852Z", "2015-02-08 18:02:04.012Z", true, true);
> > >
> > > But for some reason this didn't return any results.
> > >
> > > Then I was Googling for a while how to solve this problem, but all the
> > > datetime examples I found are searching based on a much simpler field.
> > > Those examples usually use a field like this:
> > > doc.add(new LongField("modified", file.lastModified(), Field.Store.NO
> ));
> > >
> > > So I was wondering, how can I index these log files to make a range
> query
> > > work on them? Any ideas? Maybe my approach is completely wrong. I am
> > still
> > > new to Lucene so any help is appreciated.
> > >
> > > Thank you.
> > >
> > > Gergely Nagy
> > >
> >
>

Re: Indexing and searching a DateTime range

Posted by Gergely Nagy <fo...@gmail.com>.
Thank you for taking your time to respond Karthik,

Can you show me an example of how to convert a DateTime to milliseconds? I
mean, how can I pipe my converter logic into the indexing process?

I suspect I need to write my own Analyzer/Tokenizer to achieve this. Is
this correct?

2015-02-09 22:58 GMT+09:00 KARTHIK SHIVAKUMAR <ns...@gmail.com>:

> Hi
>
> Long time ago,.. I used to store datetime in millisecond .
>
> TermRangequery used to work in perfect condition....
>
> Convert all datetime to millisecond and index the same.
>
> On search condition again convert datetime to millisecond and use
> TermRangequery.
>
> With regards
> Karthik
> On Feb 9, 2015 1:24 PM, "Gergely Nagy" <fo...@gmail.com> wrote:
>
> > Hi Lucene users,
> >
> > I am in the beginning of implementing a Lucene application which would
> > supposedly search through some log files.
> >
> > One of the requirements is to return results between a time range. Let's
> > say these are two lines in a series of log files:
> > 2015-02-08 00:02:06.852Z INFO...
> > ...
> > 2015-02-08 18:02:04.012Z INFO...
> >
> > Now I need to search for these lines and return all the text in-between.
> I
> > was using this demo application to build an index:
> >
> >
> http://lucene.apache.org/core/4_10_3/demo/src-html/org/apache/lucene/demo/IndexFiles.html
> >
> > After that my first thought was using a term range query like this:
> >         TermRangeQuery query = TermRangeQuery.newStringRange("contents",
> > "2015-02-08 00:02:06.852Z", "2015-02-08 18:02:04.012Z", true, true);
> >
> > But for some reason this didn't return any results.
> >
> > Then I was Googling for a while how to solve this problem, but all the
> > datetime examples I found are searching based on a much simpler field.
> > Those examples usually use a field like this:
> > doc.add(new LongField("modified", file.lastModified(), Field.Store.NO));
> >
> > So I was wondering, how can I index these log files to make a range query
> > work on them? Any ideas? Maybe my approach is completely wrong. I am
> still
> > new to Lucene so any help is appreciated.
> >
> > Thank you.
> >
> > Gergely Nagy
> >
>

Re: Indexing and searching a DateTime range

Posted by KARTHIK SHIVAKUMAR <ns...@gmail.com>.
Hi

Long time ago I used to store datetimes in milliseconds.

TermRangeQuery used to work perfectly for this.

Convert all datetimes to milliseconds and index those values.

At search time, again convert the datetimes to milliseconds and use
TermRangeQuery.
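
For illustration (assuming a Document doc and long values timestampMillis,
fromMillis and toMillis are in scope; the field name "ts" is made up): with a
plain string field, the lexicographic TermRangeQuery ordering only matches the
numeric ordering if the millisecond values are indexed as fixed-width,
zero-padded terms, roughly like this:

// index side: epoch millis as a fixed-width, zero-padded term
String padded = String.format("%019d", timestampMillis);
doc.add(new StringField("ts", padded, Field.Store.NO));

// search side: pad the range bounds the same way before building the query
Query q = TermRangeQuery.newStringRange("ts",
        String.format("%019d", fromMillis),
        String.format("%019d", toMillis),
        true, true);

On Lucene 4.x a LongField plus NumericRangeQuery, as suggested elsewhere in
this thread, gives the same result and is much faster.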

With regards
Karthik
On Feb 9, 2015 1:24 PM, "Gergely Nagy" <fo...@gmail.com> wrote:

> Hi Lucene users,
>
> I am in the beginning of implementing a Lucene application which would
> supposedly search through some log files.
>
> One of the requirements is to return results between a time range. Let's
> say these are two lines in a series of log files:
> 2015-02-08 00:02:06.852Z INFO...
> ...
> 2015-02-08 18:02:04.012Z INFO...
>
> Now I need to search for these lines and return all the text in-between. I
> was using this demo application to build an index:
>
> http://lucene.apache.org/core/4_10_3/demo/src-html/org/apache/lucene/demo/IndexFiles.html
>
> After that my first thought was using a term range query like this:
>         TermRangeQuery query = TermRangeQuery.newStringRange("contents",
> "2015-02-08 00:02:06.852Z", "2015-02-08 18:02:04.012Z", true, true);
>
> But for some reason this didn't return any results.
>
> Then I was Googling for a while how to solve this problem, but all the
> datetime examples I found are searching based on a much simpler field.
> Those examples usually use a field like this:
> doc.add(new LongField("modified", file.lastModified(), Field.Store.NO));
>
> So I was wondering, how can I index these log files to make a range query
> work on them? Any ideas? Maybe my approach is completely wrong. I am still
> new to Lucene so any help is appreciated.
>
> Thank you.
>
> Gergely Nagy
>

Re: Indexing and searching a DateTime range

Posted by Gergely Nagy <fo...@gmail.com>.
Thank you for the great answer Uwe!

Sadly my department rejected the above combination of Logstash +
Elasticsearch. In their experience, Elasticsearch works fine on about 3 days
of log data, but slows down terribly at something on the order of 3 months of
data.

But I will take a look at Logstash anyway. After skimming through the
Logstash documentation I can see that there are so-called Logstash "outputs":

What do you think, is it possible to use Logstash as a preprocessor which
outputs the filtered logs and feeds them into my Lucene app?

Or, if that's not a good idea, can you elaborate on how I can do this
preprocessing you are referring to? Do you mean implementing an Analyzer
like one of these?
https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/package-summary.html

Thank you,
Gergely Nagy

2015-02-09 17:10 GMT+09:00 Uwe Schindler <uw...@thetaphi.de>:

> Hi,
>
> > I am in the beginning of implementing a Lucene application which would
> > supposedly search through some log files.
> >
> > One of the requirements is to return results between a time range. Let's
> say
> > these are two lines in a series of log files:
> > 2015-02-08 00:02:06.852Z INFO...
> > ...
> > 2015-02-08 18:02:04.012Z INFO...
> >
> > Now I need to search for these lines and return all the text in-between.
> I was
> > using this demo application to build an index:
> > http://lucene.apache.org/core/4_10_3/demo/src-
> > html/org/apache/lucene/demo/IndexFiles.html
> >
> > After that my first thought was using a term range query like this:
> >         TermRangeQuery query =
> > TermRangeQuery.newStringRange("contents",
> > "2015-02-08 00:02:06.852Z", "2015-02-08 18:02:04.012Z", true, true);
> >
> > But for some reason this didn't return any results.
>
> Lucene tokenizes the text, so you can search for terms ("words"). Those
> dates are splitted into several terms. In general, this is not the way to
> search on numeric / date range:
> - it is horribly slow, because there are many terms in that "content"
> field.
>
> > Then I was Googling for a while how to solve this problem, but all the
> > datetime examples I found are searching based on a much simpler field.
> > Those examples usually use a field like this:
> > doc.add(new LongField("modified", file.lastModified(), Field.Store.NO));
>
> That is the way to do it. Log files are "structured", so you need to do
> preprocessing. You have to put the different information into different
> fields (like the "modified" field, as mentioned in your example). You can
> still fill the "contents" field as you did above with all information to do
> plain fulltext search (like finding a log line based on some message
> contents), but in addition, you use other fields for more specific searches
> like ranges. In Lucene you generally fill several fields with the redundant
> information (like dates in fulltext field and some extra timestamp field).
>
> The information you return to the user can be put into a "stored" only
> field. This one is returned with search results.
>
> > So I was wondering, how can I index these log files to make a range query
> > work on them? Any ideas? Maybe my approach is completely wrong. I am
> > still new to Lucene so any help is appreciated.
>
> The first aproach is wrong, the second approach is right. You just have to
> make your field definitions correct.
>
> An alternative would be to use Logstash in combination with Elasticsearch,
> which is based on Lucene. This has everything you want to do already
> implemented for log files.
>
> Uwe
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

RE: Indexing and searching a DateTime range

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

> I am in the beginning of implementing a Lucene application which would
> supposedly search through some log files.
> 
> One of the requirements is to return results between a time range. Let's say
> these are two lines in a series of log files:
> 2015-02-08 00:02:06.852Z INFO...
> ...
> 2015-02-08 18:02:04.012Z INFO...
> 
> Now I need to search for these lines and return all the text in-between. I was
> using this demo application to build an index:
> http://lucene.apache.org/core/4_10_3/demo/src-
> html/org/apache/lucene/demo/IndexFiles.html
> 
> After that my first thought was using a term range query like this:
>         TermRangeQuery query =
> TermRangeQuery.newStringRange("contents",
> "2015-02-08 00:02:06.852Z", "2015-02-08 18:02:04.012Z", true, true);
> 
> But for some reason this didn't return any results.

Lucene tokenizes the text, so you can search for terms ("words"). Those dates are split into several terms. In general, this is not the way to search on a numeric / date range:
- it is horribly slow, because there are many terms in that "contents" field.
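
To see this in action, a small sketch like the following prints the terms that
the demo's StandardAnalyzer actually produces for one log line (the field name
"contents" matches your example; the class name and the sample line are just
for illustration):

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ShowTerms {
    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_10_3);
        TokenStream ts = analyzer.tokenStream("contents",
                new StringReader("2015-02-08 00:02:06.852Z INFO something happened"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            // the timestamp is broken up into several small terms, so no single
            // indexed term ever equals "2015-02-08 00:02:06.852Z"
            System.out.println(term.toString());
        }
        ts.end();
        ts.close();
        analyzer.close();
    }
}

That is why the TermRangeQuery on "contents" never matches the full timestamp string.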

> Then I was Googling for a while how to solve this problem, but all the
> datetime examples I found are searching based on a much simpler field.
> Those examples usually use a field like this:
> doc.add(new LongField("modified", file.lastModified(), Field.Store.NO));

That is the way to do it. Log files are "structured", so you need to do preprocessing. You have to put the different pieces of information into different fields (like the "modified" field mentioned in your example). You can still fill the "contents" field as you did above with all the information, to do plain fulltext search (like finding a log line based on some message contents), but in addition you use other fields for more specific searches like ranges. In Lucene you generally fill several fields with redundant information (like dates both in the fulltext field and in an extra timestamp field).

The information you return to the user can be put into a stored-only field. This one is returned with the search results.
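
To make that concrete, a minimal sketch of the search side on the Lucene 4.10
API could look like this. It assumes the extra fields described above exist: a
numeric "timestamp" LongField holding epoch milliseconds and a stored "line"
field holding the whole log line; the two long literals are simply the
millisecond values of the two timestamps from your example.

import java.io.File;
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;

public class SearchByTextAndTime {
    public static void main(String[] args) throws IOException {
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(new File("index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);

            BooleanQuery query = new BooleanQuery();
            // full-text condition; the term must be in its analyzed (lowercased) form
            query.add(new TermQuery(new Term("contents", "error")), BooleanClause.Occur.MUST);
            // date range condition on the numeric timestamp field (epoch millis, UTC)
            query.add(NumericRangeQuery.newLongRange("timestamp",
                    1423353726852L,   // 2015-02-08 00:02:06.852Z
                    1423418524012L,   // 2015-02-08 18:02:04.012Z
                    true, true), BooleanClause.Occur.MUST);

            for (ScoreDoc hit : searcher.search(query, 100).scoreDocs) {
                Document doc = searcher.doc(hit.doc);
                System.out.println(doc.get("line")); // the stored copy of the whole log line
            }
        }
    }
}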

> So I was wondering, how can I index these log files to make a range query
> work on them? Any ideas? Maybe my approach is completely wrong. I am
> still new to Lucene so any help is appreciated.

The first approach is wrong, the second approach is right. You just have to make your field definitions correct.

An alternative would be to use Logstash in combination with Elasticsearch, which is based on Lucene. This has everything you want to do already implemented for log files.

Uwe


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org