You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Altimatic <ch...@gmail.com> on 2010/01/15 14:35:03 UTC

Finding frequency of regex query match in a field

Hi All, 

I have an application that has to count the frequency that a specific
regular expression is matched on a particular field for each document in an
indexed directory.

For example. 

Lets say I have 2 documents in the directory and each document has 3 fields,
"table", "column" and "data".

Example Doc(s):
//***************************************************************
Document doc1 = new Document();
doc1.add(new Field("table", "EMPLOYEE_US", Field.Store.NO,
Field.Index.ANALYZED);
doc1.add(new Field("column", "F_NAME", Field.Store.NO,
Field.Index.ANALYZED);
doc1.add(new Field("data", "Chris Hank Tony Cody Tom Tina Crystal",
Field.Store.NO, Field.Index.ANALYZED,
Field.TermVector.WITH_POSITIONS_OFFSETS);

Document doc2 = new Document();
doc2.add(new Field("table", "EMPLOYEE_CA", Field.Store.NO,
Field.Index.ANALYZED);
doc2.add(new Field("column", "F_NAME", Field.Store.NO,
Field.Index.ANALYZED);
doc2.add(new Field("data", "Bob Billy Tom Toby Charles Krista Madonna",
Field.Store.NO, Field.Index.ANALYZED,
Field.TermVector.WITH_POSITIONS_OFFSETS);

//I know I can  create a query to search for a regular expression and that
will return each
//document that contains a match.

IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(),
true, 
                                          
IndexWriter.MaxFieldLength.LIMITED);
writer.addDocument(doc);
writer.optimize();
writer.close();
searcher = new IndexSearcher(directory);

RegexQuery query = new RegexQuery( newTerm("data", "^T.*)); 
ScoreDoc[] hits = searcher.search(query, null,
maxNumOfHits).scoreDocs;//grab the score docs and go through them to find
the documents that contain a match

//*****************************************************


The code above will tell me that both doc1 and doc2 contain a match for the
constructed query. 

However I need to know how many times the regular expression was matched in
each document. ie.

doc1 = 3
doc2 = 2

I hope I am being clear...and thanks in advance.


Cheers

-- 
View this message in context: http://old.nabble.com/Finding-frequency-of-regex-query-match-in-a-field-tp27175303p27175303.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Finding frequency of regex query match in a field

Posted by Altimatic <ch...@gmail.com>.
I think so. I'll try to write up a quick demo app to see if it will work for
what I require.

Thanks for the prompt reply.


Simon Willnauer wrote:
> 
> One way to do it is to use the RegexTermEnum and iterate through your
> terms manually.
> 
> like the following pseudo code:
> te = RegexTermEnum(reader, Term("^T.*"), regexpCapabilities)
> while( te.next() ):
>   t = te.term()
>   td = reader.termDocs(t)
>   while(td.next()):
>     freqOfTermInCurrentDoc = td.freq()
>     doc = td.doc()
>     #...do something with it
> 
> does that make sense to you?
> 
> simon
> 
> 
> On Fri, Jan 15, 2010 at 2:35 PM, Altimatic <ch...@gmail.com>
> wrote:
>>
>> Hi All,
>>
>> I have an application that has to count the frequency that a specific
>> regular expression is matched on a particular field for each document in
>> an
>> indexed directory.
>>
>> For example.
>>
>> Lets say I have 2 documents in the directory and each document has 3
>> fields,
>> "table", "column" and "data".
>>
>> Example Doc(s):
>> //***************************************************************
>> Document doc1 = new Document();
>> doc1.add(new Field("table", "EMPLOYEE_US", Field.Store.NO,
>> Field.Index.ANALYZED);
>> doc1.add(new Field("column", "F_NAME", Field.Store.NO,
>> Field.Index.ANALYZED);
>> doc1.add(new Field("data", "Chris Hank Tony Cody Tom Tina Crystal",
>> Field.Store.NO, Field.Index.ANALYZED,
>> Field.TermVector.WITH_POSITIONS_OFFSETS);
>>
>> Document doc2 = new Document();
>> doc2.add(new Field("table", "EMPLOYEE_CA", Field.Store.NO,
>> Field.Index.ANALYZED);
>> doc2.add(new Field("column", "F_NAME", Field.Store.NO,
>> Field.Index.ANALYZED);
>> doc2.add(new Field("data", "Bob Billy Tom Toby Charles Krista Madonna",
>> Field.Store.NO, Field.Index.ANALYZED,
>> Field.TermVector.WITH_POSITIONS_OFFSETS);
>>
>> //I know I can  create a query to search for a regular expression and
>> that
>> will return each
>> //document that contains a match.
>>
>> IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(),
>> true,
>>
>> IndexWriter.MaxFieldLength.LIMITED);
>> writer.addDocument(doc);
>> writer.optimize();
>> writer.close();
>> searcher = new IndexSearcher(directory);
>>
>> RegexQuery query = new RegexQuery( newTerm("data", "^T.*));
>> ScoreDoc[] hits = searcher.search(query, null,
>> maxNumOfHits).scoreDocs;//grab the score docs and go through them to find
>> the documents that contain a match
>>
>> //*****************************************************
>>
>>
>> The code above will tell me that both doc1 and doc2 contain a match for
>> the
>> constructed query.
>>
>> However I need to know how many times the regular expression was matched
>> in
>> each document. ie.
>>
>> doc1 = 3
>> doc2 = 2
>>
>> I hope I am being clear...and thanks in advance.
>>
>>
>> Cheers
>>
>> --
>> View this message in context:
>> http://old.nabble.com/Finding-frequency-of-regex-query-match-in-a-field-tp27175303p27175303.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> 

-- 
View this message in context: http://old.nabble.com/Re%3A-Finding-frequency-of-regex-query-match-in-a-field-tp27177425p27178763.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Finding frequency of regex query match in a field

Posted by Simon Willnauer <si...@googlemail.com>.
One way to do it is to use the RegexTermEnum and iterate through your
terms manually.

like the following pseudo code:
te = RegexTermEnum(reader, Term("^T.*"), regexpCapabilities)
while( te.next() ):
  t = te.term()
  td = reader.termDocs(t)
  while(td.next()):
    freqOfTermInCurrentDoc = td.freq()
    doc = td.doc()
    #...do something with it

does that make sense to you?

simon


On Fri, Jan 15, 2010 at 2:35 PM, Altimatic <ch...@gmail.com> wrote:
>
> Hi All,
>
> I have an application that has to count the frequency that a specific
> regular expression is matched on a particular field for each document in an
> indexed directory.
>
> For example.
>
> Lets say I have 2 documents in the directory and each document has 3 fields,
> "table", "column" and "data".
>
> Example Doc(s):
> //***************************************************************
> Document doc1 = new Document();
> doc1.add(new Field("table", "EMPLOYEE_US", Field.Store.NO,
> Field.Index.ANALYZED);
> doc1.add(new Field("column", "F_NAME", Field.Store.NO,
> Field.Index.ANALYZED);
> doc1.add(new Field("data", "Chris Hank Tony Cody Tom Tina Crystal",
> Field.Store.NO, Field.Index.ANALYZED,
> Field.TermVector.WITH_POSITIONS_OFFSETS);
>
> Document doc2 = new Document();
> doc2.add(new Field("table", "EMPLOYEE_CA", Field.Store.NO,
> Field.Index.ANALYZED);
> doc2.add(new Field("column", "F_NAME", Field.Store.NO,
> Field.Index.ANALYZED);
> doc2.add(new Field("data", "Bob Billy Tom Toby Charles Krista Madonna",
> Field.Store.NO, Field.Index.ANALYZED,
> Field.TermVector.WITH_POSITIONS_OFFSETS);
>
> //I know I can  create a query to search for a regular expression and that
> will return each
> //document that contains a match.
>
> IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(),
> true,
>
> IndexWriter.MaxFieldLength.LIMITED);
> writer.addDocument(doc);
> writer.optimize();
> writer.close();
> searcher = new IndexSearcher(directory);
>
> RegexQuery query = new RegexQuery( newTerm("data", "^T.*));
> ScoreDoc[] hits = searcher.search(query, null,
> maxNumOfHits).scoreDocs;//grab the score docs and go through them to find
> the documents that contain a match
>
> //*****************************************************
>
>
> The code above will tell me that both doc1 and doc2 contain a match for the
> constructed query.
>
> However I need to know how many times the regular expression was matched in
> each document. ie.
>
> doc1 = 3
> doc2 = 2
>
> I hope I am being clear...and thanks in advance.
>
>
> Cheers
>
> --
> View this message in context: http://old.nabble.com/Finding-frequency-of-regex-query-match-in-a-field-tp27175303p27175303.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Finding frequency of regex query match in a field

Posted by Altimatic <ch...@gmail.com>.
I forgot to mention that I am using Lucene 3.0.0. 
-- 
View this message in context: http://old.nabble.com/Finding-frequency-of-regex-query-match-in-a-field-tp27175303p27175915.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org