Posted to java-user@lucene.apache.org by Dave Golombek <da...@blackducksoftware.com> on 2007/10/23 16:52:35 UTC

Making Highlighter.mergeContiguousFragments() public

I was wondering if people thought that making
Highlighter.mergeContiguousFragments() public (and non-final) would be
acceptable. In my application, I want to strip all fragments with score == 0
before merging the fragments (to get the minimal matching section, but still
in order), and the easiest way to do so would be to override that method.
Not a big deal, but I thought other people might find it useful. Making the
method public static would also achieve the same result, allowing me to call
the function separately.
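The filtering step described here can be sketched in plain Java. The `Frag` class below is a hypothetical stand-in for a highlighter fragment, not Lucene's actual `TextFragment`; the point is only the strip-zero-scores-in-order idea:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a highlighter fragment (not Lucene's TextFragment).
class Frag {
    final String text;
    final float score;
    Frag(String text, float score) { this.text = text; this.score = score; }
}

class FragmentFilter {
    // Drop fragments that scored zero, preserving the original order,
    // before any contiguous-merge step runs.
    static List<Frag> stripZeroScore(List<Frag> frags) {
        List<Frag> kept = new ArrayList<Frag>();
        for (Frag f : frags) {
            if (f.score > 0f) {
                kept.add(f);
            }
        }
        return kept;
    }
}
```

With `mergeContiguousFragments()` public (or overridable), this kind of filter could run just before the merge.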

Thanks,
Dave Golombek
Black Duck Software, Inc.
http://www.blackducksoftware.com 



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Making Highlighter.mergeContiguousFragments() public

Posted by Mark Miller <ma...@gmail.com>.
Ahhh....the reason the second Shapes is not highlighted is that the 
Highlighter highlights based on what caused the hit in Lucene...and 
Lucene does not look for every shape within 4 paragraphs of 
distribution...after it finds one such occurrence it says "sweet, a 
match" and moves on...it does not look for another such match. If there 
were another occurrence of distribution on the other side of the last 
Shapes, then it would cause its own match. This may be because of how I 
have implemented my WithinSpanQuery or it may be how Span queries in 
Lucene work in general. I will investigate and let you know. I am 
thinking that Lucene just quits after finding the span it's looking 
for...it says ok I found distribution, now find shape within 4 
paragraphs after distribution...okay found it...I don't think it then 
says ok...is there another shape after that shape within 4...and then 
another and another?

I'll look into it further. Certainly, though, other than this possible 
oddity...everything is working.

- Mark

Dave Golombek wrote:
> I was wondering if people thought that making
> Highlighter.mergeContiguousFragments() public (and non-final) would be
> acceptable. In my application, I want to strip all fragments with score == 0
> before merging the fragments (to get the minimal matching section, but still
> in order), and the easiest way to do so would be to override that method.
> Not a big deal, but I thought other people might find it useful. Making the
> method public static would also achieve the same result, allowing me to call
> the function separately.
>
> Thanks,
> Dave Golombek
> Black Duck Software, Inc.
> http://www.blackducksoftware.com 
>
>
>



RE: Hits.score mystery

Posted by Tom Conlon <to...@2ls.com>.
Hi Grant,

> but you should have a look at Searcher.explain() 

I was half-expecting this answer. :(

The query is very basic and the scoring seems completely arbitrary.
Documents with the same number of occurrences and (seemingly)
distribution are being given widely different scores. 

> Chris Hostetter
> NOTE: the score returned by Hits is not a "percentage" ... 
> a score of 0.9 from one query isn't better than a score of 0.1 from
> another query.

Thanks for re-emphasizing these points (I was aware of them in any
event).

Tom

-----Original Message-----
From: Grant Ingersoll [mailto:gsingers@apache.org] 
Sent: 31 October 2007 19:17
To: java-user@lucene.apache.org
Subject: Re: Hits.score mystery

Not sure what UI you are referring to, but you should have a look at
Searcher.explain(), which gives you information about why a particular
document scored the way it did.

-Grant

On Oct 31, 2007, at 2:14 PM, Tom Conlon wrote:

> Hi All,
>
> Query:	systems AND 2000
> Results: 	558 total matching documents
>
> I'm returning the document plus hits.score(i) * 100 but when the 
> relevance is examined in the User interface it doesn't seem to be 
> working.
>
>
> E.g.  'rough' feedback in terms of occurrences
>
> 61.txt	18.356403	100%	(13 occurrences)
> 119.txt	17.865013	 97% 	(13 occurrences)
> ...
> 45.txt	8.600986	 47%  (18 occurrences)
> ...
> 8.rtf		2.7724645	 15%  (10 occurrences)
>
> Is there something else I need to do or am missing?
>
> Thanks,
> Tom
>
>
>

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Boot Camp Training:
ApacheCon Atlanta, Nov. 12, 2007.  Sign up now!
http://www.apachecon.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ







Re: Hits.score mystery

Posted by Grant Ingersoll <gs...@apache.org>.
Not sure what UI you are referring to, but you should have a look at  
Searcher.explain(), which gives you information about why a particular  
document scored the way it did.

-Grant

On Oct 31, 2007, at 2:14 PM, Tom Conlon wrote:

> Hi All,
>
> Query:	systems AND 2000
> Results: 	558 total matching documents
>
> I'm returning the document plus hits.score(i) * 100 but when the
> relevance is examined in the User interface it doesn't seem to be
> working.
>
>
> E.g.  'rough' feedback in terms of occurrences
>
> 61.txt	18.356403	100%	(13 occurrences)
> 119.txt	17.865013	 97% 	(13 occurrences)
> ...
> 45.txt	8.600986	 47%  (18 occurrences)
> ...
> 8.rtf		2.7724645	 15%  (10 occurrences)
>
> Is there something else I need to do or am missing?
>
> Thanks,
> Tom
>
>
>

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Boot Camp Training:
ApacheCon Atlanta, Nov. 12, 2007.  Sign up now!  http://www.apachecon.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ





Re: Hits.score mystery

Posted by Erick Erickson <er...@gmail.com>.
Well, you might have to pre-process your strings before you
give them to an analyzer. Or roll your own analyzer.

What you're asking for, in effect, is an analyzer "that does
exactly what I want it to, nothing more and nothing less". But
the problem is that there is nothing general about what you want.
That is, leaving in # and ++ is completely arbitrary so I don't
think there are any canned analyzers out there that'll do what you
want.

But it's pretty simple to write a regular expression that'll remove
(actually, replace with spaces) anything you want. So I'd
think about that approach and then feed your lowercase/whitespace
analyzer the results.
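That pre-processing step can be sketched with a single `replaceAll`. The character class below is an assumption (keep letters, digits, whitespace, `#` and `+`); adjust it to whatever set of symbols you actually need to preserve:

```java
// Replace every character that is not a letter, digit, whitespace,
// '#' or '+' with a space, then hand the result to a
// lowercase/whitespace analyzer.
class PreProcessor {
    static String clean(String raw) {
        // \p{L} = any letter, \p{N} = any digit, \s = whitespace;
        // '#' and '+' are kept so "C#" and "C++" survive intact.
        return raw.replaceAll("[^\\p{L}\\p{N}\\s#+]", " ");
    }
}
```

Trailing punctuation such as the comma in "system," becomes a space, so the token indexes and matches as plain "system".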

Best
Erick

On 11/1/07, Tom Conlon <to...@2ls.com> wrote:
>
> The reason seems to be that I needed to implement an analyser that
> lowercases terms while *not* ignoring trailing characters such as # and +
> (i.e. I needed to match C# and C++).
>
> public final class LowercaseWhitespaceAnalyzer extends Analyzer
> {
>   public TokenStream tokenStream(String fieldName, Reader reader) {
>     return new LowercaseWhitespaceTokenizer(reader);
>   }
> }
>
> The problem now is that "system," etc. is not matched against "system".
>
> Can anyone point to an example of a combination of analyser/tokeniser (or
> other method) that gets around this please?
>
> Thanks,
> Tom
>
>
> -----Original Message-----
> From: Tom Conlon [mailto:tomc@2ls.com]
> Sent: 01 November 2007 09:18
> To: java-user@lucene.apache.org
> Subject: RE: Hits.score mystery
>
> Thanks Daniel,
>
> I'm using Searcher.explain() & luke to try to understand the reasons for
> the score.
>
> -----Original Message-----
> From: Daniel Naber [mailto:lucenelist2007@danielnaber.de]
> Sent: 01 November 2007 08:19
> To: java-user@lucene.apache.org
> Subject: Re: Hits.score mystery
>
> On Wednesday 31 October 2007 19:14, Tom Conlon wrote:
>
> 119.txt  17.865013  97%  (13 occurrences)
> 45.txt   8.600986   47%  (18 occurrences)
>
45.txt might be a document with more terms so that its score is lower
> although it contains more matches.
>
> Regards
> Daniel
>
> --
> http://www.danielnaber.de
>

Re: Hits.score mystery

Posted by Mark Miller <ma...@gmail.com>.
One of many options is to copy the StandardAnalyzer but change it so 
that + and # are considered letters.

Just add + and # to the LETTER definition in the JavaCC file if you are 
using a release, or the JFlex file if you are working off trunk (you're 
probably using a release, but the new JFlex tokenizer is much faster).
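The effect of that grammar change can be illustrated in plain Java. This is not the generated tokenizer, only the tokenization rule it would produce: `#` and `+` count as word characters, any other non-alphanumeric character separates tokens, and tokens are lowercased:

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of treating '#' and '+' as letters.
class LetterPlusTokenizer {
    // A character is part of a token if it is a letter, a digit, '#' or '+'.
    static boolean isTokenChar(char c) {
        return Character.isLetterOrDigit(c) || c == '#' || c == '+';
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (isTokenChar(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // Any other character ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

Note this also solves the earlier "system," problem: the trailing comma is a separator, so the token emitted is just "system".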

Tom Conlon wrote:
> The reason seems to be that I needed to implement an analyser that lowercases terms while *not* ignoring trailing characters such as # and + 
> (i.e. I needed to match C# and C++).
>
> public final class LowercaseWhitespaceAnalyzer extends Analyzer 
> {
>   public TokenStream tokenStream(String fieldName, Reader reader) {
>     return new LowercaseWhitespaceTokenizer(reader);
>   }
> } 
>
> The problem now is that "system," etc. is not matched against "system".
>
> Can anyone point to an example of a combination of analyser/tokeniser (or other method) that gets around this please?
>
> Thanks,
> Tom
>
>
> -----Original Message-----
> From: Tom Conlon [mailto:tomc@2ls.com] 
> Sent: 01 November 2007 09:18
> To: java-user@lucene.apache.org
> Subject: RE: Hits.score mystery
>
> Thanks Daniel,
>
> I'm using Searcher.explain() & luke to try to understand the reasons for the score.
>
> -----Original Message-----
> From: Daniel Naber [mailto:lucenelist2007@danielnaber.de]
> Sent: 01 November 2007 08:19
> To: java-user@lucene.apache.org
> Subject: Re: Hits.score mystery
>
> On Wednesday 31 October 2007 19:14, Tom Conlon wrote:
>
>   
>> 119.txt 17.865013        97%    (13 occurrences)
>> 45.txt  8.600986          47%   (18 occurrences)
>
> 45.txt might be a document with more terms so that its score is lower although it contains more matches.
>
> Regards
>  Daniel
>
> --
> http://www.danielnaber.de
>



RE: Hits.score mystery

Posted by Tom Conlon <to...@2ls.com>.
The reason seems to be that I needed to implement an analyser that lowercases terms while *not* ignoring trailing characters such as # and + 
(i.e. I needed to match C# and C++).

public final class LowercaseWhitespaceAnalyzer extends Analyzer 
{
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new LowercaseWhitespaceTokenizer(reader);
  }
} 

The problem now is that "system," etc. is not matched against "system".

Can anyone point to an example of a combination of analyser/tokeniser (or other method) that gets around this please?

Thanks,
Tom


-----Original Message-----
From: Tom Conlon [mailto:tomc@2ls.com] 
Sent: 01 November 2007 09:18
To: java-user@lucene.apache.org
Subject: RE: Hits.score mystery

Thanks Daniel,

I'm using Searcher.explain() & luke to try to understand the reasons for the score.

-----Original Message-----
From: Daniel Naber [mailto:lucenelist2007@danielnaber.de]
Sent: 01 November 2007 08:19
To: java-user@lucene.apache.org
Subject: Re: Hits.score mystery

On Wednesday 31 October 2007 19:14, Tom Conlon wrote:

> 119.txt 17.865013        97%    (13 occurrences)
> 45.txt  8.600986          47%   (18 occurrences)

45.txt might be a document with more terms so that its score is lower although it contains more matches.

Regards
 Daniel

--
http://www.danielnaber.de



RE: Hits.score mystery

Posted by Tom Conlon <to...@2ls.com>.
Thanks Daniel,

I'm using Searcher.explain() & luke to try to understand the reasons for the score.

-----Original Message-----
From: Daniel Naber [mailto:lucenelist2007@danielnaber.de] 
Sent: 01 November 2007 08:19
To: java-user@lucene.apache.org
Subject: Re: Hits.score mystery

On Wednesday 31 October 2007 19:14, Tom Conlon wrote:

> 119.txt 17.865013        97%    (13 occurrences)
> 45.txt  8.600986          47%   (18 occurrences)

45.txt might be a document with more terms so that its score is lower although it contains more matches.

Regards
 Daniel

--
http://www.danielnaber.de



Re: Hits.score mystery

Posted by Daniel Naber <lu...@danielnaber.de>.
On Wednesday 31 October 2007 19:14, Tom Conlon wrote:

> 119.txt 17.865013        97%    (13 occurrences)
> 45.txt  8.600986         47%  (18 occurrences)

45.txt might be a document with more terms so that its score is lower 
although it contains more matches.

Regards
 Daniel

-- 
http://www.danielnaber.de

Re: Hits.score mystery

Posted by Chris Hostetter <ho...@fucit.org>.
: I'm returning the document plus hits.score(i) * 100 but when the

NOTE: the score returned by Hits is not a "percentage" ... it is an 
arbitrary number less than 1.  it might be the "raw score" of the document 
or it might be the result of dividing the "raw score" by the "raw score" 
of the highest scoring document, if the raw score of the highest scoring 
document is greater than 1

(kinda silly huh?)

basically it's just a way to ensure you always have a number less than 1 
-- but a score of 0.9 from one query isn't necessarily better than a 
score of 0.1 from another query.
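The rule described above amounts to simple arithmetic (a sketch only; the real logic lives inside Lucene's Hits class):

```java
// If the top raw score exceeds 1, every returned score is
// rawScore / topRawScore; otherwise the raw score is returned unchanged.
class HitsNorm {
    static float normalizedScore(float rawScore, float topRawScore) {
        return topRawScore > 1.0f ? rawScore / topRawScore : rawScore;
    }
}
```

Multiplying that by 100, as in the original post, yields a "percentage" that is really just a ratio to the top hit of this one query, which is why scores are not comparable across queries.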

PS...

http://people.apache.org/~hossman/#threadhijack
When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/Thread_hijacking


-Hoss




Hits.score mystery

Posted by Tom Conlon <to...@2ls.com>.
Hi All,

Query:	systems AND 2000
Results: 	558 total matching documents

I'm returning the document plus hits.score(i) * 100 but when the
relevance is examined in the user interface it doesn't seem to be
working. 


E.g.  'rough' feedback in terms of occurrences

61.txt	18.356403	100%	(13 occurrences)
119.txt	17.865013	 97% 	(13 occurrences)
...
45.txt	8.600986	 47%  (18 occurrences) 
...
8.rtf		2.7724645	 15%  (10 occurrences)

Is there something else I need to do or am missing?

Thanks,
Tom





Re: Best way to count tokens

Posted by Cool Coder <te...@yahoo.com>.
This works and I can reuse token streams. But why does TokenStream.reset() not work, as in my earlier case? Is it a no-op method in TokenStream without an implementation, which CachingTokenFilter then implements?
   
  - BR


Mark Miller <ma...@gmail.com> wrote:
  reset is optional. StandardAnalyzer does not implement it. Check out 
CachingTokenFilter and wrap StandardAnalyzer in it.

Cool Coder wrote:
> Currently I have extended StandardAnalyzer and I am counting tokens in the following way. But the index is not getting created, though I call tokenStream.reset(). I am not sure whether reset() on the token stream works or not; I am debugging now.
> 
> public TokenStream tokenStream(String fieldName, Reader reader) {
>   TokenStream result = super.tokenStream(fieldName, new HTMLStripReader(reader));
>   // To count tokens and put in a Map
>   analyzeTokens(result);
>   try {
>     result.reset();
>   } catch (IOException e) {
>     // TODO Auto-generated catch block
>     e.printStackTrace();
>   }
>   return result;
> }
> 
> public void analyzeTokens(TokenStream result) {
>   try {
>     Token token = result.next();
>     while (token != null) {
>       String tokenStr = token.termText();
>       if (TokenHolder.tokenMap.get(tokenStr) == null) {
>         TokenHolder.tokenMap.put(tokenStr, 1);
>       } else {
>         TokenHolder.tokenMap.put(tokenStr,
>             Integer.parseInt(TokenHolder.tokenMap.get(tokenStr).toString()) + 1);
>       }
>       token = result.next();
>     }
>     // extra reset
>     result.reset();
>   } catch (IOException e) {
>     e.printStackTrace();
>   }
> }
> 
>
> Karl Wettin wrote:
> 
> 1 nov 2007 kl. 18.09 skrev Cool Coder:
>
> 
>> prior to adding into index
>> 
>
> Easiest way out would be to add the document to a temporary index and 
> extract the term frequency vector. I would recommend using MemoryIndex.
>
> You could also tokenize the document and pass the data to a 
> TermVectorMapper. You could consider replacing the fields of the 
> document with CachedTokenStreams if you got the RAM to spare and 
> don't want to waste CPU analyzing the document twice. I welcome 
> TermVectorMappingChachedTokenStreamFactory. Even cooler would be to 
> pass code down the IndexWriter.addDocument using a command pattern or 
> something, allowing one to extend the document at the time of the 
> analysis.
>
>
> 




 __________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

Re: Best way to count tokens

Posted by Mark Miller <ma...@gmail.com>.
reset is optional. StandardAnalyzer does not implement it. Check out 
CachingTokenFilter and wrap StandardAnalzyer in it.

Cool Coder wrote:
> Currently I have extended StandardAnalyzer and I am counting tokens in the following way. But the index is not getting created, though I call tokenStream.reset(). I am not sure whether reset() on the token stream works or not; I am debugging now.
>    
>   public TokenStream tokenStream(String fieldName, Reader reader) {
>     TokenStream result = super.tokenStream(fieldName, new HTMLStripReader(reader));
>     // To count tokens and put in a Map
>     analyzeTokens(result);
>     try {
>       result.reset();
>     } catch (IOException e) {
>       // TODO Auto-generated catch block
>       e.printStackTrace();
>     }
>     return result;
>   }
>   
>   public void analyzeTokens(TokenStream result) {
>     try {
>       Token token = result.next();
>       while (token != null) {
>         String tokenStr = token.termText();
>         if (TokenHolder.tokenMap.get(tokenStr) == null) {
>           TokenHolder.tokenMap.put(tokenStr, 1);
>         } else {
>           TokenHolder.tokenMap.put(tokenStr,
>               Integer.parseInt(TokenHolder.tokenMap.get(tokenStr).toString()) + 1);
>         }
>         token = result.next();
>       }
>       // extra reset
>       result.reset();
>     } catch (IOException e) {
>       e.printStackTrace();
>     }
>   }
>   
>
> Karl Wettin <ka...@gmail.com> wrote:
>   
> 1 nov 2007 kl. 18.09 skrev Cool Coder:
>
>   
>> prior to adding into index
>>     
>
> Easiest way out would be to add the document to a temporary index and 
> extract the term frequency vector. I would recommend using MemoryIndex.
>
> You could also tokenize the document and pass the data to a 
> TermVectorMapper. You could consider replacing the fields of the 
> document with CachedTokenStreams if you got the RAM to spare and 
> don't want to waste CPU analyzing the document twice. I welcome 
> TermVectorMappingChachedTokenStreamFactory. Even cooler would be to 
> pass code down the IndexWriter.addDocument using a command pattern or 
> something, allowing one to extend the document at the time of the 
> analysis.
>
>
>   



Re: Best way to count tokens

Posted by Cool Coder <te...@yahoo.com>.
Currently I have extended StandardAnalyzer and I am counting tokens in the following way. But the index is not getting created, though I call tokenStream.reset(). I am not sure whether reset() on the token stream works or not; I am debugging now.
   
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = super.tokenStream(fieldName, new HTMLStripReader(reader));
    // To count tokens and put in a Map
    analyzeTokens(result);
    try {
      result.reset();
    } catch (IOException e) {
      // TODO Auto-generated catch block
      e.printStackTrace();
    }
    return result;
  }

  public void analyzeTokens(TokenStream result) {
    try {
      Token token = result.next();
      while (token != null) {
        String tokenStr = token.termText();
        if (TokenHolder.tokenMap.get(tokenStr) == null) {
          TokenHolder.tokenMap.put(tokenStr, 1);
        } else {
          TokenHolder.tokenMap.put(tokenStr,
              Integer.parseInt(TokenHolder.tokenMap.get(tokenStr).toString()) + 1);
        }
        token = result.next();
      }
      // extra reset
      result.reset();
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
  

Karl Wettin <ka...@gmail.com> wrote:
  
1 nov 2007 kl. 18.09 skrev Cool Coder:

> prior to adding into index

Easiest way out would be to add the document to a temporary index and 
extract the term frequency vector. I would recommend using MemoryIndex.

You could also tokenize the document and pass the data to a 
TermVectorMapper. You could consider replacing the fields of the 
document with CachedTokenStreams if you got the RAM to spare and 
don't want to waste CPU analyzing the document twice. I welcome 
TermVectorMappingChachedTokenStreamFactory. Even cooler would be to 
pass code down the IndexWriter.addDocument using a command pattern or 
something, allowing one to extend the document at the time of the 
analysis.


-- 
karl





Re: Best way to count tokens

Posted by Karl Wettin <ka...@gmail.com>.
1 nov 2007 kl. 18.09 skrev Cool Coder:

> prior to adding into index

Easiest way out would be to add the document to a temporary index and  
extract the term frequency vector. I would recommend using MemoryIndex.

You could also tokenize the document and pass the data to a  
TermVectorMapper. You could consider replacing the fields of the  
document with CachedTokenStreams if you got the RAM to spare and  
don't want to waste CPU analyzing the document twice. I welcome  
TermVectorMappingChachedTokenStreamFactory. Even cooler would be to  
pass code down the IndexWriter.addDocument using a command pattern or  
something, allowing one to extend the document at the time of the  
analysis.


-- 
karl

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Best way to count tokens

Posted by Cool Coder <te...@yahoo.com>.
This is what I am looking for: counting prior to adding into the index, so that I can display in my site the first 10 tokens that have the maximum occurrences in my index. In other words, users can add weightage to these terms.
   
  - BR

Karl Wettin <ka...@gmail.com> wrote:
  
31 okt 2007 kl. 15.18 skrev Cool Coder:

> Hi Group,
> I need to display a list of tokens (tags) in my site 
> that have the maximum occurrences in my index. One way I can think 
> of is to keep track of all tokens during analysis and accordingly 
> display them. Is there any other way? e.g. if I want to display 
> tokens in order of their occurrences as well as their weightage.

Are you looking for the term frequency vector?

<http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/index/
IndexReader.html#getTermFreqVector(int,%20java.lang.String)>

If you are using 2.3 the TermVectorMapper might save you a couple of 
clock ticks sorting.


Or is this something you want to do prior to adding the document to 
the index?

-- 
karl





Re: Best way to count tokens

Posted by Karl Wettin <ka...@gmail.com>.
31 okt 2007 kl. 15.18 skrev Cool Coder:

> Hi Group,
>               I need to display a list of tokens (tags) in my site  
> that have the maximum occurrences in my index. One way I can think  
> of is to keep track of all tokens during analysis and accordingly  
> display them. Is there any other way? e.g. if I want to display  
> tokens in order of their occurrences as well as their weightage.

Are you looking for the term frequency vector?

<http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/index/ 
IndexReader.html#getTermFreqVector(int,%20java.lang.String)>

If you are using 2.3 the TermVectorMapper might save you a couple of  
clock ticks sorting.


Or is this something you want to do prior to adding the document to  
the index?

-- 
karl



Best way to count tokens

Posted by Cool Coder <te...@yahoo.com>.
Hi Group,
              I need to display a list of tokens (tags) in my site that have the maximum occurrences in my index. One way I can think of is to keep track of all tokens during analysis and display them accordingly. Is there any other way? E.g. if I want to display tokens in order of their occurrences as well as their weightage.
   
  regards,
  Ranjan
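The counting half of this question can be sketched in plain Java, independent of how the tokens are produced: tally each token into a map, then sort the entries by descending count to get the top-N "tags":

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Tally tokens and return the n most frequent, most frequent first.
class TokenCounter {
    static List<Map.Entry<String, Integer>> topN(List<String> tokens, int n) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String t : tokens) {
            Integer c = counts.get(t);
            counts.put(t, c == null ? 1 : c + 1);
        }
        List<Map.Entry<String, Integer>> entries =
                new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a,
                               Map.Entry<String, Integer> b) {
                return b.getValue() - a.getValue();  // descending by count
            }
        });
        return entries.subList(0, Math.min(n, entries.size()));
    }
}
```

As the replies below note, if the documents are already indexed, the term frequency vector gives you these counts without re-tokenizing.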


Re: Making Highlighter.mergeContiguousFragments() public

Posted by Mark Miller <ma...@gmail.com>.
Uh...ignore that last email...hit reply on the wrong one obviously...sorry.

Dave Golombek wrote:
> I was wondering if people thought that making
> Highlighter.mergeContiguousFragments() public (and non-final) would be
> acceptable. In my application, I want to strip all fragments with score == 0
> before merging the fragments (to get the minimal matching section, but still
> in order), and the easiest way to do so would be to override that method.
> Not a big deal, but I thought other people might find it useful. Making the
> method public static would also achieve the same result, allowing me to call
> the function separately.
>
> Thanks,
> Dave Golombek
> Black Duck Software, Inc.
> http://www.blackducksoftware.com 
>
>
>
