Posted to java-user@lucene.apache.org by "Kevin A. Burton" <bu...@newsmonster.org> on 2004/03/09 02:50:07 UTC

DocumentWriter, StopFilter should use HashMap... (patch)

I'm looking at StopFilter.java right now...

I did a kill -3 java and a number of my threads were blocked here:

"ksa-task-thread-34" prio=1 tid=0xad89fbe8 nid=0x1c6e waiting for monitor entry [b9bff000..b9bff8d0]
        at java.util.Hashtable.get(Hashtable.java:332)
        - waiting to lock <0x61569720> (a java.util.Hashtable)
        at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:94)
        at org.apache.lucene.index.DocumentWriter.invertDocument(DocumentWriter.java:170)
        at org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:111)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:257)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:244)
        at ksa.index.AdvancedIndexWriter.addDocument(AdvancedIndexWriter.java:136)
        at ksa.robot.FeedTaskParserListener.onItemEnd(FeedTaskParserListener.java:331)

Is there ANY reason to keep this as a Hashtable?  It's just preventing 
inversion across multiple threads.  They all have to lock on this hashtable.

Note that this guy is initialized ONCE and no more puts take place so I 
don't see why not.  It's readonly after the StopFilter is created.

I think this might really end up speeding up indexing a bit.  No hard 
benchmarks yet though.  Right now though it's just an inefficiency that 
should be removed.
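
To make the read-only argument concrete, here is a minimal sketch. It is not
the attached patch: the class name SharedStopWords and the word list are
made up for illustration, and it assumes a current JDK (generics, lambdas)
rather than the 1.4 of this thread. The point it shows is that a hash-based
collection that is filled once and never modified afterwards can be read from
several threads without the per-call monitor that Hashtable.get takes.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class; not part of Lucene.
final class SharedStopWords {
    // Built once in a static initializer and never mutated afterwards,
    // so concurrent readers do not need to lock.
    static final Set<String> STOP_WORDS = Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of")));

    // Counts the tokens a stop filter would let through.
    static int countKept(String[] tokens) {
        int kept = 0;
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) throws InterruptedException {
        final String[] tokens = {"the", "index", "of", "a", "feed"};
        Thread[] workers = new Thread[4];
        final int[] results = new int[workers.length];
        for (int i = 0; i < workers.length; i++) {
            final int slot = i;
            // Every worker reads the same set; none of them blocks the others.
            workers[i] = new Thread(() -> results[slot] = countKept(tokens));
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        System.out.println(Arrays.toString(results));
    }
}
```

This only holds while nobody writes to the set after construction; wrapping
it in Collections.unmodifiableSet is one way to enforce that.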

I've attached a quick implementation. 

Kevin

-- 

Please reply using PGP:

    http://peerfear.org/pubkey.asc    

    NewsMonster - http://www.newsmonster.org/
    
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
       AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
  IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster


Re: DocumentWriter, StopFilter should use HashMap... (patch)

Posted by Incze Lajos <in...@mail.matav.hu>.
> This would no longer compile with the change Kevin proposes.
> 
> To make things back-compatible we must:
> 
> 1. Keep but deprecate StopFilter(Hashtable) constructor;
> 2. Keep but deprecate StopFilter.makeStopTable(String[]);
> 3. Add a new constructor: StopFilter(HashMap);

If you used StopFilter(Map), it would be back-compatible
for users passing a Hashtable to the constructor. I'm not sure
about older Java versions, but in Java 1.4 Hashtable implements
Map. (And OTOH, why HashMap and not Map?)

> 4. Add a new method: StopFilter.makeStopMap(String[]);
> 
> Does that make sense?
> 
> Doug


incze
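
The Map suggestion above can be sketched as follows. MapParamDemo is a
hypothetical stand-in, not Lucene's StopFilter, and the code assumes a
current JDK with generics. It shows that a constructor declared against
java.util.Map accepts both a legacy Hashtable and a HashMap, because
Hashtable implements Map.

```java
import java.util.HashMap;
import java.util.Hashtable;
import java.util.Map;

// Hypothetical stand-in for a filter constructor; not Lucene's StopFilter.
final class MapParamDemo {
    private final Map<String, String> stopTable;

    // Declared against the Map interface, so any implementation is accepted.
    MapParamDemo(Map<String, String> stopTable) {
        this.stopTable = stopTable;
    }

    boolean isStopWord(String term) {
        return stopTable.containsKey(term);
    }

    public static void main(String[] args) {
        // Old-style caller that still builds a Hashtable.
        Hashtable<String, String> legacy = new Hashtable<>();
        legacy.put("the", "the");

        // New-style caller using an unsynchronized HashMap.
        Map<String, String> modern = new HashMap<>();
        modern.put("the", "the");

        System.out.println(new MapParamDemo(legacy).isStopWord("the"));
        System.out.println(new MapParamDemo(modern).isStopWord("index"));
    }
}
```

Note that this is source compatibility only. Callers that were compiled
against a Hashtable signature bind to that exact signature, so they would
need to be recompiled, which is the concern raised elsewhere in this thread.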

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: DocumentWriter, StopFilter should use HashMap... (patch)

Posted by "Kevin A. Burton" <bu...@newsmonster.org>.
David Spencer wrote:

>
> Maybe I missed something but I always thought the stop list should be 
> a Set, not a Map (or Hashtable/Dictionary). After all, all you need to 
> know is existence and that's what a Set does.

It stores the word as the key and the value...

I don't care either way... There was no HashSet back when this was 
written. I was just going to leave it as a HashMap so that in the future 
if we ever wanted to change the value we could...

Either way.



Re: DocumentWriter, StopFilter should use HashMap... (patch)

Posted by "Kevin A. Burton" <bu...@newsmonster.org>.
Doug Cutting wrote:

> David Spencer wrote:
>
>> Maybe I missed something but I always thought the stop list should be 
>> a Set, not a Map (or Hashtable/Dictionary). After all, all you need 
>> to know is existence and that's what a Set does.
>
>
> Good point.

It's easy to migrate to a HashSet... either way...   I was thinking 
about the same thing myself...

Kevin



Re: DocumentWriter, StopFilter should use HashMap... (patch)

Posted by Doug Cutting <cu...@apache.org>.
David Spencer wrote:
> Maybe I missed something but I always thought the stop list should be a 
> Set, not a Map (or Hashtable/Dictionary). After all, all you need to 
> know is existence and that's what a Set does.

Good point.

Doug



Re: DocumentWriter, StopFilter should use HashMap... (patch)

Posted by David Spencer <da...@tropo.com>.
Maybe I missed something but I always thought the stop list should be a 
Set, not a Map (or Hashtable/Dictionary). After all, all you need to 
know is existence and that's what a Set does.
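
As a rough illustration of the point about existence checks, the sketch
below filters a token list against a Set. The class and method names are
invented for this example and it assumes a current JDK; it is not the
StopFilter code. A Set stores each word once and answers contains(), which
is all a stop list needs, whereas the Hashtable version stores the word as
both key and value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo class; not part of Lucene.
final class SetStopListDemo {
    // Keeps only the tokens that are absent from the stop set.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "a", "of"));
        List<String> tokens = Arrays.asList("the", "quick", "fox", "of", "a");
        System.out.println(removeStopWords(tokens, stopWords));
    }
}
```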

Doug Cutting wrote:

> Erik Hatcher wrote:
> 
>> Well, one issue you didn't consider is changing a public method 
>> signature.  I will make this change, but leave the Hashtable signature 
>> method there.  I suppose we could change the signature to use a Map 
>> instead, but I believe there are some issues with doing something like 
>> this if you do not recompile your own source code against a new Lucene 
>> JAR.... so I will simply provide another signature too.
> 
> 
> This is also a problem for folks who're implementing analyzers which use 
> StopFilter.  For example:
> 
> public class MyAnalyzer extends Analyzer {
> 
>   private static Hashtable stopTable =
>     StopFilter.makeStopTable(stopWords);
> 
>   public TokenStream tokenStream(String field, Reader reader) {
>     ... new StopFilter(stopTable) ...
>   }
> }
> 
> This would no longer compile with the change Kevin proposes.
> 
> To make things back-compatible we must:
> 
> 1. Keep but deprecate StopFilter(Hashtable) constructor;
> 2. Keep but deprecate StopFilter.makeStopTable(String[]);
> 3. Add a new constructor: StopFilter(HashMap);
> 4. Add a new method: StopFilter.makeStopMap(String[]);
> 
> Does that make sense?
> 
> Doug
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 
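
The four back-compatibility steps listed in the quoted message can be
sketched like this. CompatStopFilter is a hypothetical class written for
illustration on a current JDK; it is not the committed Lucene code and it
leaves out TokenStream entirely. The old Hashtable entry points stay but are
marked deprecated, and HashMap-based counterparts are added next to them.

```java
import java.util.HashMap;
import java.util.Hashtable;
import java.util.Map;

// Hypothetical class illustrating the four steps; not Lucene's StopFilter.
final class CompatStopFilter {
    private final Map<String, String> table;

    /** @deprecated step 1: kept so that existing callers still compile. */
    @Deprecated
    CompatStopFilter(Hashtable<String, String> stopTable) {
        // Copy once so that lookups no longer go through Hashtable's lock.
        this.table = new HashMap<>(stopTable);
    }

    // Step 3: the new constructor.
    CompatStopFilter(HashMap<String, String> stopTable) {
        this.table = stopTable;
    }

    /** @deprecated step 2: kept for existing callers. */
    @Deprecated
    static Hashtable<String, String> makeStopTable(String[] stopWords) {
        Hashtable<String, String> result = new Hashtable<>(stopWords.length);
        for (String word : stopWords) {
            result.put(word, word);
        }
        return result;
    }

    // Step 4: the new factory method.
    static HashMap<String, String> makeStopMap(String[] stopWords) {
        HashMap<String, String> result = new HashMap<>(stopWords.length);
        for (String word : stopWords) {
            result.put(word, word);
        }
        return result;
    }

    boolean isStopWord(String term) {
        return table.get(term) != null;
    }

    public static void main(String[] args) {
        String[] words = {"the", "a"};
        CompatStopFilter oldStyle = new CompatStopFilter(makeStopTable(words));
        CompatStopFilter newStyle = new CompatStopFilter(makeStopMap(words));
        System.out.println(oldStyle.isStopWord("the") + " " + newStyle.isStopWord("the"));
    }
}
```

Because Hashtable and HashMap are unrelated classes, the two constructors
never compete for the same argument, so an analyzer like the MyAnalyzer
example keeps compiling unchanged.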




Re: Date Range and proximity search

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Please keep discussion on the e-mail list.  I'm not familiar with the 
fields and types that Nutch uses first-hand, but you should take this 
up on the Nutch list rather than the Lucene list.

I do know that Nutch uses a custom query parser, so perhaps it does not 
allow range and proximity queries?

	Erik


On Mar 14, 2004, at 10:03 PM, redpineseed wrote:

> hi Erik,
>
> thanks for the reply. I did not do a new index. I just used the sample 
> index
> downloaded from nutch site.  it would be great if you have some working
> query strings to show me. thanks.
>
> philip
>
>
> ----- Original Message -----
> From: "Erik Hatcher" <er...@ehatchersolutions.com>
> To: "Lucene Users List" <lu...@jakarta.apache.org>
> Sent: Sunday, March 14, 2004 4:00 PM
> Subject: Re: Date Range and proximity search
>
>
> To be honest, I'm way out of the loop on the demo, and it needs to be
> re-written.  It is on my to-do list!
>
> But, date range and proximity searches most definitely work.  Can you
> be more specific about what you index and how you searched?  Perhaps
> even a working test case?
>
> Erik
>
>
> On Mar 14, 2004, at 3:52 PM, redpineseed wrote:
>> hi all,
>>
>> I tried the demo, and Date range and proximity search did not return
>> anything. are these two features functioning at all? tia
>>
>> Philip
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>




TO Erik Hatcher Re: Date Range and proximity search

Posted by redpineseed <re...@telus.net>.
hi Erik,

thanks for the reply. I did not do a new index. I just used the sample index
downloaded from nutch site.  it would be great if you have some working
query strings to show me. thanks.

philip


----- Original Message ----- 
From: "Erik Hatcher" <er...@ehatchersolutions.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Sunday, March 14, 2004 4:00 PM
Subject: Re: Date Range and proximity search


To be honest, I'm way out of the loop on the demo, and it needs to be
re-written.  It is on my to-do list!

But, date range and proximity searches most definitely work.  Can you
be more specific about what you index and how you searched?  Perhaps
even a working test case?

Erik


On Mar 14, 2004, at 3:52 PM, redpineseed wrote:
> hi all,
>
> I tried the demo, and Date range and proximity search did not return
> anything. are these two features functioning at all? tia
>
> Philip
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org




Re: Date Range and proximity search

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
To be honest, I'm way out of the loop on the demo, and it needs to be 
re-written.  It is on my to-do list!

But, date range and proximity searches most definitely work.  Can you 
be more specific about what you index and how you searched?  Perhaps 
even a working test case?

	Erik


On Mar 14, 2004, at 3:52 PM, redpineseed wrote:
> hi all,
>
> I tried the demo, and Date range and proximity search did not return
> anything. are these two features functioning at all? tia
>
> Philip
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org




Date Range and proximity search

Posted by redpineseed <re...@telus.net>.
hi all,

I tried the demo, and Date range and proximity search did not return
anything. are these two features functioning at all? tia

Philip





Re: DocumentWriter, StopFilter should use HashMap... (patch)

Posted by Stephane James Vaucher <va...@cirano.qc.ca>.
Just found the rest of the thread. I'll shut up now ;)

sv

On Sun, 14 Mar 2004, Stephane James Vaucher wrote:

> Back from a week's vacation, so this reply is a little late, maybe out of
> order as well ;). Comment inline:
>
> On Tue, 9 Mar 2004, Kevin A. Burton wrote:
>
> > Doug Cutting wrote:
> >
> > > Erik Hatcher wrote:
> > >
> > >> Well, one issue you didn't consider is changing a public method
> > >> signature.  I will make this change, but leave the Hashtable
> > >> signature method there.  I suppose we could change the signature to
> > >> use a Map instead, but I believe there are some issues with doing
> > >> something like this if you do not recompile your own source code
> > >> against a new Lucene JAR.... so I will simply provide another
> > >> signature too.
> > >
> > >
> > > This would no longer compile with the change Kevin proposes.
> > >
> > > To make things back-compatible we must:
> > >
> > > 1. Keep but deprecate StopFilter(Hashtable) constructor;
> > > 2. Keep but deprecate StopFilter.makeStopTable(String[]);
> > > 3. Add a new constructor: StopFilter(HashMap);
> > > 4. Add a new method: StopFilter.makeStopMap(String[]);
>
> Why impose implementation details in the constructor? Shouldn't the
> constructor use a Map (not a HashMap), a Set, or a String array?
>
> sv
>
> > >
> > > Does that make sense?
> > >
> > This patch and attachment take care of this problem...
> >
> > It does make this class more complex than it needs to be... but 1/2 of
> > the methods are deprecated.
> >
> > Kevin
> >
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>




Re: DocumentWriter, StopFilter should use HashMap... (patch)

Posted by Stephane James Vaucher <va...@cirano.qc.ca>.
Back from a week's vacation, so this reply is a little late, maybe out of
order as well ;). Comment inline:

On Tue, 9 Mar 2004, Kevin A. Burton wrote:

> Doug Cutting wrote:
>
> > Erik Hatcher wrote:
> >
> >> Well, one issue you didn't consider is changing a public method
> >> signature.  I will make this change, but leave the Hashtable
> >> signature method there.  I suppose we could change the signature to
> >> use a Map instead, but I believe there are some issues with doing
> >> something like this if you do not recompile your own source code
> >> against a new Lucene JAR.... so I will simply provide another
> >> signature too.
> >
> >
> > This would no longer compile with the change Kevin proposes.
> >
> > To make things back-compatible we must:
> >
> > 1. Keep but deprecate StopFilter(Hashtable) constructor;
> > 2. Keep but deprecate StopFilter.makeStopTable(String[]);
> > 3. Add a new constructor: StopFilter(HashMap);
> > 4. Add a new method: StopFilter.makeStopMap(String[]);

Why impose implementation details in the constructor? Shouldn't the
constructor use a Map (not a HashMap), a Set, or a String array?

sv

> >
> > Does that make sense?
> >
> This patch and attachment take care of this problem...
>
> It does make this class more complex than it needs to be... but 1/2 of
> the methods are deprecated.
>
> Kevin
>
>




Re: DocumentWriter, StopFilter should use HashMap... (patch)

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Part of the dilemma of which implementation actually gets used will be 
solved implicitly, since our function to construct the Set will return a 
HashSet - and this will surely be the method most folks use.  But 
I will be sure to note in the Javadoc that the implementation of the 
Set is important.

	Erik

On Mar 11, 2004, at 5:22 PM, Kevin A. Burton wrote:

> Erik Hatcher wrote:
>
>> I will refactor again using Set with no copying this time (except for 
>> the String[] and Hashtable constructors).  This was my original 
>> preference, but I got caught up in the arguments by Kevin and lost my 
>> ideals temporarily :)
>>
>> I expect to do this later tonight or tomorrow.
>
> How about this as a compromise...
>
> No copy on constructor... use a Set but in the documentation summarize 
> this conversation and point out that the user should pass a HashSet and 
> NOT any other type of Set, since any other type will result in a copy.
>
> I think Doug's comment about a potentially faster impl in the future 
> was a good point...
>
> Kevin
>
> -- 
>
> Please reply using PGP.
>
>    http://peerfear.org/pubkey.asc          NewsMonster - 
> http://www.newsmonster.org/
>    Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
>       AIM/YIM - sfburtonator,  Web - http://peerfear.org/
> GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
>  IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster
>




Re: DocumentWriter, StopFilter should use HashMap... (patch)

Posted by "Kevin A. Burton" <bu...@newsmonster.org>.
Erik Hatcher wrote:

> I will refactor again using Set with no copying this time (except for 
> the String[] and Hashtable constructors).  This was my original 
> preference, but I got caught up in the arguments by Kevin and lost my 
> ideals temporarily :)
>
> I expect to do this later tonight or tomorrow.

How about this as a compromise...

No copy on constructor... use a Set but in the documentation summarize 
this conversation and point out that the user should pass a HashSet and 
NOT any other type of Set, since any other type will result in a copy.

I think Doug's comment about a potentially faster impl in the future was 
a good point...

Kevin



Re: DocumentWriter, StopFilter should use HashMap... (patch)

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
I will refactor again using Set with no copying this time (except for 
the String[] and Hashtable constructors).  This was my original 
preference, but I got caught up in the arguments by Kevin and lost my 
ideals temporarily :)

I expect to do this later tonight or tomorrow.

	Erik


On Mar 11, 2004, at 12:04 PM, Doug Cutting wrote:

> Erik Hatcher wrote:
>> Yes, I saw it.  But is there a reason not to just expose HashSet 
>> given that it is the data structure that is most efficient?  I bought 
>> into Kevin's arguments that it made sense to just expose HashSet.
>
> Just the general principle that one shouldn't expose more of the 
> implementation than one must.  I can imagine faster things than a 
> HashSet for this, e.g., a well-coded letter tree (trie) could be a bit 
> faster, since it would only touch each character in the key once.  But 
> it's not a big deal, perhaps not worth fixing at this point.
>
> I proposed a solution that both respected this concern (yours, as I 
> recall) while at the same time avoiding copying.  It doesn't need to 
> be an either/or situation.  We can easily hide the implementation, 
> avoid copying, and use the most efficient implementation internally.  
> If you no longer care about hiding the implementation, then I guess 
> this is moot.  Before we started this exercise the implementation was 
> exposed, so things have gotten no worse, only better.  But they could 
> have gotten just a little bit better yet!
>
> Cheers,
>
> Doug
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org




Re: DocumentWriter, StopFilter should use HashMap... (patch)

Posted by Doug Cutting <cu...@apache.org>.
Erik Hatcher wrote:
> Yes, I saw it.  But is there a reason not to just expose HashSet given 
> that it is the data structure that is most efficient?  I bought into 
> Kevin's arguments that it made sense to just expose HashSet.

Just the general principle that one shouldn't expose more of the 
implementation than one must.  I can imagine faster things than a 
HashSet for this, e.g., a well-coded letter tree (trie) could be a bit 
faster, since it would only touch each character in the key once.  But 
it's not a big deal, perhaps not worth fixing at this point.

I proposed a solution that both respected this concern (yours, as I 
recall) while at the same time avoiding copying.  It doesn't need to be 
an either/or situation.  We can easily hide the implementation, avoid 
copying, and use the most efficient implementation internally.  If you 
no longer care about hiding the implementation, then I guess this is 
moot.  Before we started this exercise the implementation was exposed, 
so things have gotten no worse, only better.  But they could have gotten 
just a little bit better yet!

Cheers,

Doug
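
For readers unfamiliar with the letter tree mentioned above, here is a
minimal sketch of the idea. StopWordTrie is a hypothetical class, assuming a
current JDK, and it is deliberately untuned: it stores child links in a
HashMap per node, so it would not beat HashSet in practice. A version aimed
at speed would more likely use arrays indexed by character. What it does show
is the property described in the message: a lookup walks the key one
character at a time and stops at the first character that has no branch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical letter tree (trie); an illustration, not a tuned implementation.
final class StopWordTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true if a stop word ends at this node
    }

    private final Node root = new Node();

    void add(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            // Create the branch for this character if it does not exist yet.
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.terminal = true;
    }

    boolean contains(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.get(word.charAt(i));
            if (node == null) {
                return false; // no stop word continues with this character
            }
        }
        return node.terminal;
    }

    public static void main(String[] args) {
        StopWordTrie trie = new StopWordTrie();
        trie.add("the");
        trie.add("then");
        System.out.println(trie.contains("the") + " " + trie.contains("th"));
    }
}
```

Since such a structure is not a HashSet, it could only be swapped in later if
the public signature takes something more general, which is the argument
for not exposing HashSet.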



Re: DocumentWriter, StopFilter should use HashMap... (patch)

Posted by "Kevin A. Burton" <bu...@newsmonster.org>.
Scott ganyo wrote:

> I don't buy it.  HashSet is but one implementation of a Set.  By 
> choosing the HashSet implementation you are not only tying the class 
> to a hash-based implementation, you are tying the interface to *that 
> specific* hash-based implementation or its subclasses.  In the end, 
> either you buy the concept of the interface and its abstraction or you 
> don't.  I firmly believe in using interfaces as they were intended to 
> be used.

An interface isn't just the concept of a Java interface but ALSO the 
implied and required semantics.

TreeSet, etc. are too slow to be used with the StopFilter, so we should 
prevent their use. 

We require HashSet/Map...

> Scott
>
> P.S. In fact, HashSet isn't always going to be the most efficient 
> anyway.  Just for one example:  Consider possible implementations if I 
> have only 1 or 2 entries.
>
HashSet is not always the most efficient... if you need to do runtime 
inserts and bulk removal TreeSet/Map might be more efficient.  Also if 
you need to sort the map then you're stuck with a tree.

Kevin



Re: DocumentWriter, StopFilter should use HashMap... (patch)

Posted by Scott ganyo <sc...@ganyo.com>.
I don't buy it.  HashSet is but one implementation of a Set.  By 
choosing the HashSet implementation you are not only tying the class to 
a hash-based implementation, you are tying the interface to *that 
specific* hash-based implementation or its subclasses.  In the end, 
either you buy the concept of the interface and its abstraction or you 
don't.  I firmly believe in using interfaces as they were intended to 
be used.

Scott

P.S. In fact, HashSet isn't always going to be the most efficient 
anyway.  Just for one example:  Consider possible implementations if I 
have only 1 or 2 entries.

On Mar 10, 2004, at 11:13 PM, Erik Hatcher wrote:

> On Mar 10, 2004, at 10:28 PM, Doug Cutting wrote:
>> Erik Hatcher wrote:
>>>> Also... your HashSet constructor has to copy values from the 
>>>> original HashSet into the new HashSet ... not very clean and this 
>>>> can just be removed by forcing the caller to use a HashSet (which 
>>>> they should).
>>> I've caved in and gone HashSet all the way.
>>
>> Did you not see my message suggesting a way to both not expose 
>> HashSet publicly and also not to copy values?  If not, I attached it.
>
> Yes, I saw it.  But is there a reason not to just expose HashSet given 
> that it is the data structure that is most efficient?  I bought into 
> Kevin's arguments that it made sense to just expose HashSet.
>
> As for copying values - that is only happening now if you use the 
> Hashtable or String[] constructor.
>
> 	Erik
>
>
>>
>> Doug
>>
>>
>>
>> From: Doug Cutting <cu...@apache.org>
>> Date: March 10, 2004 1:08:24 PM EST
>> To: Lucene Developers List <lu...@jakarta.apache.org>
>> Subject: Re: cvs commit: 
>> jakarta-lucene/src/java/org/apache/lucene/analysis StopFilter.java
>> Reply-To: "Lucene Developers List" <lu...@jakarta.apache.org>
>>
>>
>> ehatcher@apache.org wrote:
>>>   -  public StopFilter(TokenStream in, Set stopTable) {
>>>   +  public StopFilter(TokenStream in, Set stopWords) {
>>>        super(in);
>>>   -    table = stopTable;
>>>   +    this.stopWords = new HashSet(stopWords);
>>>      }
>>
>> This always allocates a new HashSet, which, if the stop list is 
>> large, and documents are small, could impact performance.
>>
>> Perhaps we can replace this with something like:
>>
>> public StopFilter(TokenStream in, Set stopWords) {
>>   this(in, stopWords instanceof HashSet ? ((HashSet)stopWords)
>>            : new HashSet(stopWords));
>> }
>>
>> and then add another constructor:
>>
>> private StopFilter(TokenStream in, HashSet stopWords) {
>>   super(in);
>>   this.stopWords = stopWords;
>> }
>>
>> Also, if we want the implementation to always be a HashSet 
>> internally, for performance, we ought to declare the field to be a 
>> HashSet, no?
>>
>> The competing goals here are:
>>   1. Not to expose publicly the implementation of the Set;
>>   2. Not to copy the contents of the Set when folks pass the value of 
>> makeStopSet.
>>   3. Use the most efficient implementation internally.
>>
>> I think the changes above meet all of these.
>>
>> Doug
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
>> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
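
The constructor pair proposed in the forwarded message can be tried out in
isolation with the sketch below. NoCopyStopFilter is a hypothetical class
that mirrors the shape of the proposal but drops TokenStream, and it assumes
a current JDK with generics. The sharesStorageWith method exists only to make
the copy/no-copy behaviour observable and is not part of the proposal.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical class mirroring the proposal; it omits TokenStream.
final class NoCopyStopFilter {
    // Goal 3: the field is always a HashSet internally.
    private final HashSet<String> stopWords;

    // Goal 1: callers only see Set. Goal 2: a HashSet is reused as-is,
    // any other Set implementation is copied once.
    NoCopyStopFilter(Set<String> stopWords) {
        this(stopWords instanceof HashSet
                ? (HashSet<String>) stopWords
                : new HashSet<String>(stopWords));
    }

    private NoCopyStopFilter(HashSet<String> stopWords) {
        this.stopWords = stopWords;
    }

    boolean isStopWord(String term) {
        return stopWords.contains(term);
    }

    // Demo-only helper: true if the filter kept the caller's instance.
    boolean sharesStorageWith(Set<String> candidate) {
        return stopWords == candidate;
    }

    public static void main(String[] args) {
        Set<String> hashed = new HashSet<>(Arrays.asList("the", "a"));
        Set<String> sorted = new TreeSet<>(Arrays.asList("the", "a"));
        System.out.println(new NoCopyStopFilter(hashed).sharesStorageWith(hashed));
        System.out.println(new NoCopyStopFilter(sorted).sharesStorageWith(sorted));
    }
}
```

One consequence of reusing the caller's HashSet is that later changes to that
set are visible to the filter, so this relies on the stop list being treated
as read-only after construction, as described at the start of the thread.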

Re: DocumentWriter, StopFilter should use HashMap... (patch)

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Mar 10, 2004, at 10:28 PM, Doug Cutting wrote:
> Erik Hatcher wrote:
>>> Also... your HashSet constructor has to copy values from the 
>>> original HashSet into the new HashSet ... not very clean and this 
>>> can just be removed by forcing the caller to use a HashSet (which 
>>> they should).
>> I've caved in and gone HashSet all the way.
>
> Did you not see my message suggesting a way to both not expose HashSet 
> publicly and also not to copy values?  If not, I attached it.

Yes, I saw it.  But is there a reason not to just expose HashSet given 
that it is the data structure that is most efficient?  I bought into 
Kevin's arguments that it made sense to just expose HashSet.

As for copying values - that is only happening now if you use the 
Hashtable or String[] constructor.

	Erik


>
> Doug
>
>
>
> From: Doug Cutting <cu...@apache.org>
> Date: March 10, 2004 1:08:24 PM EST
> To: Lucene Developers List <lu...@jakarta.apache.org>
> Subject: Re: cvs commit: 
> jakarta-lucene/src/java/org/apache/lucene/analysis StopFilter.java
> Reply-To: "Lucene Developers List" <lu...@jakarta.apache.org>
>
>
> ehatcher@apache.org wrote:
>>   -  public StopFilter(TokenStream in, Set stopTable) {
>>   +  public StopFilter(TokenStream in, Set stopWords) {
>>        super(in);
>>   -    table = stopTable;
>>   +    this.stopWords = new HashSet(stopWords);
>>      }
>
> This always allocates a new HashSet, which, if the stop list is large, 
> and documents are small, could impact performance.
>
> Perhaps we can replace this with something like:
>
> public StopFilter(TokenStream in, Set stopWords) {
>   this(in, stopWords instanceof HashSet ? ((HashSet)stopWords)
>            : new HashSet(stopWords));
> }
>
> and then add another constructor:
>
> private StopFilter(TokenStream in, HashSet stopWords) {
>   super(in);
>   this.stopWords = stopWords;
> }
>
> Also, if we want the implementation to always be a HashSet internally, 
> for performance, we ought to declare the field to be a HashSet, no?
>
> The competing goals here are:
>   1. Not to expose publicly the implementation of the Set;
>   2. Not to copy the contents of the Set when folks pass the value of 
> makeStopSet.
>   3. Use the most efficient implementation internally.
>
> I think the changes above meet all of these.
>
> Doug
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org




Re: DocumentWriter, StopFilter should use HashMap... (patch)

Posted by "Kevin A. Burton" <bu...@newsmonster.org>.
Doug Cutting wrote:

> Erik Hatcher wrote:
>
>>> Also... your HashSet constructor has to copy values from the 
>>> original HashSet into the new HashSet ... not very clean and this 
>>> can just be removed by forcing the caller to use a HashSet (which 
>>> they should).
>>
>>
>> I've caved in and gone HashSet all the way.
>
>
> Did you not see my message suggesting a way to both not expose HashSet 
> publicly and also not to copy values?  If not, I attached it.
>
For the record I didn't see it... but it echoes my points...

Thanks!

Kevin



Re: DocumentWriter, StopFilter should use HashMap... (patch)

Posted by Doug Cutting <cu...@apache.org>.
Erik Hatcher wrote:
>> Also... your HashSet constructor has to copy values from the
>> original HashSet into the new HashSet ... not very clean and this can 
>> just be removed by forcing the caller to use a HashSet (which they 
>> should).
> 
> I've caved in and gone HashSet all the way.

Did you not see my message suggesting a way to both not expose HashSet 
publicly and also not to copy values?  If not, I attached it.

Doug



Re: DocumentWriter, StopFilter should use HashMap... (patch)

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Mar 10, 2004, at 2:59 PM, Kevin A. Burton wrote:
>> I refuse to expose HashSet... sorry!  :)  But I did wrap what is 
>> passed in, like above, in a HashSet in my latest commit.
>
> Hm... You're doing this EVEN if the caller passes a HashSet directly?!

Well, it was in the ctor.  But I guess I'm not seeing how often the
filter is constructed, which is what would make this a performance hit.

> Why do you have a problem exposing a HashSet/Map... it SHOULD be a 
> Hash based implementation.  Doing anything else is just wrong and 
> would seriously slow down Lucene indexing.

Just semantically, it is a "set" of stop words - so in theory it 
shouldn't matter the actual implementation.  I'm an interface purist at 
heart.

> Also... your HashSet constructor has to copy values from the
> original HashSet into the new HashSet ... not very clean and this can 
> just be removed by forcing the caller to use a HashSet (which they 
> should).

I've caved in and gone HashSet all the way.

	Erik




Re: DocumentWriter, StopFilter should use HashMap... (patch)

Posted by "Kevin A. Burton" <bu...@newsmonster.org>.
Erik Hatcher wrote:

>
>> Also... while you're at it... the private variable name is 'table' 
>> which this HashSet certainly is *not* ;)
>
>
> Well, depends on your definition of 'table' I suppose :)  I changed it 
> to a type-agnostic stopWords.

Did you know that internally HashSet uses a HashMap?

I sure didn't!

hashset.contains() maps to hashmap.containsKey()

It uses a key -> value mapping to a generic PRESENT Object... hm. 
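
For readers who have not looked at the JDK source, the idea can be sketched roughly as below. This is a simplified illustration of a set layered over a map, not the actual java.util.HashSet code; the class name MapBackedSet is made up for the example.

```java
import java.util.HashMap;

// Simplified illustration of a set built on top of a HashMap.
final class MapBackedSet<E> {
    // One shared dummy value stored against every key.
    private static final Object PRESENT = new Object();

    private final HashMap<E, Object> map = new HashMap<E, Object>();

    // put() returns the previous value, so null means the key was new.
    boolean add(E element) {
        return map.put(element, PRESENT) == null;
    }

    // Membership is just a key lookup in the backing map.
    boolean contains(Object element) {
        return map.containsKey(element);
    }

    int size() {
        return map.size();
    }
}
```

So for lookup cost a HashSet and a HashMap behave the same; the Set simply states the intent (membership only) more clearly.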

>> Probably makes sense to just call this variable 'hashset' and then 
>> force the type to be HashSet since it's necessary for this to be a 
>> HashSet to maintain any decent performance.  You'll need to update 
>> your second constructor to require a HashSet too.. would be very bad 
>> to let callers use another set impl... TreeSet and SortedSet would 
>> still be too slow...
>
> I refuse to expose HashSet... sorry!  :)  But I did wrap what is 
> passed in, like above, in a HashSet in my latest commit. 

Hm... You're doing this EVEN if the caller passes a HashSet directly?!

Why do you have a problem exposing a HashSet/Map... it SHOULD be a Hash 
based implementation.  Doing anything else is just wrong and would 
seriously slow down Lucene indexing.

Also... your HashSet constructor has to copy values from the original
HashSet into the new HashSet ... not very clean and this can just be 
removed by forcing the caller to use a HashSet (which they should).

:)

Kevin



Re: DocumentWriter, StopFilter should use HashMap... (patch)

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Mar 9, 2004, at 10:23 PM, Kevin A. Burton wrote:
> You need do make it a HashSet:
>
>   table = new HashSet( stopTable.keySet() );

Done.

> Also... while you're at it... the private variable name is 'table' 
> which this HashSet certainly is *not* ;)

Well, depends on your definition of 'table' I suppose :)  I changed it 
to a type-agnostic stopWords.

> Probably makes sense to just call this variable 'hashset' and then 
> force the type to be HashSet since it's necessary for this to be a 
> HashSet to maintain any decent performance.  You'll need to update 
> your second constructor to require a HashSet too.. would be very bad 
> to let callers use another set impl... TreeSet and SortedSet would 
> still be too slow...

I refuse to expose HashSet... sorry!  :)  But I did wrap what is passed 
in, like above, in a HashSet in my latest commit.

	Erik




Re: DocumentWriter, StopFilter should use HashMap... (patch)

Posted by "Kevin A. Burton" <bu...@newsmonster.org>.
Erik Hatcher wrote:

> Kevin - I've made this change and committed it, using a Set.
>
> Let me know if there are any issues with what I've committed - I 
> believe I've faithfully preserved backwards compatibility.

Actually... Erik.. I don't think your Hashtable constructor will work...

By default Hashtable.keySet() returns a SynchronizedSet (on JDK 1.4.2),
so we're back to where we started:

>  public StopFilter(TokenStream in, Hashtable stopTable) {
>    super(in);
>    table = stopTable.keySet();
>  }
>  
>
You need to make it a HashSet:

   table = new HashSet( stopTable.keySet() );
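
A small sketch of the difference, written with generics for readability. The class and method names are hypothetical; only Hashtable.keySet() and the HashSet copy constructor come from the JDK. The key set returned by Hashtable is a view that still locks on the table, while the copy is an independent, unsynchronized HashSet.

```java
import java.util.HashSet;
import java.util.Hashtable;
import java.util.Set;

// Hypothetical helper contrasting the two ways of obtaining the stop words.
final class KeySetCopySketch {
    // A live view backed by the Hashtable: it reflects later changes to the
    // table and keeps the table's synchronization.
    static Set<String> liveView(Hashtable<String, String> stopTable) {
        return stopTable.keySet();
    }

    // An independent snapshot: a plain HashSet with no locking on lookups.
    static Set<String> snapshot(Hashtable<String, String> stopTable) {
        return new HashSet<String>(stopTable.keySet());
    }
}
```

The copy costs one pass over the keys at construction time, which matters only if filters are created very frequently.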

Also... while you're at it... the private variable name is 'table' which 
this HashSet certainly is *not* ;)

Probably makes sense to just call this variable 'hashset' and then force 
the type to be HashSet since it's necessary for this to be a HashSet to 
maintain any decent performance.  You'll need to update your second 
constructor to require a HashSet too.. would be very bad to let callers 
use another set impl... TreeSet and SortedSet would still be too slow...

Anyway... I had this feature in my patch ;)

Thanks!

Kevin



Re: DocumentWriter, StopFilter should use HashMap... (patch)

Posted by "Kevin A. Burton" <bu...@newsmonster.org>.
Erik Hatcher wrote:

> Kevin - I've made this change and committed it, using a Set.
>
> Let me know if there are any issues with what I've committed - I 
> believe I've faithfully preserved backwards compatibility.
>
Great... I'll take a look!

> p.s. ...
>
> On Mar 9, 2004, at 2:00 PM, Kevin A. Burton wrote:
>
>>   public StopFilter(TokenStream in, Hashtable stopTable) {
>>     super(in);
>>     map = new HashMap();
>>
>>     Enumeration keys = stopTable.keys();
>>     while ( keys.hasMoreElements() ) {
>>         Object key = keys.nextElement();
>>         map.put( key, stopTable.get( key ) );
>>     }
>
>
> By the way, the ctor to HashMap can take a Map, which Hashtable is 
> also :))
>
Crap... good point.. Actually that was the FIRST thing I checked but my 
javadoc index wasn't up to date... long story.  Actually I was pissed to 
find out that it didn't implement a map interface...  :)



Re: DocumentWriter, StopFilter should use HashMap... (patch)

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Kevin - I've made this change and committed it, using a Set.

Let me know if there are any issues with what I've committed - I 
believe I've faithfully preserved backwards compatibility.

	Erik

p.s. ...

On Mar 9, 2004, at 2:00 PM, Kevin A. Burton wrote:
>   public StopFilter(TokenStream in, Hashtable stopTable) {
>     super(in);
>     map = new HashMap();
>
>     Enumeration keys = stopTable.keys();
>     while ( keys.hasMoreElements() ) {
>         Object key = keys.nextElement();
>         map.put( key, stopTable.get( key ) );
>     }

By the way, the ctor to HashMap can take a Map, which Hashtable is also 
:))
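
To illustrate the point, here are the two ways of copying side by side, in modern Java with generics. The class and method names are made up for the example; HashMap's copy constructor and Hashtable.keys() are standard JDK.

```java
import java.util.Enumeration;
import java.util.HashMap;
import java.util.Hashtable;
import java.util.Map;

// Hypothetical helper showing two equivalent ways to copy a Hashtable.
final class HashtableCopySketch {
    // The manual loop from the quoted patch.
    static Map<String, String> copyByLoop(Hashtable<String, String> stopTable) {
        Map<String, String> map = new HashMap<String, String>();
        Enumeration<String> keys = stopTable.keys();
        while (keys.hasMoreElements()) {
            String key = keys.nextElement();
            map.put(key, stopTable.get(key));
        }
        return map;
    }

    // Hashtable implements Map, so the HashMap copy constructor accepts it.
    static Map<String, String> copyByConstructor(Hashtable<String, String> stopTable) {
        return new HashMap<String, String>(stopTable);
    }
}
```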




Re: DocumentWriter, StopFilter should use HashMap... (patch)

Posted by "Kevin A. Burton" <bu...@newsmonster.org>.
Doug Cutting wrote:

> Erik Hatcher wrote:
>
>> Well, one issue you didn't consider is changing a public method 
>> signature.  I will make this change, but leave the Hashtable 
>> signature method there.  I suppose we could change the signature to 
>> use a Map instead, but I believe there are some issues with doing 
>> something like this if you do not recompile your own source code 
>> against a new Lucene JAR.... so I will simply provide another 
>> signature too.
>
>
> This would no longer compile with the change Kevin proposes.
>
> To make things back-compatible we must:
>
> 1. Keep but deprecate StopFilter(Hashtable) constructor;
> 2. Keep but deprecate StopFilter.makeStopTable(String[]);
> 3. Add a new constructor: StopFilter(HashMap);
> 4. Add a new method: StopFilter.makeStopMap(String[]);
>
> Does that make sense?
>
This patch and attachment take care of this problem... 

It does make this class more complex than it needs to be... but 1/2 of 
the methods are deprecated.

Kevin



Re: DocumentWriter, StopFilter should use HashMap... (patch)

Posted by "Kevin A. Burton" <bu...@newsmonster.org>.
Doug Cutting wrote:

> Erik Hatcher wrote:
>
>> Well, one issue you didn't consider is changing a public method 
>> signature.  I will make this change, but leave the Hashtable 
>> signature method there.  I suppose we could change the signature to 
>> use a Map instead, but I believe there are some issues with doing 
>> something like this if you do not recompile your own source code 
>> against a new Lucene JAR.... so I will simply provide another 
>> signature too.
>
>
> This is also a problem for folks who're implementing analyzers which 
> use StopFilter.  For example:
>
> public class MyAnalyzer extends Analyzer {
>
>   private static Hashtable stopTable =
>     StopFilter.makeStopTable(stopWords);
>
>   public TokenStream tokenStream(String field, Reader reader) {
>     ... new StopFilter(stopTable) ...
>   }
>
> }
>
> This would no longer compile with the change Kevin proposes.
>
> To make things back-compatible we must:
>
> 1. Keep but deprecate StopFilter(Hashtable) constructor;
> 2. Keep but deprecate StopFilter.makeStopTable(String[]);
> 3. Add a new constructor: StopFilter(HashMap);
> 4. Add a new method: StopFilter.makeStopMap(String[]);
>
> Does that make sense?

Ah... ok... good point.  If no one does this I'll take care of it...

Kevin



Re: DocumentWriter, StopFilter should use HashMap... (patch)

Posted by Doug Cutting <cu...@apache.org>.
Erik Hatcher wrote:
> Well, one issue you didn't consider is changing a public method 
> signature.  I will make this change, but leave the Hashtable signature 
> method there.  I suppose we could change the signature to use a Map 
> instead, but I believe there are some issues with doing something like 
> this if you do not recompile your own source code against a new Lucene 
> JAR.... so I will simply provide another signature too.

This is also a problem for folks who're implementing analyzers which use 
StopFilter.  For example:

public class MyAnalyzer extends Analyzer {

   private static Hashtable stopTable =
     StopFilter.makeStopTable(stopWords);

   public TokenStream tokenStream(String field, Reader reader) {
     ... new StopFilter(stopTable) ...
   }

}

This would no longer compile with the change Kevin proposes.

To make things back-compatible we must:

1. Keep but deprecate StopFilter(Hashtable) constructor;
2. Keep but deprecate StopFilter.makeStopTable(String[]);
3. Add a new constructor: StopFilter(HashMap);
4. Add a new method: StopFilter.makeStopMap(String[]);

Does that make sense?
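
The four steps could look roughly like the sketch below. It is a stand-in written with generics, not the real Lucene class: the name StopFilterCompatSketch and the isStopWord helper are hypothetical, and the TokenStream argument is omitted to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Hashtable;

// Hypothetical stand-in for StopFilter showing old and new entry points.
final class StopFilterCompatSketch {
    private final HashMap<String, String> table;

    /** @deprecated kept so existing callers still compile; use the HashMap constructor. */
    @Deprecated
    StopFilterCompatSketch(Hashtable<String, String> stopTable) {
        // Copy once into an unsynchronized map.
        this(new HashMap<String, String>(stopTable));
    }

    // Step 3: the new constructor.
    StopFilterCompatSketch(HashMap<String, String> stopMap) {
        this.table = stopMap;
    }

    /** @deprecated use makeStopMap instead. */
    @Deprecated
    static Hashtable<String, String> makeStopTable(String[] stopWords) {
        Hashtable<String, String> stopTable = new Hashtable<String, String>(stopWords.length);
        for (String word : stopWords) {
            stopTable.put(word, word);
        }
        return stopTable;
    }

    // Step 4: the new factory method.
    static HashMap<String, String> makeStopMap(String[] stopWords) {
        HashMap<String, String> stopMap = new HashMap<String, String>(stopWords.length);
        for (String word : stopWords) {
            stopMap.put(word, word);
        }
        return stopMap;
    }

    boolean isStopWord(String term) {
        return table.containsKey(term);
    }
}
```

Code like MyAnalyzer above keeps compiling against the deprecated pair and gets a deprecation warning, while new code can move to the HashMap pair.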

Doug



Re: DocumentWriter, StopFilter should use HashMap... (patch)

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Well, one issue you didn't consider is changing a public method 
signature.  I will make this change, but leave the Hashtable signature 
method there.  I suppose we could change the signature to use a Map 
instead, but I believe there are some issues with doing something like 
this if you do not recompile your own source code against a new Lucene 
JAR.... so I will simply provide another signature too.

	Erik


On Mar 9, 2004, at 4:15 AM, Kevin A. Burton wrote:

> Erik Hatcher wrote:
>
>> I don't see any reason for this to be a Hashtable.
>>
>> It seems an acceptable alternative to not share analyzer/filter  
>> instances across threads - they don't really take up much space, so 
>> is  there a reason to share them?  Or I'm guessing you're sharing it  
>> implicitly through an IndexWriter, huh?
>>
>> I'll await further feedback before committing this change, but seems
>> reasonable to me.
>>
> Yeah... I'm using a RAMDirectory and adding documents to it across 
> multiple threads... some of them index at the same time.
>
> The patch is super small... the only difference is that it's using a 
> HashMap which isn't synchronized... it can't hurt anything...
>
> but feedback is a good thing :)
>
> Kevin
>




Re: DocumentWriter, StopFilter should use HashMap... (patch)

Posted by "Kevin A. Burton" <bu...@newsmonster.org>.
Erik Hatcher wrote:

> I don't see any reason for this to be a Hashtable.
>
> It seems an acceptable alternative to not share analyzer/filter  
> instances across threads - they don't really take up much space, so 
> is  there a reason to share them?  Or I'm guessing you're sharing it  
> implicitly through an IndexWriter, huh?
>
> I'll await further feedback before committing this change, but seems
> reasonable to me.
>
Yeah... I'm using a RAMDirectory and adding documents to it across 
multiple threads... some of them index at the same time.

The patch is super small... the only difference is that it's using a 
HashMap which isn't synchronized... it can't hurt anything...
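
A rough sketch of the usage pattern being described: the stop words are built once, never modified afterwards, and then read from several threads without any lock. The class and method names are hypothetical, and this says nothing about actual Lucene indexing speed. Note that the "can't hurt" part relies on the collection really being read-only after it is published; concurrent writes to a HashMap or HashSet would not be safe.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example: many threads reading one immutable stop-word set.
final class ReadOnlyStopSetSketch {
    // Built once in a static final field and never modified afterwards.
    private static final Set<String> STOP_WORDS = Collections.unmodifiableSet(
        new HashSet<String>(Arrays.asList("a", "an", "the")));

    // Counts tokens that are not stop words; lookups take no lock.
    static int countNonStopWords(String[] tokens) {
        int kept = 0;
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                kept++;
            }
        }
        return kept;
    }

    // Runs the same count on several threads and sums the results.
    static int countAcrossThreads(final String[] tokens, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<Future<Integer>>();
            for (int i = 0; i < threads; i++) {
                Callable<Integer> task = () -> countNonStopWords(tokens);
                results.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get();
            }
            return total;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```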

but feedback is a good thing :)

Kevin



Re: DocumentWriter, StopFilter should use HashMap... (patch)

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
I don't see any reason for this to be a Hashtable.

It seems an acceptable alternative to not share analyzer/filter  
instances across threads - they don't really take up much space, so is  
there a reason to share them?  Or I'm guessing you're sharing it  
implicitly through an IndexWriter, huh?

I'll await further feedback before committing this change, but seems
reasonable to me.

	Erik


On Mar 8, 2004, at 8:50 PM, Kevin A. Burton wrote:
> I'm looking at StopFilter.java right now...
>
> I did a kill -3 java and a number of my threads were blocked here:
>
> "ksa-task-thread-34" prio=1 tid=0xad89fbe8 nid=0x1c6e waiting for monitor entry [b9bff000..b9bff8d0]
>        at java.util.Hashtable.get(Hashtable.java:332)
>        - waiting to lock <0x61569720> (a java.util.Hashtable)
>        at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:94)
>        at org.apache.lucene.index.DocumentWriter.invertDocument(DocumentWriter.java:170)
>        at org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:111)
>        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:257)
>        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:244)
>        at ksa.index.AdvancedIndexWriter.addDocument(AdvancedIndexWriter.java:136)
>        at ksa.robot.FeedTaskParserListener.onItemEnd(FeedTaskParserListener.java:331)
>
> Is there ANY reason to keep this as a Hashtable?  It's just preventing  
> inversion across multiple threads.  They all have to lock on this  
> hashtable.
>
> Note that this guy is initialized ONCE and no more puts take place so  
> I don't see why not.  It's readonly after the StopFilter is created.
>
> I think this might really end up speeding up indexing a bit.  No hard  
> benchmarks yet though.  Right now though it's just an inefficiency  
> that should be removed.
>
> I've attached a quick implementation.
> Kevin
>
>
> package org.apache.lucene.analysis;
>
>
> import java.io.IOException;
> import java.util.*;
>
> /** Removes stop words from a token stream. */
>
> public final class StopFilter extends TokenFilter {
>
>   //Note: this could migrate to using a HashSet
>   private HashMap table;
>
>   /** Constructs a filter which removes words from the input
>     TokenStream that are named in the array of words. */
>   public StopFilter(TokenStream in, String[] stopWords) {
>     super(in);
>     table = makeStopTable(stopWords);
>   }
>
>   /** Constructs a filter which removes words from the input
>     TokenStream that are named in the HashMap. */
>   public StopFilter(TokenStream in, HashMap stopTable) {
>     super(in);
>     table = stopTable;
>   }
>
>   /** Builds a HashMap from an array of stop words, appropriate for passing
>     into the StopFilter constructor.  This permits this table construction to
>     be cached once when an Analyzer is constructed. */
>   public static final HashMap makeStopTable(String[] stopWords) {
>       HashMap stopTable = new HashMap(stopWords.length);
>
>       for (int i = 0; i < stopWords.length; i++)
>           stopTable.put(stopWords[i], stopWords[i]);
>
>       return stopTable;
>   }
>
>   /** Returns the next input Token whose termText() is not a stop word. */
>   public final Token next() throws IOException {
>     // return the first non-stop word found
>     for (Token token = input.next(); token != null; token = input.next())
>       if (table.get(token.termText) == null)
>         return token;
>     // reached EOS -- return null
>     return null;
>   }
> }




Re: DocumentWriter, StopFilter should use HashMap... (patch)

Posted by "Kevin A. Burton" <bu...@newsmonster.org>.
Otis Gospodnetic wrote:

>I really don't think this will make any noticeable difference, but why
>not.  Could you please send a diff -uN patch?
>I made the same changes locally about a year ago, but have since thrown
>away my local changes (for no good reason that I recall).
>  
>
Just diff it locally... it's just a search replace for Hashtable -> 
HashMap...

Pretty trivial.

Kevin



Re: DocumentWriter, StopFilter should use HashMap... (patch)

Posted by Otis Gospodnetic <ot...@yahoo.com>.
I really don't think this will make any noticeable difference, but why
not.  Could you please send a diff -uN patch?
I made the same changes locally about a year ago, but have since thrown
away my local changes (for no good reason that I recall).

Thanks,
Otis



