You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by Bernhard Messer <bm...@apache.org> on 2005/06/06 19:24:59 UTC

Re: svn-commit: 168449 FSDirectory

Hi Daniel,

i just had a look at the new implementation that FSDirectory deletes 
lucene related files only. I like the patch, but i think we left some 
room for optimization. In the current implementation, it's necessary to 
run thru all known Lucene extensions (13 for the moment), for each call 
of "LuceneFileFilter.accept()". If creating an index in a directory 
which contains several hundred files, this definitly will be a 
bottleneck. So creating a new Index in a directory containing 100 files, 
we will endup with 1300 calls to
"if (name.endsWith("."+IndexReader.FILENAME_EXTENSIONS[i]))" which 
always needs to create a new StringBuffer to merge the two strings.

Therefore i would like to propose two changes:
1) we should store the extension in a hash and not in String[] to have a 
faster lookup
2) check for the file extension only without using the "."

any thoughts

Bernhard


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: svn-commit: 168449 FSDirectory

Posted by Chris Nokleberg <ch...@sixlegs.com>.
On Tue, 07 Jun 2005 09:13:10 +0200, Bernhard Messer wrote:
> exactly, that's what i had in mind. I know that we have to allocate a 
> new string object for every call, but this must be much cheaper than the 
> current implementation which has to walk thru the whole array every time 
> the method is called.

If you're still concerned about performance you can use a "trie" data
structure, which takes a little effort to build but can match very quickly
(with no allocations). I've attached an example implementation that works
for filename extensions which only use the letters 'a'-'z' (minimally
tested):

  private static final ExtensionMatcher MATCHER = 
    new ExtensionMatcher(new String[]{
      "cfs", "fnm", "fdx", "fdt", "tii",
      "tis", "frq", "prx", "del", "tvx",
      "tvd", "tvf", "tvp",
    });

  MATCHER.match("foo.cfs"); // returns true

BTW, I think it is a bad idea to have FILENAME_EXTENSIONS as a public
array. It being final does not prevent code from changing the contents of
the array. If the extensions must be public I would recommend either an
accessor that returns a copy of the array, or an unmodifiable
collection/set/list.

Chris

////////////////////////////////////////////////////////////////////////

import java.util.*;

public class ExtensionMatcher
{
    private Object[] tree;
    
    public ExtensionMatcher(String[] exts)
    {
        tree = create(Arrays.asList(exts), 0);
    }

    private static Object[] create(List exts, int index)
    {
        Object[] array = new Object[27];
        Map byLetter = new HashMap();
        for (Iterator it = exts.iterator(); it.hasNext();) {
            String ext = (String)it.next();
            int length = ext.length();
            if (length > index) {
                Character c = new Character(ext.charAt(length - 1 - index));
                List list = (List)byLetter.get(c);
                if (list == null)
                    byLetter.put(c, list = new ArrayList());
                list.add(ext);
            } else if (length == index) {
                array[26] = Boolean.TRUE;
            }
        }
        for (Iterator it = byLetter.keySet().iterator(); it.hasNext();) {
            Character c = (Character)it.next();
            char val = c.charValue();
            if (val < 'a' || val > 'z')
                throw new IllegalArgumentException("Extension must be between 'a' and 'z'");
            array[val - 'a'] = create((List)byLetter.get(c), index + 1);
        }
        return array;
    }

    public boolean match(String file)
    {
        int index = 0;
        int length = file.length();
        Object[] array = tree;
        for (;;) {
            if (index >= length)
                return false;
            char val = file.charAt(length - 1 - index);
            if (val == '.' && array[26] != null)
                return true;
            if (val < 'a' || val > 'z')
                return false;
            array = (Object[])array[val - 'a'];
            if (array == null)
                return false;
            index++;
        }
    }
}



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: svn-commit: 168449 FSDirectory

Posted by Bernhard Messer <bm...@apache.org>.
Doug Cutting wrote:

> Bernhard Messer wrote:
>
>> Therefore i would like to propose two changes:
>> 1) we should store the extension in a hash and not in String[] to 
>> have a faster lookup
>
>
> Do you mean to use something like:
>
> String lastDot = name.lastIndexOf('.');
> if (lastDot >= 0) {
>   String nameExt = name.substring(lastDot+1, name.length());
>   if (FILENAME_EXTENSIONS.get(nameExt)) {
>     ...
>   }
> }
>
> That would allocate a new String in each case, which would be 
> substantially faster.  Is that what you meant?

exactly, that's what i had in mind. I know that we have to allocate a 
new string object for every call, but this must be much cheaper than the 
current implementation which has to walk thru the whole array every time 
the method is called.

Bernhard


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: svn-commit: 168449 FSDirectory

Posted by Doug Cutting <cu...@apache.org>.
Bernhard Messer wrote:
> Therefore i would like to propose two changes:
> 1) we should store the extension in a hash and not in String[] to have a 
> faster lookup

Do you mean to use something like:

String lastDot = name.lastIndexOf('.');
if (lastDot >= 0) {
   String nameExt = name.substring(lastDot+1, name.length());
   if (FILENAME_EXTENSIONS.get(nameExt)) {
     ...
   }
}

That would allocate a new String in each case, which would be 
substantially faster.  Is that what you meant?

> 2) check for the file extension only without using the "."

Do you mean something like:

   String ext = IndexReader.FILENAME_EXTENSIONS[i];
   if (name.endsWith(ext)
       && name.length >= ext.length()+1
       && '.' == name.charAt(name.length()-(ext.length()+1))) {
     ...
   }

I don't see how this works with (1) above.  My guess is that (1) alone 
would be fastest, even though it still allocates objects.

Also, as mentioned in another message, I think this class, along with 
other index file name logic, should all move to a single place.

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org