Posted to dev@creadur.apache.org by Robert Burrell Donkin <ro...@blueyonder.co.uk> on 2008/04/28 23:26:47 UTC

Easier New License Addition

at the moment the license readers are hard coded. this just won't scale.
it would be better to be able to read some meta-data linking a header to
a URL and then to a license URL.

i've been trying to think of more efficient ways to parse the headers
when faced with more possible headers.

i've been thinking about creating a specialised tokeniser which strips
an extended set of whitespace characters. this set can either be guessed
from the document MIME type or hard coded (not sure which would be best)
and would include punctuation. conversion to upper case is also
performed. this tokeniser would produce a stream of words. it should be
good enough to ignore words which are too long (>20 characters, say),
which means a word -> number mapping can be used to reduce each word to
a fixed number of longs.
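
a minimal sketch of the tokeniser idea above, in Java. everything in it is
an assumption made for illustration and not existing RAT code: the class
name HeaderTokeniser, the choice to treat every character that is not an
ASCII letter or digit as a separator, and the 6-bits-per-character packing
(36 symbols fit in 6 bits, so 20 characters fit in two longs).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: not part of RAT. */
final class HeaderTokeniser {
    /** Words longer than this are ignored (the "20 characters, say" above). */
    static final int MAX_WORD_LENGTH = 20;

    /** Assumption: only ASCII letters and digits belong to words. */
    static boolean isWordChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }

    /** Splits text into upper-cased words; everything else is a separator. */
    static List<String> tokenise(String text) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isWordChar(c)) {
                current.append(Character.toUpperCase(c));
            } else {
                flush(current, words);
            }
        }
        flush(current, words);
        return words;
    }

    /** Emits the pending word unless it is empty or too long, then resets the buffer. */
    private static void flush(StringBuilder current, List<String> words) {
        if (current.length() > 0 && current.length() <= MAX_WORD_LENGTH) {
            words.add(current.toString());
        }
        current.setLength(0);
    }

    /**
     * Packs an upper-case word of at most 20 characters into two longs,
     * 6 bits per character and 10 characters per long.
     * Codes: '0'..'9' map to 1..10, 'A'..'Z' map to 11..36, 0 is padding.
     */
    static long[] pack(String word) {
        if (word.length() > MAX_WORD_LENGTH) {
            throw new IllegalArgumentException("word too long: " + word);
        }
        long[] packed = new long[2];
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            long code = (c <= '9') ? (c - '0' + 1) : (c - 'A' + 11);
            packed[i / 10] |= code << (6 * (i % 10));
        }
        return packed;
    }
}
```

with this sketch, tokenise("Licensed under the Apache License, Version 2.0")
yields the words LICENSED UNDER THE APACHE LICENSE VERSION 2 0, and two
words are equal exactly when both of their packed longs are equal.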

for each license, generate a state machine upon initialisation. some
limited ability to handle simple regexes ('?', meaning one or none, would
be enough to start with) would be needed to cope with license families.
it should be possible to use bitwise operations to compare words with
the words in the license.
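
a rough sketch of what such a per-license state machine could look like,
again in Java. note the swap: rather than comparing packed words bit by
bit, it keeps the set of active states in a single long and advances it
with bitwise operations (a Shift-And style simulation), looking words up
in a map. the class name LicenseMatcher, the pattern syntax (a trailing
'?' marks an optional word) and the 62-word limit are all assumptions
made for this example, not existing RAT code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: not part of RAT. */
final class LicenseMatcher {
    /** For each pattern word, a bit mask of the positions where it occurs. */
    private final Map<String, Long> positionsByWord = new HashMap<>();
    /** Bit i set means the pattern word at position i may be skipped. */
    private long optional;
    private final int length;

    /** Builds the machine once, e.g. from "APACHE LICENSE VERSION? 2 0". */
    LicenseMatcher(String pattern) {
        String[] parts = pattern.trim().split("\\s+");
        if (parts.length > 62) {
            throw new IllegalArgumentException("pattern too long for one long");
        }
        length = parts.length;
        for (int i = 0; i < length; i++) {
            String word = parts[i];
            if (word.endsWith("?")) {
                optional |= 1L << i;
                word = word.substring(0, word.length() - 1);
            }
            positionsByWord.merge(word, 1L << i, (a, b) -> a | b);
        }
    }

    /** If state i is active and word i is optional, state i + 1 is active too. */
    private long closure(long states) {
        for (int i = 0; i < length; i++) {
            if ((states & optional & (1L << i)) != 0) {
                states |= 1L << (i + 1);
            }
        }
        return states;
    }

    /** Returns true if the pattern occurs anywhere in the word stream. */
    boolean matches(List<String> words) {
        long accept = 1L << length;
        // State bit i means "the first i pattern words are consumed or skipped".
        long states = closure(1L);
        for (String word : words) {
            long mask = positionsByWord.getOrDefault(word, 0L);
            // Active states that see their expected word advance by one;
            // the start state is re-added so a match can begin at any word.
            states = closure(((states & mask) << 1) | 1L);
            if ((states & accept) != 0) {
                return true;
            }
        }
        return (states & accept) != 0;
    }
}
```

a matcher built from "LICENSED UNDER THE APACHE LICENSE VERSION? 2 0"
accepts the word stream with or without VERSION, which is the kind of
license family variation the '?' is meant to cover. each input word costs
one map lookup plus a few bitwise operations and a short loop over the
pattern positions; whether that beats java's regex in practice would
need measuring.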

sounds complex, i know. is it likely to be faster than java's regex?

opinions?

- robert

Re: Easier New License Addition

Posted by Jochen Wiedmann <jo...@gmail.com>.

On Mon, Apr 28, 2008 at 11:26 PM, Robert Burrell Donkin
<ro...@blueyonder.co.uk> wrote:

>  sounds complex, i know. is it likely to be faster than java's regex?

I am sure that you'd be able to get something that is faster than
java's regex. But I doubt that it will be maintainable. I'd vote for
sticking to regexes. IMO, RAT is not required to be really fast.
People who are sensitive to performance will disable it for the daily
build and enable it for release builds only.

Jochen


-- 
Look, that's why there's rules, understand? So that you think before
you break 'em.

 -- (Terry Pratchett, Thief of Time)