Posted to dev@creadur.apache.org by Robert Burrell Donkin <ro...@blueyonder.co.uk> on 2013/07/12 21:26:22 UTC

RAT: IHeaderMatcher Design

Rat spends a lot of effort parsing textual documents, looking for 
headers and boilerplate text. There's an extension point (of sorts) for 
the searches that can be performed, provided by IHeaderMatcher[1].

This interface has a few TODOs in it. It's used by pushing the text in 
one line at a time, after doing some pre-processing. As the TODO 
indicates, this may not be the most elegant design.
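
To make the push model concrete, here is a rough sketch in Java. 
LineMatcher is a simplified stand-in invented for this illustration (no 
Document argument, no checked exception), and AccumulatingMatcher, prune 
and Driver are hypothetical names too, not Rat classes. It only shows the 
shape of "reset, then push one pre-processed line at a time".

```java
import java.util.List;

/** Simplified stand-in for IHeaderMatcher (hypothetical, for illustration only). */
interface LineMatcher {
    void reset();
    boolean match(String line);
}

/** Accumulates pushed lines and reports a hit once the wanted text has been seen. */
final class AccumulatingMatcher implements LineMatcher {
    private final String wanted;
    private final StringBuilder seen = new StringBuilder();

    AccumulatingMatcher(String wanted) {
        this.wanted = prune(wanted);
    }

    /** Assumed pre-processing: keep only letters and digits, lower-cased. */
    static String prune(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                out.append(Character.toLowerCase(c));
            }
        }
        return out.toString();
    }

    @Override
    public void reset() {
        seen.setLength(0);
    }

    @Override
    public boolean match(String line) {
        // Each call adds the new line to the text accumulated so far.
        seen.append(prune(line));
        return seen.indexOf(wanted) >= 0;
    }
}

final class Driver {
    /** Pushes lines one at a time until the matcher reports a hit. */
    static boolean scan(LineMatcher matcher, List<String> lines) {
        matcher.reset();
        for (String line : lines) {
            if (matcher.match(line)) {
                return true;
            }
        }
        return false;
    }
}
```

Note that the matcher itself has to hold the accumulated state and do 
the text comparison, which is the part that feels awkward.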

As an extension point, IHeaderMatcher has the advantage of flexibility. 
It would be possible to plug in radically different implementations. It 
turns out, though, that few clever new implementations have emerged. All 
the implementations seem to do is check for license headers.

One disadvantage of this arrangement is that it pushes some of the 
parsing outwards toward supposedly pluggable implementations. This means 
that each new license brings its own partial parser.

I wonder whether it might be more intuitive (as well as opening 
potential for faster parsing) to use immutable domain objects for 
licenses and so on, making them data rather than processors.
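
A minimal sketch of what "data rather than processors" could look like, 
in Java. LicenseData and SharedScanner are hypothetical names, and the 
whitespace squashing is an assumption standing in for whatever 
pre-processing Rat really does. The point is only that a new license 
becomes a new instance, not a new class with parsing code.

```java
import java.util.List;
import java.util.Locale;

/** Immutable description of a license: pure data, no parsing logic. */
final class LicenseData {
    private final String id;
    private final String headerText;

    LicenseData(String id, String headerText) {
        this.id = id;
        this.headerText = headerText;
    }

    String id() { return id; }

    String headerText() { return headerText; }
}

/** One shared processor that works for every LicenseData instance. */
final class SharedScanner {
    /** Collapses runs of whitespace and lower-cases the text. */
    static String squash(String text) {
        return text.replaceAll("\\s+", " ").trim().toLowerCase(Locale.ROOT);
    }

    /** Returns the id of the first license whose header occurs in the text, or null. */
    static String identify(String documentText, List<LicenseData> known) {
        String haystack = squash(documentText);
        for (LicenseData license : known) {
            if (haystack.contains(squash(license.headerText()))) {
                return license.id();
            }
        }
        return null;
    }
}
```

Because the license objects never change, one scanner could in 
principle pre-compute a combined search structure over all of them, 
which is where the potential for faster parsing comes from.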

Opinions...? Alternatives...?

Robert

[1]
/**
* Resets this matches.
* Subsequent calls to {@link #match} will accumulate new text.
*/
public void reset();

/**
* Matches the text accumulated to licenses.
* TODO probably a poor design choice - hope to fix later
* @param subject TODO
* @param line next line of text, not null
* @return TODO
*/
public boolean match(Document subject, String line)
        throws RatHeaderAnalysisException;

http://svn.apache.org/viewvc/creadur/rat/trunk/apache-rat-core/src/main/java/org/apache/rat/analysis/IHeaderMatcher.java?revision=1396305&view=markup

Re: RAT: IHeaderMatcher Design

Posted by Robert Burrell Donkin <ro...@blueyonder.co.uk>.
On 07/13/13 18:52, P. Ottlinger wrote:
> Hi *,
>
> thanks for raising the issue.
>
> On 12.07.2013 21:26, Robert Burrell Donkin wrote:
>> I wonder whether it might be more intuitive (as well as opening
>> potential for faster parsing) to use immutable domain objects for
>> licenses and so on, making them data rather than processors.
>
> +1
>
> Licences should be data objects.

+1

> What about adding a parserFactory whose default implementation is the
> line-based parser, and adding a parser to the licence data objects?
>
> A matcher would then contain a pair of licence data object and parser.
>
> This would reduce the amount of duplication since most licences use the
> default parser.

Text files are searched for a header, boilerplate or complete license text.

(Here's my understanding of the domain. Please jump in with questions, 
clarifications or to correct any misunderstanding...)

The parser needs to be smart enough to strip out not only white space 
but also the appropriate comment indicators (for example '*') before 
searching for the text.
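
As a rough illustration of that stripping step, here is a Java sketch. 
HeaderNormalizer is a hypothetical name, and the set of comment markers 
it recognises is an assumption chosen for the example, not the list Rat 
actually handles.

```java
import java.util.List;

final class HeaderNormalizer {
    /** Removes an assumed set of comment markers from the start and end of one line. */
    static String stripCommentMarkers(String line) {
        String s = line.trim();
        // Leading markers such as "/*", "*/", "//", "*", "#" and "<!--".
        s = s.replaceFirst("^(/\\*+|\\*+/|//+|\\*+|#+|<!--|-->)", "");
        // Trailing block-comment terminators such as "*/" and "-->".
        s = s.replaceFirst("(\\*+/|-->)$", "");
        return s.trim();
    }

    /** Joins the cleaned, non-empty lines with single spaces. */
    static String normalize(List<String> lines) {
        StringBuilder out = new StringBuilder();
        for (String line : lines) {
            String cleaned = stripCommentMarkers(line).replaceAll("\\s+", " ");
            if (cleaned.isEmpty()) {
                continue;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(cleaned);
        }
        return out.toString();
    }
}
```

Once the text is normalised like this, the search itself no longer 
needs to know which comment syntax the file used.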

The headers, boilerplates and licenses may include some parameterized 
text. For example, to use some BSD licenses, the name of the 
organisation issuing the license needs to be included.
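
One way such a parameter could be handled is to treat the header as a 
template and turn it into a regular expression. The Java sketch below 
assumes a "${owner}" placeholder syntax and a ParameterizedHeader class, 
both invented for this example. The sample wording in the usage note is 
only loosely modelled on a BSD clause.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ParameterizedHeader {
    /** Hypothetical placeholder syntax for the variable part of a header. */
    private static final String PLACEHOLDER = "${owner}";

    private final Pattern pattern;

    ParameterizedHeader(String template) {
        int at = template.indexOf(PLACEHOLDER);
        if (at < 0) {
            throw new IllegalArgumentException("template has no " + PLACEHOLDER);
        }
        String before = template.substring(0, at);
        String after = template.substring(at + PLACEHOLDER.length());
        // Literal parts are quoted; the placeholder becomes a lazy capture group.
        this.pattern = Pattern.compile(
            Pattern.quote(before) + "(.+?)" + Pattern.quote(after));
    }

    /** Returns the text found in place of the placeholder, or null if the header is absent. */
    String extractOwner(String normalizedText) {
        Matcher m = pattern.matcher(normalizedText);
        return m.find() ? m.group(1) : null;
    }
}
```

With the template "Neither the name of ${owner} nor the names of its 
contributors", a file naming "Example Org" in that position would match 
and yield "Example Org" as the parameter value.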

Each license may have several different headers that all indicate that 
the file is issued under that license. For example, the standard Apache 
Software Foundation boilerplate is different from the recommended text 
for application by individuals. Some policies may need to distinguish 
between these cases, but most will be more interested in the license.
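
The one-license, many-headers relationship might be modelled along these 
lines. This is a Java sketch under my own assumptions: HeaderVariant, 
LicenseFamily and Policies are hypothetical names, and the text is 
assumed to be normalised already.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** One recognised wording of a header (hypothetical type). */
final class HeaderVariant {
    private final String variantId;
    private final String text;

    HeaderVariant(String variantId, String text) {
        this.variantId = variantId;
        this.text = text;
    }

    String variantId() { return variantId; }

    String text() { return text; }
}

/** A license together with every header wording that indicates it. */
final class LicenseFamily {
    private final String licenseId;
    private final List<HeaderVariant> variants;

    LicenseFamily(String licenseId, List<HeaderVariant> variants) {
        this.licenseId = licenseId;
        // Defensive copy keeps the object immutable.
        this.variants = Collections.unmodifiableList(new ArrayList<>(variants));
    }

    String licenseId() { return licenseId; }

    /** Returns the first variant found in the text, or null if none occurs. */
    HeaderVariant findVariant(String normalizedText) {
        for (HeaderVariant v : variants) {
            if (normalizedText.contains(v.text())) {
                return v;
            }
        }
        return null;
    }
}

final class Policies {
    /** Typical policy: any header of the license is acceptable. */
    static boolean anyHeader(LicenseFamily family, String text) {
        return family.findVariant(text) != null;
    }

    /** Stricter policy: only one particular wording is acceptable. */
    static boolean onlyVariant(LicenseFamily family, String text, String requiredVariantId) {
        HeaderVariant found = family.findVariant(text);
        return found != null && found.variantId().equals(requiredVariantId);
    }
}
```

A match result that carries both the license and the variant lets the 
policy decide how much of that detail it cares about.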

Robert

Re: RAT GSOC: [WAS Re: RAT: IHeaderMatcher Design]

Posted by Robert Burrell Donkin <ro...@blueyonder.co.uk>.
On 07/27/13 18:49, P. Ottlinger wrote:
> Hi,
>
> On 27.07.2013 09:42, Robert Burrell Donkin wrote:
>> I'm inclined towards factoring out a separate Maven module for our new
>> domain objects. If this turns out to be a mistake, we can always combine
>> it back in later. Potentially, this will allow legacy stuff to be kept
>> around for a while.
>
> +1
>
> Could be some sort of creadur-api module containing beans and services,
> while the other modules provide implementations for CLI/mvn and so on.

+1

core is now too large to fill that role well.

The ant tasks and maven plugins depend on core, and moving command line 
stuff from core to a cli module sounds good. Code could be relocated 
gradually from core to the new api module and (perhaps) into new 
implementation modules.

Robert

Re: RAT GSOC: [WAS Re: RAT: IHeaderMatcher Design]

Posted by "P. Ottlinger" <po...@aiki-it.de>.
Hi,

On 27.07.2013 09:42, Robert Burrell Donkin wrote:
> I'm inclined towards factoring out a separate Maven module for our new
> domain objects. If this turns out to be a mistake, we can always combine
> it back in later. Potentially, this will allow legacy stuff to be kept
> around for a while.

+1

Could be some sort of creadur-api module containing beans and services,
while the other modules provide implementations for CLI/mvn and so on.

Phil

RAT GSOC: [WAS Re: RAT: IHeaderMatcher Design]

Posted by Robert Burrell Donkin <ro...@blueyonder.co.uk>.
On 07/13/13 18:52, P. Ottlinger wrote:

<snip>

> Licences should be data objects.
>
> What about adding a parserFactory whose default implementation is the
> line-based parser, and adding a parser to the licence data objects?
>
> A matcher would then contain a pair of licence data object and parser.
>
> This would reduce the amount of duplication since most licences use the
> default parser.

+1

Being able to do this sort of remodelling is one of my motivations for 
proposing our GSoC project. So, I'm keen to get this moving.

Manuel[2] and I[1] are working on the GSoC stuff over at GitHub. Please 
feel free to dive in and comment or fork. Hopefully, we'll be able to 
start offering patches for trunk at Apache once 0.10 is released.

I'm inclined towards factoring out a separate Maven module for our new 
domain objects. If this turns out to be a mistake, we can always combine 
it back in later. Potentially, this will allow legacy stuff to be kept 
around for a while.

Opinions ...?

Objections ...?

Robert

[1] https://github.com/itstechupnorth/creadur-rat/tree/gsoc
[2] https://github.com/elnuma/creadur-rat/tree/gsoc

Re: RAT: IHeaderMatcher Design

Posted by "P. Ottlinger" <po...@aiki-it.de>.
Hi *,

thanks for raising the issue.

On 12.07.2013 21:26, Robert Burrell Donkin wrote:
> I wonder whether it might be more intuitive (as well as opening
> potential for faster parsing) to use immutable domain objects for
> licenses and so on, making them data rather than processors.

+1

Licences should be data objects.

What about adding a parserFactory whose default implementation is the
line-based parser, and adding a parser to the licence data objects?

A matcher would then contain a pair of licence data object and parser.

This would reduce the amount of duplication since most licences use the
default parser.
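
Roughly what I have in mind, as a Java sketch. All names here 
(HeaderParser, LineBasedParser, ParserFactory, Licence, LicenceMatcher) 
are made up for illustration and nothing like this exists in Rat yet.

```java
import java.util.List;

/** Turns raw lines into the text that gets searched (hypothetical interface). */
interface HeaderParser {
    String parse(List<String> lines);
}

/** Default parser: trims each line and joins the non-empty ones with spaces. */
final class LineBasedParser implements HeaderParser {
    @Override
    public String parse(List<String> lines) {
        StringBuilder out = new StringBuilder();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(trimmed);
        }
        return out.toString();
    }
}

/** Hands out the shared default parser. */
final class ParserFactory {
    private static final HeaderParser DEFAULT = new LineBasedParser();

    private ParserFactory() { }

    static HeaderParser defaultParser() {
        return DEFAULT;
    }
}

/** Licence as a plain immutable data object. */
final class Licence {
    private final String name;
    private final String headerText;

    Licence(String name, String headerText) {
        this.name = name;
        this.headerText = headerText;
    }

    String name() { return name; }

    String headerText() { return headerText; }
}

/** A matcher is just the pair of licence and parser. */
final class LicenceMatcher {
    private final Licence licence;
    private final HeaderParser parser;

    /** Most licences would use this constructor and get the default parser. */
    LicenceMatcher(Licence licence) {
        this(licence, ParserFactory.defaultParser());
    }

    LicenceMatcher(Licence licence, HeaderParser parser) {
        this.licence = licence;
        this.parser = parser;
    }

    boolean matches(List<String> lines) {
        return parser.parse(lines).contains(licence.headerText());
    }
}
```

Only a licence with unusual needs would pass its own parser, so the 
parsing code is written once.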

HTH
Phil