You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@maven.apache.org by Andreas Sewe <se...@st.informatik.tu-darmstadt.de> on 2018/01/09 09:59:06 UTC

Allowed characters in GAV and how/where to sanitize?

Hi Maven developers,

doing a large-scale analysis of Maven Central, I've come across a couple
of "weird" GAVs like this one: groupId=com.knappsack,
artifactId=swagger4spring-web, version=mvn+release:perform [1].

The colon in the version raises the question as to the allowed
characters in the different components of a GAV. AFAICT, a colon in the
version is at least rejected by the
org.eclipse.aether.artifact.DefaultArtifact(String) constructor, so that
seems to be illegal, but DefaultModelValidator doesn't complain. Also,
querying the index of Central returns an
org.apache.maven.index.ArtifactInfo with a version of
"mvn+release:perform" just fine.

What's the best way to handle this?

Should every plug-in that consumes, say, a Maven Index sanitize the results?

Or should this be handled upstream in the repository manager? (Note that
the POM of [1] has a <version> of "mvn release:perform", but the
ArtifactInfo's version is "mvn+release:perform", so some sanitation has
already happened somewhere, probably in Nexus.)

Best wishes,

Andreas

[1]
<http://search.maven.org/#artifactdetails%7Ccom.knappsack%7Cswagger4spring-web%7Cmvn%2Brelease%3Aperform%7Cjar>


Re: Allowed characters in GAV and how/where to sanitize?

Posted by Robert Scholte <rf...@apache.org>.
The org.apache.maven.*project*.validation.ModelValidator is the Maven2  
validator.
In Maven3 it is placed in maven-compat, but from there it calls  
org.apache.maven.model.validation.ModelValidator

The maven-deploy-plugin is still compatible with Maven2, hence it refers  
to the old validator. But when you run it with Maven3, it'll use the  
matching ModelValidator.

It looks like the version has become more strict since Maven 3.1

I have to admit this are interesting analysis, but for those artifacts the  
damage is already done, not much we can do about it anymore.
And it is likely you cannot use them anymore as dependency with the most  
recent Maven version.

thanks,
Robert

On Tue, 09 Jan 2018 15:42:41 +0100, Andreas Sewe  
<se...@st.informatik.tu-darmstadt.de> wrote:

> Hi Stephen,
>
>> In an url path segment space is mapped to +
>
> not quite. This holds only for the query part of an URI which is
> typically encoding according to the application/x-www-form-urlencoded
> scheme. Elsewhere in a URI, e.g., the path component, a space is simply
> percent-encoded a %20.
>
>> The repo manager should be blocking those... likely not doing it’s job.
>
> I agree. IMHO the repo manager should block (if only as a last resort
> for people using something to deploy that doesn't do the check earlier).
>
> That being said, the situation on Maven Central is not that dire; there
> are very few versions in the wild that I consider broken:
>
> Additional quotes:
>
> - "1.0.0
> - '1.0'
>
> CLI trouble:
>
> - mvn+release:perform
> - version=1.6.2.1
>
> Commas instead of dots as separator
>
> - 1,0
>
> Expressions or expression-like constructs:
>
> - ${env.VERSION}
> - ${parent.version}
> - @metro.version@
> - $%7Bcucumber-jvm.version%7D
>
> If you are interested, I have a more complete list (about 30 entries
> overall), together with a histogram of characters used in versions.
> Interestingly, no non-ASCII characters are used, not even in qualifiers.
>
>> We probably should also barf on : in a version. There is validation on
>> artifactId and groupId when last I checked
>
> Different validators barf on different things. The
> org.apache.maven.*project*.validation.DefaultModelValidator used by
> deploy:deploy-file is happy with *any* non-empty version, whereas
> org.apache.maven.*model*.validation.DefaultModelValidator does a bit
> more; in particular, it checks for certain filesystem-unsafe characters,
> including the colon: \ / : " < > | ? *
>
> I don't really know why deploy:deploy-file prefers one ModelValidator
> over the other, though. Is this a bug in the the maven-deploy-plugin?
>
> Best wishes,
>
> Andreas

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@maven.apache.org
For additional commands, e-mail: dev-help@maven.apache.org


Re: Allowed characters in GAV and how/where to sanitize?

Posted by Hervé BOUTEMY <he...@free.fr>.
ok, I suppose this is a reminiscence of Maven 2: yes, in Maven 2, there was 
this limitation that caused multiple issues linked in MNG-3010
https://issues.apache.org/jira/browse/MNG-3010

As Robert Scholte noticed, perhaps there are still a few glitches with some 
plugins here and there, that require precise description to be fixed

It's good to have memory: now, I hope the fix for MNG-3010 is part of our 
common memory :)

Regards,

Hervé

Le mercredi 10 janvier 2018, 23:47:13 CET Fred Cooke a écrit :
> Looks like I'm just plain wrong (and happy about it). I'm not sure where
> that memory came from. Perhaps maven 2 some time ago, though it felt fresh
> in my mind. Apologies for the noise! :-(
> 
> On 11 January 2018 at 11:22, Hervé BOUTEMY <he...@free.fr> wrote:
> > Le mercredi 10 janvier 2018, 10:21:35 CET Andreas Sewe a écrit :
> > > Hervé BOUTEMY wrote:
> > > > notice that Central contains artifacts produced by Maven but also by
> > 
> > other
> > 
> > > > tools: I did some analysis myself and found strange things also that
> > 
> > are
> > 
> > > > clearly not produced by Maven. Scala for example produces some
> > 
> > artifacts
> > 
> > > > that I doubt could be referenced by Maven.
> > > 
> > > Yes, Maven is not the only tool that can deploy artifacts to Central --
> > > and that's a good thing.
> > 
> > +1
> > 
> > > > Then: what do we call "broken"?
> > > > Something that seems "clearly" related to a typo?
> > > > Something that can't be consumed by Maven?
> > > > Something that people who produced the release (with any tooling)
> > > > won't
> > > > consume for syntactic reasons on the result? Something that they won't
> > > > consume for other reasons? (like for example because it's continuous
> > > > deployment and it's the 4th version of the day)
> > > 
> > > I wouldn't go so far to treat version=1.6.2.1 as an illegal version
> > 
> > I was not talking about a version with 4 parts: Maven 3 supports an
> > arbitrary
> > count of parts.
> > I was talking about an artifact that is released 4 times per day, because
> > it's
> > continuous delivery (I suppose): a vast majority of releases are IMHO
> > never
> > used
> > 
> > > (after all, I can image someone using legitimately using a qualifier
> > > scheme like 1.2.3-os=linux), but there are IMHO two cases which I always
> > > consider broken:
> > > 
> > > - Spaces in any of the components of a GAV
> > > - A colon in any of the components of a GAV
> > 
> > +1
> > 
> > > Spaces are just likely to cause trouble for some tool further down the
> > 
> > line.
> > 
> > > And for colons we know that they will cause trouble, being the default
> > > separator for GAVs when written as a single string.
> > > 
> > > Aside from those characters, I would probably just ban a few characters
> > > (non-printable control characters). A bit similar to what XML did with a
> > > its NCName (non-colon name) production [1].
> > 
> > +1
> > 
> > > However, for groudId and artifactId we already have much stricter rules
> > > (A-Z, a-z, 0-9, ., -, _), so the argument can be made that
> > > versions/classifiers/extensions should also be made up of a more limited
> > > character set as well.
> > 
> > +1
> > 
> > > In particular, care should to be taken that the path component can still
> > > be parsed unambiguously, so allowing '.' in a classifier is probably
> > > asking for trouble.
> > 
> > +1 again
> > 
> > > > And what can we do?
> > > > On the past artifacts, removing anything is not really an option:
> > > > IMHO,
> > > > the
> > > > issue does not deserve the effort and to break our base rule about
> > > > inalterability.
> > > > On the future, perhaps we can do something:
> > > > - at Maven level, sure we can and we should improve controls as much
> > > > as
> > > > possible
> > > 
> > > Yes, if only that at this level we can provide the best error messages,
> > > as the error is recognized closest to the user.
> > > 
> > > > - on other build tools: perhaps we should try not only to implement
> > 
> > checks
> > 
> > > > in Maven but also document rules for other tools to implement same
> > 
> > rules
> > 
> > > The Maven Resolver is a great place to enforce some rules in
> > > DefaultArtifact (or whatever replaces it). Granted, not everyone deploys
> > > using the Maven Resolver, but its *the* place that knows about all the
> > > intricacies of the repository layout already.
> > > 
> > > > - on repo managers used by the publishers: same rules documentation
> > > > prerequisite, but other tools target
> > > 
> > > Well, Nexus already has some checks in place, to avoid versions like
> > > "1/../../other-artifact/2". However, groupIds like "org...example" are
> > > still accepted (deployed under org/example).
> > 
> > probably ".." should be forbidden also
> > 
> > > > - on sync to central: this is the only location where some rules can
> > > > be
> > > > checked for absolutely any new artifact then really interesting at a
> > 
> > first
> > 
> > > > glance. But making rules evolve at this level is really hard since
> > 
> > there
> > 
> > > > is no real feedback process I know of when base Central publication
> > 
> > rules
> > 
> > > > are not met. Base Central publication rules were defined from the
> > > > beginning (signature, ...), then are implemented by publishers' repo
> > > > managers. I suppose failed controls done by sync to central (then sync
> > > > blocked) are rare: I'm not sure there is a strong process/tooling. And
> > > > adding it would cost some management: not easy. IMHO, we should start
> > 
> > by
> > 
> > > > first detecting if there are really issues on new artifacts these days
> > > > before trying to take actions at this level.
> > > 
> > > That being said, I think the first step is to document the syntax for
> > > GAVs somewhere (e.g., at [2] or [3]).
> > 
> > +1
> > there is an edit button near the title to find the source and propose a PR
> > 
> > :)
> > 
> > Regards,
> > 
> > Hervé
> > 
> > > Best wishes,
> > > 
> > > Andreas
> > > 
> > > [1] <https://www.w3.org/TR/REC-xml-names/#NT-NCName>
> > > [2] <http://maven.apache.org/pom.html#Maven_Coordinates>
> > > [3] <http://maven.apache.org/ref/3.5.2/maven-model/maven.html>
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@maven.apache.org
> > For additional commands, e-mail: dev-help@maven.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@maven.apache.org
For additional commands, e-mail: dev-help@maven.apache.org


Re: Allowed characters in GAV and how/where to sanitize?

Posted by Fred Cooke <fr...@gmail.com>.
Looks like I'm just plain wrong (and happy about it). I'm not sure where
that memory came from. Perhaps maven 2 some time ago, though it felt fresh
in my mind. Apologies for the noise! :-(

On 11 January 2018 at 11:22, Hervé BOUTEMY <he...@free.fr> wrote:

> Le mercredi 10 janvier 2018, 10:21:35 CET Andreas Sewe a écrit :
> > Hervé BOUTEMY wrote:
> > > notice that Central contains artifacts produced by Maven but also by
> other
> > > tools: I did some analysis myself and found strange things also that
> are
> > > clearly not produced by Maven. Scala for example produces some
> artifacts
> > > that I doubt could be referenced by Maven.
> >
> > Yes, Maven is not the only tool that can deploy artifacts to Central --
> > and that's a good thing.
> +1
>
> >
> > > Then: what do we call "broken"?
> > > Something that seems "clearly" related to a typo?
> > > Something that can't be consumed by Maven?
> > > Something that people who produced the release (with any tooling) won't
> > > consume for syntactic reasons on the result? Something that they won't
> > > consume for other reasons? (like for example because it's continuous
> > > deployment and it's the 4th version of the day)
> >
> > I wouldn't go so far to treat version=1.6.2.1 as an illegal version
> I was not talking about a version with 4 parts: Maven 3 supports an
> arbitrary
> count of parts.
> I was talking about an artifact that is released 4 times per day, because
> it's
> continuous delivery (I suppose): a vast majority of releases are IMHO never
> used
>
> > (after all, I can image someone using legitimately using a qualifier
> > scheme like 1.2.3-os=linux), but there are IMHO two cases which I always
> > consider broken:
> >
> > - Spaces in any of the components of a GAV
> > - A colon in any of the components of a GAV
> +1
>
> >
> > Spaces are just likely to cause trouble for some tool further down the
> line.
> >
> > And for colons we know that they will cause trouble, being the default
> > separator for GAVs when written as a single string.
> >
> > Aside from those characters, I would probably just ban a few characters
> > (non-printable control characters). A bit similar to what XML did with a
> > its NCName (non-colon name) production [1].
> +1
>
> >
> > However, for groudId and artifactId we already have much stricter rules
> > (A-Z, a-z, 0-9, ., -, _), so the argument can be made that
> > versions/classifiers/extensions should also be made up of a more limited
> > character set as well.
> +1
>
> >
> > In particular, care should to be taken that the path component can still
> > be parsed unambiguously, so allowing '.' in a classifier is probably
> > asking for trouble.
> +1 again
>
> >
> > > And what can we do?
> > > On the past artifacts, removing anything is not really an option: IMHO,
> > > the
> > > issue does not deserve the effort and to break our base rule about
> > > inalterability.
> > > On the future, perhaps we can do something:
> > > - at Maven level, sure we can and we should improve controls as much as
> > > possible
> >
> > Yes, if only that at this level we can provide the best error messages,
> > as the error is recognized closest to the user.
> >
> > > - on other build tools: perhaps we should try not only to implement
> checks
> > > in Maven but also document rules for other tools to implement same
> rules
> > The Maven Resolver is a great place to enforce some rules in
> > DefaultArtifact (or whatever replaces it). Granted, not everyone deploys
> > using the Maven Resolver, but its *the* place that knows about all the
> > intricacies of the repository layout already.
> >
> > > - on repo managers used by the publishers: same rules documentation
> > > prerequisite, but other tools target
> >
> > Well, Nexus already has some checks in place, to avoid versions like
> > "1/../../other-artifact/2". However, groupIds like "org...example" are
> > still accepted (deployed under org/example).
> probably ".." should be forbidden also
>
> >
> > > - on sync to central: this is the only location where some rules can be
> > > checked for absolutely any new artifact then really interesting at a
> first
> > > glance. But making rules evolve at this level is really hard since
> there
> > > is no real feedback process I know of when base Central publication
> rules
> > > are not met. Base Central publication rules were defined from the
> > > beginning (signature, ...), then are implemented by publishers' repo
> > > managers. I suppose failed controls done by sync to central (then sync
> > > blocked) are rare: I'm not sure there is a strong process/tooling. And
> > > adding it would cost some management: not easy. IMHO, we should start
> by
> > > first detecting if there are really issues on new artifacts these days
> > > before trying to take actions at this level.
> > That being said, I think the first step is to document the syntax for
> > GAVs somewhere (e.g., at [2] or [3]).
> +1
> there is an edit button near the title to find the source and propose a PR
> :)
>
> Regards,
>
> Hervé
>
> >
> > Best wishes,
> >
> > Andreas
> >
> > [1] <https://www.w3.org/TR/REC-xml-names/#NT-NCName>
> > [2] <http://maven.apache.org/pom.html#Maven_Coordinates>
> > [3] <http://maven.apache.org/ref/3.5.2/maven-model/maven.html>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@maven.apache.org
> For additional commands, e-mail: dev-help@maven.apache.org
>
>

Re: Allowed characters in GAV and how/where to sanitize?

Posted by Hervé BOUTEMY <he...@free.fr>.
Le mercredi 10 janvier 2018, 10:21:35 CET Andreas Sewe a écrit :
> Hervé BOUTEMY wrote:
> > notice that Central contains artifacts produced by Maven but also by other
> > tools: I did some analysis myself and found strange things also that are
> > clearly not produced by Maven. Scala for example produces some artifacts
> > that I doubt could be referenced by Maven.
> 
> Yes, Maven is not the only tool that can deploy artifacts to Central --
> and that's a good thing.
+1

> 
> > Then: what do we call "broken"?
> > Something that seems "clearly" related to a typo?
> > Something that can't be consumed by Maven?
> > Something that people who produced the release (with any tooling) won't
> > consume for syntactic reasons on the result? Something that they won't
> > consume for other reasons? (like for example because it's continuous
> > deployment and it's the 4th version of the day)
> 
> I wouldn't go so far to treat version=1.6.2.1 as an illegal version
I was not talking about a version with 4 parts: Maven 3 supports an arbitrary 
count of parts.
I was talking about an artifact that is released 4 times per day, because it's 
continuous delivery (I suppose): a vast majority of releases are IMHO never 
used

> (after all, I can image someone using legitimately using a qualifier
> scheme like 1.2.3-os=linux), but there are IMHO two cases which I always
> consider broken:
> 
> - Spaces in any of the components of a GAV
> - A colon in any of the components of a GAV
+1

> 
> Spaces are just likely to cause trouble for some tool further down the line.
> 
> And for colons we know that they will cause trouble, being the default
> separator for GAVs when written as a single string.
> 
> Aside from those characters, I would probably just ban a few characters
> (non-printable control characters). A bit similar to what XML did with a
> its NCName (non-colon name) production [1].
+1

> 
> However, for groudId and artifactId we already have much stricter rules
> (A-Z, a-z, 0-9, ., -, _), so the argument can be made that
> versions/classifiers/extensions should also be made up of a more limited
> character set as well.
+1

> 
> In particular, care should to be taken that the path component can still
> be parsed unambiguously, so allowing '.' in a classifier is probably
> asking for trouble.
+1 again

> 
> > And what can we do?
> > On the past artifacts, removing anything is not really an option: IMHO,
> > the
> > issue does not deserve the effort and to break our base rule about
> > inalterability.
> > On the future, perhaps we can do something:
> > - at Maven level, sure we can and we should improve controls as much as
> > possible
> 
> Yes, if only that at this level we can provide the best error messages,
> as the error is recognized closest to the user.
> 
> > - on other build tools: perhaps we should try not only to implement checks
> > in Maven but also document rules for other tools to implement same rules
> The Maven Resolver is a great place to enforce some rules in
> DefaultArtifact (or whatever replaces it). Granted, not everyone deploys
> using the Maven Resolver, but its *the* place that knows about all the
> intricacies of the repository layout already.
> 
> > - on repo managers used by the publishers: same rules documentation
> > prerequisite, but other tools target
> 
> Well, Nexus already has some checks in place, to avoid versions like
> "1/../../other-artifact/2". However, groupIds like "org...example" are
> still accepted (deployed under org/example).
probably ".." should be forbidden also

> 
> > - on sync to central: this is the only location where some rules can be
> > checked for absolutely any new artifact then really interesting at a first
> > glance. But making rules evolve at this level is really hard since there
> > is no real feedback process I know of when base Central publication rules
> > are not met. Base Central publication rules were defined from the
> > beginning (signature, ...), then are implemented by publishers' repo
> > managers. I suppose failed controls done by sync to central (then sync
> > blocked) are rare: I'm not sure there is a strong process/tooling. And
> > adding it would cost some management: not easy. IMHO, we should start by
> > first detecting if there are really issues on new artifacts these days
> > before trying to take actions at this level.
> That being said, I think the first step is to document the syntax for
> GAVs somewhere (e.g., at [2] or [3]).
+1
there is an edit button near the title to find the source and propose a PR :)

Regards,

Hervé

> 
> Best wishes,
> 
> Andreas
> 
> [1] <https://www.w3.org/TR/REC-xml-names/#NT-NCName>
> [2] <http://maven.apache.org/pom.html#Maven_Coordinates>
> [3] <http://maven.apache.org/ref/3.5.2/maven-model/maven.html>



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@maven.apache.org
For additional commands, e-mail: dev-help@maven.apache.org


Re: Allowed characters in GAV and how/where to sanitize?

Posted by Hervé BOUTEMY <he...@free.fr>.
Why do you say that Maven can't handle 4 part versions?

I fixed MNG-3010 near 10 years ago for Maven 3.0.0-alpha-1

this was a versioning limitation removal with enhancement documented in 
https://cwiki.apache.org/confluence/display/MAVENOLD/Versioning

 I can tell you that since this fix, it supports an arbitrary count of parts
(and resolver's GenericVersionScheme is a little evolution of Maven Artifact's 
ComparableVersion)

Regards,

Hervé

Le mercredi 10 janvier 2018, 15:07:08 CET Andreas Sewe a écrit :
> Fred Cooke wrote:
> > Re versions, I know the background on it, but it annoys me that maven
> > can't
> > handle 4 part versions, 1.2.3.4 as sometimes it's handy to do a patch
> > level
> > that deep. Lots of messed up software in the world :-)
> 
> Are you sure that's still the case (the parts-restriction, not the
> messiness of software ;-)?
> 
> At least the Maven Resolver uses a versioning scheme that's quite
> flexible [1]. Not sure if the flexibility at this low level bubbles up
> all the way to the top, though. Maybe one the Maven developers can chime in.
> > Format should be N[.N as many times as needed][optional hyphen and
> > qualifier of some sort] or something like that. Not hard limited to 1 2 or
> > 3 parts.
> 
> AFAICT, that's what GenericVersionScheme does.
> 
> Hope this helps,
> 
> Andreas
> 
> [1]
> <https://github.com/apache/maven-resolver/blob/3fc53c052f538169cb7dc6aa9ed90
> 52514b569ca/maven-resolver-util/src/main/java/org/eclipse/aether/util/versio
> n/GenericVersionScheme.java#L31>



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@maven.apache.org
For additional commands, e-mail: dev-help@maven.apache.org


Re: Allowed characters in GAV and how/where to sanitize?

Posted by Robert Scholte <rf...@apache.org>.
On Wed, 10 Jan 2018 15:07:08 +0100, Andreas Sewe  
<se...@st.informatik.tu-darmstadt.de> wrote:

> Fred Cooke wrote:
>> Re versions, I know the background on it, but it annoys me that maven  
>> can't
>> handle 4 part versions, 1.2.3.4 as sometimes it's handy to do a patch  
>> level
>> that deep. Lots of messed up software in the world :-)

The number of dots is not limited to 2 for a version. A version has some  
banned characters, but that's it.
However, an ArtifactVersion[1] has some explicit/named segments (which I  
personally consider a bad choice in the past). This means there can be  
unexpected results when *comparing* versions.
In general you won't hit such issues as long as you don't use version  
ranges.
Assuming most won't use ranges, I can only think of possible issues with  
enforcer rules like requireUpperBoundDeps[2], but only when having 4 or  
more segments.

thanks,
Robert

[1]  
https://maven.apache.org/ref/3.5.2/maven-artifact/apidocs/org/apache/maven/artifact/versioning/ArtifactVersion.html
[2]  
https://maven.apache.org/enforcer/enforcer-rules/requireUpperBoundDeps.html

>
> Are you sure that's still the case (the parts-restriction, not the
> messiness of software ;-)?
>
> At least the Maven Resolver uses a versioning scheme that's quite
> flexible [1]. Not sure if the flexibility at this low level bubbles up
> all the way to the top, though. Maybe one the Maven developers can chime  
> in.
>
>> Format should be N[.N as many times as needed][optional hyphen and
>> qualifier of some sort] or something like that. Not hard limited to 1 2  
>> or
>> 3 parts.
>
> AFAICT, that's what GenericVersionScheme does.
>
> Hope this helps,
>
> Andreas
>
> [1]
> <https://github.com/apache/maven-resolver/blob/3fc53c052f538169cb7dc6aa9ed9052514b569ca/maven-resolver-util/src/main/java/org/eclipse/aether/util/version/GenericVersionScheme.java#L31>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@maven.apache.org
For additional commands, e-mail: dev-help@maven.apache.org


Re: Allowed characters in GAV and how/where to sanitize?

Posted by Andreas Sewe <se...@st.informatik.tu-darmstadt.de>.
Fred Cooke wrote:
> Re versions, I know the background on it, but it annoys me that maven can't
> handle 4 part versions, 1.2.3.4 as sometimes it's handy to do a patch level
> that deep. Lots of messed up software in the world :-)

Are you sure that's still the case (the parts-restriction, not the
messiness of software ;-)?

At least the Maven Resolver uses a versioning scheme that's quite
flexible [1]. Not sure if the flexibility at this low level bubbles up
all the way to the top, though. Maybe one the Maven developers can chime in.

> Format should be N[.N as many times as needed][optional hyphen and
> qualifier of some sort] or something like that. Not hard limited to 1 2 or
> 3 parts.

AFAICT, that's what GenericVersionScheme does.

Hope this helps,

Andreas

[1]
<https://github.com/apache/maven-resolver/blob/3fc53c052f538169cb7dc6aa9ed9052514b569ca/maven-resolver-util/src/main/java/org/eclipse/aether/util/version/GenericVersionScheme.java#L31>


Re: Allowed characters in GAV and how/where to sanitize?

Posted by Fred Cooke <fr...@gmail.com>.
Re versions, I know the background on it, but it annoys me that maven can't
handle 4 part versions, 1.2.3.4 as sometimes it's handy to do a patch level
that deep. Lots of messed up software in the world :-)

Format should be N[.N as many times as needed][optional hyphen and
qualifier of some sort] or something like that. Not hard limited to 1 2 or
3 parts.

On 10 January 2018 at 22:21, Andreas Sewe <
sewe@st.informatik.tu-darmstadt.de> wrote:

> Hervé BOUTEMY wrote:
> > notice that Central contains artifacts produced by Maven but also by
> other
> > tools: I did some analysis myself and found strange things also that are
> > clearly not produced by Maven. Scala for example produces some artifacts
> that
> > I doubt could be referenced by Maven.
>
> Yes, Maven is not the only tool that can deploy artifacts to Central --
> and that's a good thing.
>
> > Then: what do we call "broken"?
> > Something that seems "clearly" related to a typo?
> > Something that can't be consumed by Maven?
> > Something that people who produced the release (with any tooling) won't
> > consume for syntactic reasons on the result? Something that they won't
> consume
> > for other reasons? (like for example because it's continuous deployment
> and
> > it's the 4th version of the day)
>
> I wouldn't go so far to treat version=1.6.2.1 as an illegal version
> (after all, I can image someone using legitimately using a qualifier
> scheme like 1.2.3-os=linux), but there are IMHO two cases which I always
> consider broken:
>
> - Spaces in any of the components of a GAV
> - A colon in any of the components of a GAV
>
> Spaces are just likely to cause trouble for some tool further down the
> line.
>
> And for colons we know that they will cause trouble, being the default
> separator for GAVs when written as a single string.
>
> Aside from those characters, I would probably just ban a few characters
> (non-printable control characters). A bit similar to what XML did with a
> its NCName (non-colon name) production [1].
>
> However, for groudId and artifactId we already have much stricter rules
> (A-Z, a-z, 0-9, ., -, _), so the argument can be made that
> versions/classifiers/extensions should also be made up of a more limited
> character set as well.
>
> In particular, care should to be taken that the path component can still
> be parsed unambiguously, so allowing '.' in a classifier is probably
> asking for trouble.
>
> > And what can we do?
> > On the past artifacts, removing anything is not really an option: IMHO,
> the
> > issue does not deserve the effort and to break our base rule about
> > inalterability.
> > On the future, perhaps we can do something:
> > - at Maven level, sure we can and we should improve controls as much as
> > possible
>
> Yes, if only that at this level we can provide the best error messages,
> as the error is recognized closest to the user.
>
> > - on other build tools: perhaps we should try not only to implement
> checks in
> > Maven but also document rules for other tools to implement same rules
>
> The Maven Resolver is a great place to enforce some rules in
> DefaultArtifact (or whatever replaces it). Granted, not everyone deploys
> using the Maven Resolver, but its *the* place that knows about all the
> intricacies of the repository layout already.
>
> > - on repo managers used by the publishers: same rules documentation
> > prerequisite, but other tools target
>
> Well, Nexus already has some checks in place, to avoid versions like
> "1/../../other-artifact/2". However, groupIds like "org...example" are
> still accepted (deployed under org/example).
>
> > - on sync to central: this is the only location where some rules can be
> > checked for absolutely any new artifact then really interesting at a
> first
> > glance. But making rules evolve at this level is really hard since there
> is no
> > real feedback process I know of when base Central publication rules are
> not
> > met. Base Central publication rules were defined from the beginning
> (signature,
> > ...), then are implemented by publishers' repo managers. I suppose failed
> > controls done by sync to central (then sync blocked) are rare: I'm not
> sure
> > there is a strong process/tooling. And adding it would cost some
> management:
> > not easy. IMHO, we should start by first detecting if there are really
> issues
> > on new artifacts these days before trying to take actions at this level.
>
> That being said, I think the first step is to document the syntax for
> GAVs somewhere (e.g., at [2] or [3]).
>
> Best wishes,
>
> Andreas
>
> [1] <https://www.w3.org/TR/REC-xml-names/#NT-NCName>
> [2] <http://maven.apache.org/pom.html#Maven_Coordinates>
> [3] <http://maven.apache.org/ref/3.5.2/maven-model/maven.html>
>
>
>

Re: Allowed characters in GAV and how/where to sanitize?

Posted by Andreas Sewe <se...@st.informatik.tu-darmstadt.de>.
Hervé BOUTEMY wrote:
> notice that Central contains artifacts produced by Maven but also by other 
> tools: I did some analysis myself and found strange things also that are 
> clearly not produced by Maven. Scala for example produces some artifacts that 
> I doubt could be referenced by Maven.

Yes, Maven is not the only tool that can deploy artifacts to Central --
and that's a good thing.

> Then: what do we call "broken"?
> Something that seems "clearly" related to a typo?
> Something that can't be consumed by Maven?
> Something that people who produced the release (with any tooling) won't 
> consume for syntactic reasons on the result? Something that they won't consume 
> for other reasons? (like for example because it's continuous deployment and 
> it's the 4th version of the day)

I wouldn't go so far to treat version=1.6.2.1 as an illegal version
(after all, I can image someone using legitimately using a qualifier
scheme like 1.2.3-os=linux), but there are IMHO two cases which I always
consider broken:

- Spaces in any of the components of a GAV
- A colon in any of the components of a GAV

Spaces are just likely to cause trouble for some tool further down the line.

And for colons we know that they will cause trouble, being the default
separator for GAVs when written as a single string.

Aside from those characters, I would probably just ban a few characters
(non-printable control characters). A bit similar to what XML did with a
its NCName (non-colon name) production [1].

However, for groudId and artifactId we already have much stricter rules
(A-Z, a-z, 0-9, ., -, _), so the argument can be made that
versions/classifiers/extensions should also be made up of a more limited
character set as well.

In particular, care should to be taken that the path component can still
be parsed unambiguously, so allowing '.' in a classifier is probably
asking for trouble.

> And what can we do?
> On the past artifacts, removing anything is not really an option: IMHO, the 
> issue does not deserve the effort and to break our base rule about 
> inalterability.
> On the future, perhaps we can do something:
> - at Maven level, sure we can and we should improve controls as much as 
> possible

Yes, if only that at this level we can provide the best error messages,
as the error is recognized closest to the user.

> - on other build tools: perhaps we should try not only to implement checks in 
> Maven but also document rules for other tools to implement same rules

The Maven Resolver is a great place to enforce some rules in
DefaultArtifact (or whatever replaces it). Granted, not everyone deploys
using the Maven Resolver, but its *the* place that knows about all the
intricacies of the repository layout already.

> - on repo managers used by the publishers: same rules documentation 
> prerequisite, but other tools target

Well, Nexus already has some checks in place, to avoid versions like
"1/../../other-artifact/2". However, groupIds like "org...example" are
still accepted (deployed under org/example).

> - on sync to central: this is the only location where some rules can be 
> checked for absolutely any new artifact then really interesting at a first 
> glance. But making rules evolve at this level is really hard since there is no 
> real feedback process I know of when base Central publication rules are not 
> met. Base Central publication rules were defined from the beginning (signature, 
> ...), then are implemented by publishers' repo managers. I suppose failed 
> controls done by sync to central (then sync blocked) are rare: I'm not sure 
> there is a strong process/tooling. And adding it would cost some management: 
> not easy. IMHO, we should start by first detecting if there are really issues 
> on new artifacts these days before trying to take actions at this level.

That being said, I think the first step is to document the syntax for
GAVs somewhere (e.g., at [2] or [3]).

Best wishes,

Andreas

[1] <https://www.w3.org/TR/REC-xml-names/#NT-NCName>
[2] <http://maven.apache.org/pom.html#Maven_Coordinates>
[3] <http://maven.apache.org/ref/3.5.2/maven-model/maven.html>



Re: Allowed characters in GAV and how/where to sanitize?

Posted by Hervé BOUTEMY <he...@free.fr>.
notice that Central contains artifacts produced by Maven but also by other 
tools: I did some analysis myself and found strange things also that are 
clearly not produced by Maven. Scala for example produces some artifacts that 
I doubt could be referenced by Maven.

Then: what do we call "broken"?
Something that seems "clearly" related to a typo?
Something that can't be consumed by Maven?
Something that people who produced the release (with any tooling) won't 
consume for syntactic reasons on the result? Something that they won't consume 
for other reasons? (like for example because it's continuous deployment and 
it's the 4th version of the day)

And what can we do?
On the past artifacts, removing anything is not really an option: IMHO, the 
issue does not deserve the effort and to break our base rule about 
inalterability.
On the future, perhaps we can do something:
- at Maven level, sure we can and we should improve controls as much as 
possible
- on other build tools: perhaps we should try not only to implement checks in 
Maven but also document rules for other tools to implement same rules
- on repo managers used by the publishers: same rules documentation 
prerequisite, but other tools target
- on sync to central: this is the only location where some rules can be 
checked for absolutely any new artifact then really interesting at a first 
glance. But making rules evolve at this level is really hard since there is no 
real feedback process I know of when base Central publication rules are not 
met. Base Central publication rules were defined from the beginning (signature, 
...), then are implemented by publishers' repo managers. I suppose failed 
controls done by sync to central (then sync blocked) are rare: I'm not sure 
there is a strong process/tooling. And adding it would cost some management: 
not easy. IMHO, we should start by first detecting if there are really issues 
on new artifacts these days before trying to take actions at this level.

Regards,

Hervé

Le mardi 9 janvier 2018, 15:42:41 CET Andreas Sewe a écrit :
> Hi Stephen,
> 
> > In an url path segment space is mapped to +
> 
> not quite. This holds only for the query part of an URI which is
> typically encoding according to the application/x-www-form-urlencoded
> scheme. Elsewhere in a URI, e.g., the path component, a space is simply
> percent-encoded a %20.
> 
> > The repo manager should be blocking those... likely not doing it’s job.
> 
> I agree. IMHO the repo manager should block (if only as a last resort
> for people using something to deploy that doesn't do the check earlier).
> 
> That being said, the situation on Maven Central is not that dire; there
> are very few versions in the wild that I consider broken:
> 
> Additional quotes:
> 
> - "1.0.0
> - '1.0'
> 
> CLI trouble:
> 
> - mvn+release:perform
> - version=1.6.2.1
> 
> Commas instead of dots as separator
> 
> - 1,0
> 
> Expressions or expression-like constructs:
> 
> - ${env.VERSION}
> - ${parent.version}
> - @metro.version@
> - $%7Bcucumber-jvm.version%7D
> 
> If you are interested, I have a more complete list (about 30 entries
> overall), together with a histogram of characters used in versions.
> Interestingly, no non-ASCII characters are used, not even in qualifiers.
> 
> > We probably should also barf on : in a version. There is validation on
> > artifactId and groupId when last I checked
> 
> Different validators barf on different things. The
> org.apache.maven.*project*.validation.DefaultModelValidator used by
> deploy:deploy-file is happy with *any* non-empty version, whereas
> org.apache.maven.*model*.validation.DefaultModelValidator does a bit
> more; in particular, it checks for certain filesystem-unsafe characters,
> including the colon: \ / : " < > | ? *
> 
> I don't really know why deploy:deploy-file prefers one ModelValidator
> over the other, though. Is this a bug in the the maven-deploy-plugin?
> 
> Best wishes,
> 
> Andreas

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@maven.apache.org
For additional commands, e-mail: dev-help@maven.apache.org


Re: Allowed characters in GAV and how/where to sanitize?

Posted by Andreas Sewe <se...@st.informatik.tu-darmstadt.de>.
Hi Stephen,

> In an url path segment space is mapped to +

not quite. This holds only for the query part of an URI which is
typically encoding according to the application/x-www-form-urlencoded
scheme. Elsewhere in a URI, e.g., the path component, a space is simply
percent-encoded a %20.

> The repo manager should be blocking those... likely not doing it’s job.

I agree. IMHO the repo manager should block (if only as a last resort
for people using something to deploy that doesn't do the check earlier).

That being said, the situation on Maven Central is not that dire; there
are very few versions in the wild that I consider broken:

Additional quotes:

- "1.0.0
- '1.0'

CLI trouble:

- mvn+release:perform
- version=1.6.2.1

Commas instead of dots as separator

- 1,0

Expressions or expression-like constructs:

- ${env.VERSION}
- ${parent.version}
- @metro.version@
- $%7Bcucumber-jvm.version%7D

If you are interested, I have a more complete list (about 30 entries
overall), together with a histogram of characters used in versions.
Interestingly, no non-ASCII characters are used, not even in qualifiers.

> We probably should also barf on : in a version. There is validation on
> artifactId and groupId when last I checked

Different validators barf on different things. The
org.apache.maven.*project*.validation.DefaultModelValidator used by
deploy:deploy-file is happy with *any* non-empty version, whereas
org.apache.maven.*model*.validation.DefaultModelValidator does a bit
more; in particular, it checks for certain filesystem-unsafe characters,
including the colon: \ / : " < > | ? *

I don't really know why deploy:deploy-file prefers one ModelValidator
over the other, though. Is this a bug in the the maven-deploy-plugin?

Best wishes,

Andreas


Re: Allowed characters in GAV and how/where to sanitize?

Posted by Stephen Connolly <st...@gmail.com>.
In an url path segment space is mapped to +

The repo manager should be blocking those... likely not doing it’s job.

We probably should also barf on : in a version. There is validation on
artifactId and groupId when last I checked

On Tue 9 Jan 2018 at 09:59, Andreas Sewe <se...@st.informatik.tu-darmstadt.de>
wrote:

> Hi Maven developers,
>
> doing a large-scale analysis of Maven Central, I've come across a couple
> of "weird" GAVs like this one: groupId=com.knappsack,
> artifactId=swagger4spring-web, version=mvn+release:perform [1].
>
> The colon in the version raises the question as to the allowed
> characters in the different components of a GAV. AFAICT, a colon in the
> version is at least rejected by the
> org.eclipse.aether.artifact.DefaultArtifact(String) constructor, so that
> seems to be illegal, but DefaultModelValidator doesn't complain. Also,
> querying the index of Central returns an
> org.apache.maven.index.ArtifactInfo with a version of
> "mvn+release:perform" just fine.
>
> What's the best way to handle this?
>
> Should every plug-in that consumes, say, a Maven Index sanitize the
> results?
>
> Or should this be handled upstream in the repository manager? (Note that
> the POM of [1] has a <version> of "mvn release:perform", but the
> ArtifactInfo's version is "mvn+release:perform", so some sanitation has
> already happened somewhere, probably in Nexus.)
>
> Best wishes,
>
> Andreas
>
> [1]
> <
> http://search.maven.org/#artifactdetails%7Ccom.knappsack%7Cswagger4spring-web%7Cmvn%2Brelease%3Aperform%7Cjar
> >
>
> --
Sent from my phone