Posted to dev@daffodil.apache.org by Mike Beckerle <mb...@apache.org> on 2021/12/03 22:25:14 UTC

simplified schema project layout

Our experience giving DFDL training with Daffodil is that our standard schema
project layout <https://daffodil.apache.org/dfdl-layout/> is much too deep
(directory-wise) for many users to navigate and use conveniently. It gets
in the way of learning.

Our layout was designed to follow sbt conventions that enable automated
dependency management, packaging, etc. It is easy to use if you are
accustomed to using an IDE like Eclipse or IntelliJ. It is also
extraordinarily valuable (and underappreciated) that 'sbt test' runs a
built-in self-test on a schema, and that 'sbt publishLocal' creates a jar
of a DFDL schema for use as a managed dependency between schemas.
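
To illustrate the managed-dependency point, here is a minimal build.sbt
sketch for a schema that depends on another schema previously installed
with 'sbt publishLocal'. The coordinates ("org.example", "dfdl-common",
"0.1.0-SNAPSHOT") are placeholders, not real artifacts, and the single %
assumes the published schema project sets crossPaths := false so that its
artifact name carries no Scala version suffix.

```scala
// build.sbt of the depending schema (sketch with placeholder coordinates)

// Pull in the jar that 'sbt publishLocal' installed in the local Ivy repository.
// Assumption: the published project disabled cross-versioning (crossPaths := false);
// otherwise %% would be needed in place of the first %.
libraryDependencies += "org.example" % "dfdl-common" % "0.1.0-SNAPSHOT"
```

The schema files inside that jar then become available on the classpath of
the depending project when 'sbt test' runs.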

But new users are mostly coming to DFDL/Daffodil from a command-line prompt
and a text editor (e.g., VIM).

I am wondering if we can have our cake and eat it too, without too much
added sbt complexity, and without losing 'sbt test' and 'sbt publishLocal'
working their magic for us.

E.g., what if a simplified layout was:

mySchema/schema - takes the place of src/main/*. Also no package-style
directory folder structure.
mySchema/test - takes the place of src/test/*. No package-style directory
folder structure.

Users could optionally use mySchema/test/data and mySchema/test/infosets
to separate data from infosets, or just put all those files in the same
place and use file extensions (.dat vs. .dat.xml vs. .tdml, etc.) to
distinguish the kinds of content.

Such a flattened tree structure requires that schema file names be chosen
so they are unlikely to conflict with other users' chosen names. A name
like common.dfdl.xsd or main.dfdl.xsd would be no good, as there is no
package directory structure to make it unique.

But names like common-mySchema.dfdl.xsd and main-mySchema.dfdl.xsd would
still be quite convenient to use, particularly if the mySchema name is well
chosen. (Note how I've put the unique part of the name first, so that
name-completion will work most easily on the command line.)

I think this would still work with sbt if we simply override the default
paths (and perhaps file patterns) used for specifying source and resources.
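
As a rough, untested sketch of that override, covering resources only: sbt's
resourceDirectory key can be pointed at the flattened directories. The
"schema" and "test" directory names are the ones proposed above; the
assumption that the project holds no Scala or Java code is mine.

```scala
// build.sbt (sketch; assumes a schema-only project with no UDF or layer code)

// Schemas live in mySchema/schema instead of src/main/resources.
Compile / resourceDirectory := baseDirectory.value / "schema"

// TDML files, test data, and infosets live in mySchema/test
// instead of src/test/resources.
Test / resourceDirectory := baseDirectory.value / "test"
```

With only these two settings, Scala test sources would still be looked up
under src/test/scala, so any test driver code would need a further override
such as Test / scalaSource.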

Thoughts?

Re: simplified schema project layout

Posted by Mike Beckerle <mb...@apache.org>.
I converted the DFDLSchemas CSV example to use the simplified layout.

I actually like this a lot better for simple examples than the
original "standard schema file system layout".

Take a look and see what you think:

https://github.com/DFDLSchemas/CSV/pull/7

Re: simplified schema project layout

Posted by Mike Beckerle <mb...@apache.org>.
I will give this a try.

Re: simplified schema project layout

Posted by Steve Lawrence <sl...@apache.org>.
That's fair, I agree there definitely is some redundancy. In general I'm
not a huge fan of mixing sources and resources, but maybe it's not too
big of a deal in this case, since sources for UDFs/Layers will be
rare, and when they do exist there's probably only a very small number
of them.

I haven't tested this much, but based on some examples and playing 
around a bit, I think this gets you what you're after:

   organization := "org.example"

   name := "dfdl-fmt"

   version := "0.1.0-SNAPSHOT"

   lazy val root = (project in file("."))
     .settings(
       // main code and schemas live in src/, tests in test/
       Project.inConfig(Compile)(flattenSettings("src")),
       Project.inConfig(Test)(flattenSettings("test")),
     )

   // One directory per configuration holds both sources and resources.
   def flattenSettings(name: String) = Seq(
     unmanagedSourceDirectories := Seq(baseDirectory.value / name),
     unmanagedResourceDirectories := unmanagedSourceDirectories.value,
     // sources are the .java and .scala files ...
     unmanagedSources / includeFilter := "*.java" | "*.scala",
     // ... and every other file is treated as a resource
     unmanagedResources / excludeFilter := (unmanagedSources / includeFilter).value,
   )

(note that we probably also want many of the existing settings in our 
current build.sbt files)

All the non-test stuff goes in a "src" directory. Sources are anything 
that ends with .java or .scala. Resources are anything that isn't a source.

And the "test" directory has the exact same layout, but for tests.

The .class files that end up in the jar are namespaced by the package line.

The resources that end up in the jar are namespaced by the directory 
structure and/or file naming convention as they are in the src/ or test/ 
directory. So schema authors can namespace schemas however they want, 
whether it be directories or file names, or not at all.


Re: simplified schema project layout

Posted by Mike Beckerle <mb...@apache.org>.
I guess my concern is that all the depth associated with the sbt-based
standard layout feels completely redundant to me.

I am suggesting that of src/main/scala, we need only main/, and that of
src/main/resources/<kind>, we need only main/.

E.g., why are all the typed subdirs needed (xsd/, dfdl/, etc.) when
file extensions can be used to distinguish resource types and to select
which programming language compiler to use?

To me the only "real" distinction in the standard project layout is
main vs. test which is needed to exclude test stuff when packaging.

The rest is
(a) using directories as "package names" - which can be done with
well-chosen longer file names
(b) using directories as redundant file typing - which can be done
with file name extensions.

To me a UDF is a META-INF/services file and some scala/java code in
the "main" area.
Ditto for a layer definition.

I guess concretely I am wondering if there is a way to override basic
sbt settings like this:

* Instead of src/main/scala, just look for main/*.scala
* Instead of src/main/java, just look for main/*.java
* Instead of src/main/resources/* just look for main/* where the file
name does not end in ".scala" nor ".java"

And similarly for test things, where src/test/whatever just becomes
test/whatever and distinctions are made using file name extensions.

On Wed, Dec 8, 2021 at 9:21 AM Steve Lawrence <sl...@apache.org> wrote:
>
> What about the scala/java/resources directories? Do those still exist or
> are they simplified somehow?
>
> We currently have an xsd/ directory to allow schematron, xslt, etc to be
> included in the same repo. Do we still have that directory?
>
> How do pluggable UDFs and Layers fit into this? Do we suggest those are
> in separate repos, or can they fit into this?
>
> Note that I believe sbt supports organizations in a single directory
> name, e.g.
>
>    src/
>    └── main/
>        └── resources/
>            └── org.foo.myschema/
>                └── xsd/
>                    └── common.xsd
>
> So that could be one approach to reduce the deep directory structures.
>
> Generally, I'm definitely in favor of simplifying the layout, but this
> to me feels like it might just add more confusion since it's sort of
> close to the existing layout, but not quite the same.
>
> If we are potentially going to go against the standards, and potentially
> make IDE support more difficult, I almost wonder if we should be more
> ambitious and come up with something that is completely different? I'm
> not sure what that would be, but could be more flat. For example, maybe
> something like this:
>
>    dfdl-fmt/
>    ├── build.sbt
>    ├── dfdl/
>    │   ├── format.dfdl.xsd
>    │   └── main.dfdl.xsd
>    ├── layer/
>    │   └── MyLayer.scala
>    ├── sch/
>    ├── tdml/
>    │   └── main.tdml
>    ├── udf/
>    │   └── MyUDF.scala
>    └── xslt/
>
> A plugin could implicitly add organization structure so things are
> namespaced when building a jar. Or maybe we even do something like NiFi
> has with .nar files and have a custom package format, e.g. .dar
>
> It's probably a lot more work, with things to work out (e.g. how do
> dependencies work for UDFs and layers), and almost certainly needs a
> plugin to work instead of just tweaking sbt properties, but something
> like that feels more ideal to me.
>
> Note that maybe we don't even use sbt for this. Maybe there's a better
> tool for something like this.
>
> Another related thing to consider: with NiFi we found it
> difficult to add jars to the NiFi classpath for a specific processor,
> which means loading schemas from a jar on the classpath couldn't be
> done. Having a custom package format could make this easier, since all
> the .dar processing/lookup would be done by Daffodil rather than
> standard classpath lookups.
>
>
>

Re: simplified schema project layout

Posted by Steve Lawrence <sl...@apache.org>.
What about the scala/java/resources directories? Do those still exist or 
are they simplified somehow?

We currently have an xsd/ directory to allow schematron, xslt, etc to be 
included in the same repo. Do we still have that directory?

How do pluggable UDFs and layers fit into this? Do we suggest those go
in separate repos, or can they fit into this?

Note that I believe sbt supports organizations in a single directory 
name, e.g.

   src/
   └── main/
       └── resources/
           └── org.foo.myschema/
               └── xsd/
                   └── common.xsd

So that could be one approach to reduce the deep directory structures.

Generally, I'm definitely in favor of simplifying the layout, but this 
to me feels like it might just add more confusion since it's sort of 
close to the existing layout, but not quite the same.

If we are potentially going to go against the standards, and potentially 
make IDE support more difficult, I almost wonder if we should be more 
ambitious and come up with something that is completely different? I'm
not sure what that would be, but it could be flatter. For example, maybe
something like this:

   dfdl-fmt/
   ├── build.sbt
   ├── dfdl/
   │   ├── format.dfdl.xsd
   │   └── main.dfdl.xsd
   ├── layer/
   │   └── MyLayer.scala
   ├── sch/
   ├── tdml/
   │   └── main.tdml
   ├── udf/
   │   └── MyUDF.scala
   └── xslt/
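
As a rough illustration of the settings-only part of this idea, the
fragment below points sbt at the flat directories in the tree above. The
directory names come from that tree; treating sch/ and xslt/ as main
resources and tdml/ as test resources is an assumption. The sketch does not
cover how 'sbt test' would discover and run the TDML files, or how UDF and
layer dependencies would be declared.

```scala
// build.sbt (sketch only, not a tested Daffodil build)

// Scala/Java sources for layers and UDFs live directly under layer/ and udf/.
Compile / unmanagedSourceDirectories := Seq(
  baseDirectory.value / "layer",
  baseDirectory.value / "udf"
)

// Schemas and related artifacts are packaged as resources (assumption:
// Schematron and XSLT files belong in the jar alongside the DFDL schemas).
Compile / unmanagedResourceDirectories := Seq(
  baseDirectory.value / "dfdl",
  baseDirectory.value / "sch",
  baseDirectory.value / "xslt"
)

// Assumption: TDML files are only needed on the test classpath.
Test / unmanagedResourceDirectories := Seq(baseDirectory.value / "tdml")
```

Because := replaces the defaults rather than appending to them, the usual
src/main source and resource directories are no longer consulted once
these settings are in place.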

A plugin could implicitly add organization structure so things are
namespaced when building a jar. Or maybe we even do something like NiFi
does with .nar files and have a custom package format, e.g. .dar
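
One way the implicit namespacing might be approximated without a full
plugin is to rewrite the jar entry paths at packaging time. The fragment
below is a sketch under assumptions: the organization and name values are
illustrative, the schema files are assumed to already be configured as
resources, and the "org.name/" prefix convention is hypothetical rather
than anything Daffodil defines.

```scala
// build.sbt fragment (sketch). organization and name are illustrative values.
organization := "org.foo"
name := "myschema"

// Prefix every non-class entry in the jar with "<organization>.<name>/",
// so a flat main.dfdl.xsd on disk is packaged as
// org.foo.myschema/main.dfdl.xsd. Compiled classes keep their paths,
// since those are already namespaced by their package.
Compile / packageBin / mappings := {
  val prefix = organization.value + "." + name.value
  (Compile / packageBin / mappings).value.map {
    case (file, path) if !path.endsWith(".class") => (file, prefix + "/" + path)
    case other => other
  }
}
```

Schemas that import one another would then have to use the prefixed paths
in their schemaLocation attributes, which is part of what a real plugin or
custom package format would need to handle.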

It's probably a lot more work, with things to work out (e.g. how do
dependencies work for UDFs and layers), and it almost certainly needs a
plugin instead of just tweaking sbt properties, but something
like that feels more ideal to me.

Note that maybe we don't even use sbt for this. Maybe there's a better 
tool for something like this.

Another related thing to consider: with NiFi we found it
difficult to add jars to the NiFi classpath for a specific processor,
which means loading schemas from a jar on the classpath couldn't be
done. Having a custom package format could make this easier, since all
the .dar processing/lookup would be done by Daffodil rather than
standard classpath lookups.

