Posted to dev@drill.apache.org by Mike Beckerle <mb...@apache.org> on 2023/10/12 18:58:03 UTC

Drill TupleMetadata created from DFDL Schema - how do I inform Drill about it

So when a data format is described by a DFDL schema, I can generate an
equivalent Drill schema (TupleMetadata). This schema is always complete. I
have unit tests working with this.
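
For a concrete (made-up) illustration of what that generated schema looks like,
here is roughly the kind of TupleMetadata the conversion produces, written out
with Drill's SchemaBuilder (the element names are invented):

import org.apache.drill.common.types.TypeProtos.MinorType;
import org.apache.drill.exec.record.metadata.SchemaBuilder;
import org.apache.drill.exec.record.metadata.TupleMetadata;

// Hypothetical: a small DFDL record with a required string, an optional int,
// a repeating element, and a nested complex element maps to a Drill schema like:
static TupleMetadata exampleGeneratedSchema() {
  return new SchemaBuilder()
      .add("name", MinorType.VARCHAR)       // required simple element
      .addNullable("count", MinorType.INT)  // optional simple element
      .addArray("items", MinorType.BIGINT)  // repeating simple element
      .addMap("header")                     // nested complex element
        .add("version", MinorType.INT)
        .resumeSchema()
      .buildSchema();
}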

To do this for a real SQL query, I need the DFDL schema to be identified on
the SQL query by a file path or URI.

Q: How do I get that DFDL schema File/URI parameter from the SQL query?

Next, assuming I have the DFDL schema identified, I generate an equivalent
Drill TupleMetadata from it. (Or, hopefully retrieve it from a cache)

What objects do I call, or what classes do I have to create to make this
Drill TupleMetadata available to Drill so it uses it in all the ways a
static Drill schema can be useful?

I just need pointers to the code that illustrate how to do this. Thanks

-Mike Beckerle










On Thu, Oct 12, 2023 at 12:13 AM Paul Rogers <pa...@gmail.com> wrote:

> Mike,
>
> This is a complex question and has two answers.
>
> First, the standard enhanced vector framework (EVF) used by most readers
> assumes a "pull" model: read each record. This is where the next() comes
> in: readers just implement this to read the next record. But, the code
> under EVF works with a push model: the readers write to vectors, and signal
> the next record. EVF translates the lower-level push model to the
> higher-level, easier-to-use pull model. The best example of this is the
> JSON reader which uses Jackson to parse JSON and responds to the
> corresponding events.
>
> You can thus take over the task of filling a batch of records. I'd have to
> poke around the code to refresh my memory. Or, you can take a look at the
> (quite complex) JSON parser, or the EVF itself to see what it does. There
> are many unit tests that show this at various levels of abstraction.
>
> Basically, you have to:
>
> * Start a batch
> * Ask if you can start the next record (which might be declined if the
> batch is full)
> * Write each field. For complex fields, such as records, recursively do the
> start/end record work.
> * Mark the record as complete.
>
> You should be able to map event handlers to EVF actions as a result. Even
> though DFDL wants to "drive", it still has to give up control once the
> batch is full. EVF will then handle the (surprisingly complex) task of
> finishing up the batch and returning it as the output of the Scan operator.
>
> - Paul
>
> On Wed, Oct 11, 2023 at 6:30 PM Mike Beckerle <mb...@apache.org>
> wrote:
>
> > Daffodil parsing generates event callbacks to an InfosetOutputter, which
> is
> > analogous to a SAX event handler.
> >
> > Drill is expecting an iterator style of calling next() to advance through
> > the input, i.e., Drill has the control thread and expects to do pull
> > parsing. At least from the code I studied in the format-xml contrib.
> >
> > Is there any alternative? Before I dig into creating another one of these
> > co-routine-style control inversions (which have proven to be problematic
> > for performance).
> >
>

Re: Drill TupleMetadata created from DFDL Schema - how do I inform Drill about it

Posted by Paul Rogers <pa...@gmail.com>.
Mike,

Excellent progress! Very impressive.

So now you are talking about the planning side of things. There are
multiple ways this could be done. Let's start with some basics. Recall that
Drill is distributed: a file can be in S3 or old-school HDFS (along with
other variations). When Drill is run as intended, you'll have a cluster of
10, 20 or more Drillbits, all on distinct nodes or K8s pods, any of which
can be asked to plan the query, and so all of them need visibility to the
shared pool of metadata.

I would argue that the simplest solution from the user's perspective is for
Drill, not the user, to associate a Daffodil schema with a file. That is, I
set up the definition once, then Drill uses that for each of my dozens (or
hundreds) of queries against that file. The alternative is to remember to
include the schema information in every query, which will get old quickly.

The simplest option is to put a schema file in the same "directory"
(however that is defined in the target distributed file system) as the
data: something like "filename.dfdl" or ".schema.dfdl", depending on
whether the schema describes a single file or (more likely) a collection of
files. The planner simply looks for the schema file based on the file name
(filename.dfdl) or location (/path/to/data/.schema.dfdl). And there is a
precedent: this is how Drill finds its old-school Parquet metadata cache.
This works, but is clunky: mixing data (which is likely generated and
expired by a pipeline) with metadata (which changes slowly and is managed
by hand) is somewhat awkward from an operations perspective. (This was one
of the many issues with that first-gen Parquet metadata cache solution.)
Still, it is the simplest option to get going.
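
A rough sketch of that lookup, just to make the idea concrete (the helper and
the naming convention are illustrative, not existing Drill code):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative only: look for "<data file>.dfdl" next to the data file, then fall
// back to a per-directory ".schema.dfdl", much like the old Parquet metadata
// cache is discovered by location.
static Path findDfdlSchema(FileSystem fs, Path dataFile) throws java.io.IOException {
  Path perFile = new Path(dataFile.getParent(), dataFile.getName() + ".dfdl");
  if (fs.exists(perFile)) {
    return perFile;
  }
  Path perDir = new Path(dataFile.getParent(), ".schema.dfdl");
  return fs.exists(perDir) ? perDir : null;
}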

A somewhat more useful way is to integrate schema info not with the file,
but with the storage plugin. The plugin could point to the location of the
schema for the files available from that plugin. This works because
Daffodil really only applies to file-like objects. Things like DBs or APIs
or Kafka streams have their own plugins that typically provide their own
schema. So associating schema with a plugin might make sense. Plugin A
defines our Daffodil-ready files, while plugin B is for ad-hoc data with no
schema. A new property on the plugin might provide the location where the
Daffodil schema files are stored. Matching could be done by name, file
path, file name pattern matching, or whatever. Basically, you'd be adding a
property to the DFS (distributed file system) storage plugin, along with
plan-time code to make use of the property.
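
As a sketch, that property might look roughly like the following (a hypothetical
config class; only FormatPluginConfig and the Jackson annotations are existing
API, everything else is invented):

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.annotation.JsonTypeName;
import java.util.Objects;
import org.apache.drill.common.logical.FormatPluginConfig;

// Hypothetical: a plugin-level property pointing at where DFDL schemas live.
@JsonTypeName("daffodil")
public class DaffodilFormatConfig implements FormatPluginConfig {
  private final String schemaLocation;  // e.g. a DFS directory or base URI

  @JsonCreator
  public DaffodilFormatConfig(@JsonProperty("schemaLocation") String schemaLocation) {
    this.schemaLocation = schemaLocation;
  }

  public String getSchemaLocation() { return schemaLocation; }

  // Drill config classes need value-based equality.
  @Override
  public int hashCode() { return Objects.hash(schemaLocation); }

  @Override
  public boolean equals(Object o) {
    return o instanceof DaffodilFormatConfig
        && Objects.equals(schemaLocation, ((DaffodilFormatConfig) o).schemaLocation);
  }
}

The plan-time code would then combine that location with the table name (by
name, path, or pattern matching) to pick the schema for a given file.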

Once you have the property, it could be set for the plugin, or provided in
each query. Use table functions to provide the schema:

SELECT *
FROM myFile.json (`schema` => '/path/to/schema/myFile.dfdl')

(Don't quote me on the syntax. Instead, find the unit tests that exercise
this feature to override plugin options.) The implementation of this
feature starts with the properties from the storage plugin, then allows you
to add (or overwrite) properties per-query. You should get this behavior
for free after adding a property to the plugin as described above.

You can then exploit the above approach to encourage people to wrap the
above "base query" query in a view. However, the view approach is a bit
clunky because it mixes query considerations with schema considerations.
Schema is about a file, not a particular base query on that file. Still,
you'd get the view approach for free, which is a strong argument for such
an approach.

There is a completely different approach: create a "Daffodil metastore":
something that applies to all plugins, a bit like Drill's own (seldom-used)
metastore. That is, implement a metastore, using Drill's metastore API
(which I hope we have), that stores schemas as Daffodil files rather than
the standard DB implementation. The underlying storage could be a (shared)
directory, a document store or whatever. Once you convert the Daffodil
format to Drill's internal format, you would leverage the large amount of
existing plan-time code. That is, you trade off having to become a planner
expert against becoming a metastore API expert. The advantage is that
schema is completely separated from storage: Drill "just knows" which
schema to use (as defined by the admin), the users (of which we hope there
will be many) don't care: they just get the right results without needing
to fiddle with the details.

There may be another alternative as well that I've missed. You'll find that
this area of the product will be a bit more of a challenge than the runtime
portion. The team has added many implementations of storage and format
plugins, so that area is well understood. There have been only a couple of
metadata implementations, so that area is at an earlier stage of evolution.
Still, you can look at the Parquet metadata and Drill metastore
implementations (neither of which is simple) for ideas about how to
approach the implementation.

I hope this provides a few hints to get you started.

- Paul


On Thu, Oct 12, 2023 at 11:58 AM Mike Beckerle <mb...@apache.org> wrote:

> So when a data format is described by a DFDL schema, I can generate
> equivalent Drill schema (TupleMetadata). This schema is always complete. I
> have unit tests working with this.
>
> To do this for a real SQL query, I need the DFDL schema to be identified on
> the SQL query by a file path or URI.
>
> Q: How do I get that DFDL schema File/URI parameter from the SQL query?
>
> Next, assuming I have the DFDL schema identified, I generate an equivalent
> Drill TupleMetadata from it. (Or, hopefully retrieve it from a cache)
>
> What objects do I call, or what classes do I have to create to make this
> Drill TupleMetadata available to Drill so it uses it in all the ways a
> static Drill schema can be useful?
>
> I just need pointers to the code that illustrate how to do this. Thanks
>
> -Mike Beckerle
>
>
>
>
>
>
>
>
>
>
> On Thu, Oct 12, 2023 at 12:13 AM Paul Rogers <pa...@gmail.com> wrote:
>
> > Mike,
> >
> > This is a complex question and has two answers.
> >
> > First, the standard enhanced vector framework (EVF) used by most readers
> > assumes a "pull" model: read each record. This is where the next() comes
> > in: readers just implement this to read the next record. But, the code
> > under EVF works with a push model: the readers write to vectors, and
> signal
> > the next record. EVF translates the lower-level push model to the
> > higher-level, easier-to-use pull model. The best example of this is the
> > JSON reader which uses Jackson to parse JSON and responds to the
> > corresponding events.
> >
> > You can thus take over the task of filling a batch of records. I'd have
> to
> > poke around the code to refresh my memory. Or, you can take a look at the
> > (quite complex) JSON parser, or the EVF itself to see what it does. There
> > are many unit tests that show this at various levels of abstraction.
> >
> > Basically, you have to:
> >
> > * Start a batch
> > * Ask if you can start the next record (which might be declined if the
> > batch is full)
> > * Write each field. For complex fields, such as records, recursively do
> the
> > start/end record work.
> > * Mark the record as complete.
> >
> > You should be able to map event handlers to EVF actions as a result. Even
> > though DFDL wants to "drive", it still has to give up control once the
> > batch is full. EVF will then handle the (surprisingly complex) task of
> > finishing up the batch and returning it as the output of the Scan
> operator.
> >
> > - Paul
> >
> > On Wed, Oct 11, 2023 at 6:30 PM Mike Beckerle <mb...@apache.org>
> > wrote:
> >
> > > Daffodil parsing generates event callbacks to an InfosetOutputter,
> which
> > is
> > > analogous to a SAX event handler.
> > >
> > > Drill is expecting an iterator style of calling next() to advance
> through
> > > the input, i.e., Drill has the control thread and expects to do pull
> > > parsing. At least from the code I studied in the format-xml contrib.
> > >
> > > Is there any alternative? Before I dig into creating another one of
> these
> > > co-routine-style control inversions (which have proven to be
> problematic
> > > for performance.
> > >
> >
>

Re: Drill TupleMetadata created from DFDL Schema - how do I inform Drill about it

Posted by Paul Rogers <pa...@gmail.com>.
Hi Mike,

Earlier on, there were two approaches discussed:

1. Using a Daffodil schema to map to a Drill schema, and use Drill's
existing schema mechanisms for all of Drill's existing input formats.
2. Using a Daffodil-specific reader so that Daffodil does the data parsing.

Some of my earlier answers assumed you were doing option 1. The code shows
you are doing option 2. There are pros and cons, but let's just focus on
option 2 for now.

You need a way for a reader (running on Drillbit 2) to get a schema from a
query (planned on Drillbit 1). How does the Daffodil schema get from Node 1
to Node 2? Charles suggested ZK; I suggested that is not such a great idea,
for a number of reasons. A more "Drill-like" way would be to include the
Daffodil schema in the query plan: either as JSON or as a binary blob. The
planner attaches the schema when creating the reader definition; the reader
deserializes the schema at run time.
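
To make the "binary blob" option concrete, the encode/decode part is plain Java;
the plan object it would hang off of (a field on the sub-scan or format config)
is not shown, and the method names are illustrative:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

// Illustration only: pack a compiled Daffodil schema into a String so it can ride
// along in any Jackson-serialized plan object, then unpack it on the executing
// Drillbit before handing it to Daffodil.
static String encodeCompiledSchema(String compiledSchemaPath) throws IOException {
  byte[] bytes = Files.readAllBytes(Paths.get(compiledSchemaPath));
  return Base64.getEncoder().encodeToString(bytes);
}

static byte[] decodeCompiledSchema(String blob) {
  return Base64.getDecoder().decode(blob);
}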

I believe you said schemas can be large. So, you could instead serialize a
reference. To do that, you'd need a location visible to all Drill nodes:
HDFS, S3, web server, etc. A crude-but-effective approach to get started is
the one mentioned for Drill's own metadata: the schema must reside in the
same directory as the data. This opens up issues with update race
conditions, as noted earlier. But, it could work if you are "careful." If
there is a Daffodil schema server, that would be better.

Given all that, your DaffodilBatchReader is generally headed in the right
direction. The same is true of DaffodilDrillInfosetOutputter, though, for
performance, you'll want to cache the column readers rather than do
name-based lookups for every column for every row. (Drill is designed to
read billions of rows; that's a lot of lookups!) But, that can be optimized
once things work.
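
A rough sketch of that caching, using EVF's RowSetLoader and ScalarWriter (the
column names and the wrapper class are invented):

import org.apache.drill.exec.physical.resultSet.RowSetLoader;
import org.apache.drill.exec.vector.accessor.ScalarWriter;

// Sketch only: look up each column writer once, then reuse it for every row,
// instead of calling rowWriter.scalar("colName") in the per-row event handlers.
class CachedWriters {
  private final RowSetLoader rowWriter;
  private final ScalarWriter nameWriter;
  private final ScalarWriter countWriter;

  CachedWriters(RowSetLoader rowWriter) {
    this.rowWriter = rowWriter;
    this.nameWriter = rowWriter.scalar("name");   // one-time, name-based lookup
    this.countWriter = rowWriter.scalar("count");
  }

  // Called once per row by the InfosetOutputter events (values here are made up).
  void writeRow(String name, int count) {
    rowWriter.start();
    nameWriter.setString(name);
    countWriter.setInt(count);
    rowWriter.save();
  }
}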

You'll soon be at a place where you'll want to do some debugging. The
S-L-O-W way is to build Drill, fire off a query, and sort out what went
wrong, perhaps attaching a debugger. Another slow way is to fire up a
Drillbit in your test and run a query. (Such a test is a great integration
test, however.)
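
For reference, that Drillbit-in-a-test style usually looks something like the
following in the contrib plugin tests (treat the fixture and method names as
from memory, and verify them against the current test framework):

import org.apache.drill.exec.physical.rowSet.RowSet;
import org.apache.drill.test.ClusterFixture;
import org.apache.drill.test.ClusterTest;
import org.junit.BeforeClass;
import org.junit.Test;

// Sketch of a Drillbit-in-a-test integration test: slow, but a real end-to-end check.
public class TestDaffodilReader extends ClusterTest {

  @BeforeClass
  public static void setup() throws Exception {
    startCluster(ClusterFixture.builder(dirTestWatcher));
    // Register or enable the Daffodil format plugin for the test workspace here.
  }

  @Test
  public void testSimpleQuery() throws Exception {
    String sql = "SELECT * FROM dfs.`data/complexArray1.dat`";
    RowSet results = client.queryBuilder().sql(sql).rowSet();
    // Normally you would compare against an expected RowSet built with RowSetBuilder.
    results.clear();
  }
}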

A good way to debug is to create a test that includes just your reader and
surrounding plumbing. This way, you can set up very specific cases and
easily debug, in a single thread, right from your IDE. The JSON reader
tests may have some examples. Charles may have others.

Thanks,

- Paul

On Wed, Oct 18, 2023 at 4:06 PM Charles Givre <cg...@gmail.com> wrote:

> Got it.  I’ll review today and tomorrow and hopefully we can get you
> unblocked.
> Sent from my iPhone
>
> > On Oct 18, 2023, at 18:01, Mike Beckerle <mb...@apache.org> wrote:
> >
> > I am very much hoping someone will look at my open PR soon.
> > https://github.com/apache/drill/pull/2836
> >
> > I am basically blocked on this effort until you help me with one key area
> > of that.
> >
> > I expect the part I am puzzling over is routine to you, so it will save
> me
> > much effort.
> >
> > This is the key area in the DaffodilBatchReader.java code:
> >
> >  // FIXME: Next, a MIRACLE occurs.
> >  //
> >  // We get the dfdlSchemaURI filled in from the query, or a default
> config
> > location
> >  // We get the rootName (or null if not supplied) from the query, or a
> > default config location
> >  // We get the rootNamespace (or null if not supplied) from the query, or
> > a default config location
> >  // We get the validationMode (true/false) filled in from the query or a
> > default config location
> >  // We get the dataInputURI filled in from the query, or from a default
> > config location
> >  //
> >  // For a first cut, let's just fake it. :-)
> >  boolean validationMode = true;
> >  URI dfdlSchemaURI = new URI("schema/complexArray1.dfdl.xsd");
> >  String rootName = null;
> >  String rootNamespace = null;
> >  URI dataInputURI = new URI("data/complexArray1.dat");
> >
> >
> > I imagine this is just a few lines of code to grab these from the query,
> > and i don't even care about config files for now.
> >
> > I gave up on trying to figure out how to do this myself. It was actually
> > quite unclear from looking at the other format plugins. The way Drill
> does
> > configuration is obviously motivated by the distributed architecture
> > combined with pluggability, but all that combined with the negotiation
> over
> > schemas which extends into runtime, and it all became quite muddy to me.
> I
> > think what I need is super straightforward, so i figured I should just
> > ask.
> >
> > This is just to get enough working (against local files only) that I can
> be
> > unblocked on creating and testing the rest of the Daffodil-to-Drill
> > metadata bridge and data bridge.
> >
> > My plan is to get all kinds of data and queries working first but just
> > against local-only files.  Fixing it to work in distributed Drill can
> come
> > later.
> >
> > -mikeb
> >
> >> On Wed, Oct 18, 2023 at 2:11 PM Paul Rogers <pa...@gmail.com> wrote:
> >>
> >> Hi Charles,
> >>
> >> The persistent store is just ZooKeeper, and ZK is known to work poorly
> as
> >> a distributed DB. ZK works great for things like tokens, node
> registrations
> >> and the like. But, ZK scales very poorly for things like schemas (or
> query
> >> profiles or a list of active queries.)
> >>
> >> A more scalable approach may be to cache the schemas in each Drillbit,
> >> then translate them to Drill's format and include them in each Scan
> >> operator definition sent to each execution Drillbit. That solution
> avoids
> >> race conditions when the schemas change while a query is in flight. This
> >> is, in fact, the model used for storage plugin definitions. (The storage
> >> plugin definitions are, in fact, stored in ZK, but tend to be small and
> few
> >> in number.)
> >>
> >> - Paul
> >>
> >>
> >>> On Wed, Oct 18, 2023 at 7:51 AM Charles Givre <cg...@gmail.com>
> wrote:
> >>>
> >>> Hi Mike,
> >>> I hope all is well.  I remembered one other piece which might be useful
> >>> for you.  Drill has an interface called a PersistentStore which is
> used for
> >>> storing artifacts such as tokens etc.  I've used it on two occasions:
> in
> >>> the GoogleSheets plugin and the Http plugin.  In both cases, I used it
> to
> >>> store OAuth user tokens which need to be preserved and shared across
> >>> drillbits, and also frequently updated.  I was thinking that this
> might be
> >>> useful for caching the DFDL schemata.  If you take a look here:
> >>>
> https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/oauth/AccessTokenRepository.java
> ,
> >>>
> >>>
> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/oauth
> .
> >>> and here
> >>>
> https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/HttpStoragePlugin.java
> ,
> >>> you can see how I used that.
> >>>
> >>> Best,
> >>> -- C
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>> On Oct 13, 2023, at 1:25 PM, Mike Beckerle <mb...@apache.org>
> >>> wrote:
> >>>>
> >>>> Very helpful.
> >>>>
> >>>> Answers to your questions, and comments are below:
> >>>>
> >>>> On Thu, Oct 12, 2023 at 5:14 PM Charles Givre <cgivre@gmail.com
> >>> <ma...@gmail.com>> wrote:
> >>>>> HI Mike,
> >>>>> I hope all is well.  I'll take a stab at answering your questions.
> >>> But I have a few questions as well:
> >>>>>
> >>>>> 1.  Are you writing a storage or format plugin for DFDL?  My thinking
> >>> was that this would be a format plugin, but let me know if you were
> >>> thinking differently
> >>>>
> >>>> Format plugin.
> >>>>
> >>>>> 2.  In traditional deployments, where do people store the DFDL
> >>> schemata files?  Are they local or accessible via URL?
> >>>>
> >>>> Schemas are stored in files, or in jar files created when packaging a
> >>> schema project. Hence URI is the preferred identifier for them.  They
> are
> >>> not retrieved remotely or anything like that. It's a matter of whether
> they
> >>> are in jars on the classpath, directories on the classpath, or just a
> file
> >>> location.
> >>>>
> >>>> The source-code of DFDL schemas are often created using other schemas
> >>> as components, so a single "DFDL schema" may have parts that come from
> 5
> >>> jar files on the classpath e.g., 2 different header schemas, a library
> >>> schema, and the "main" schema that assembles them all.  Inside schemas
> they
> >>> refer to each other via xs:include or xs:import, and the schemaLocation
> >>> attribute takes a URI to the location of the included/imported schema
> and
> >>> those URIs are interpreted this same way we would want Drill to
> identify
> >>> the location of a schema.
> >>>>
> >>>> However, really people will want to pre-compile any real non-toy/test
> >>> DFDL schemas into binary ".bin" files for faster loading. Otherwise
> >>> Daffodil schema compilation time can be excessive (minutes for large
> DFDL
> >>> schemas - for example the DFDL schema for VMF is 180K lines of DFDL).
> >>> Compiled schemas live in exactly 1 file (relatively small. The compiled
> >>> form of VMF schema is 8Mbytes). So the path given for schema in Drill
> sql
> >>> query, or in the config wants to be allowed to be either a compiled
> schema
> >>> or a source-code schema (.xsd) this latter mostly being for test,
> training,
> >>> and toy examples that we would compile on-the-fly.
> >>>>
> >>>>> To get the DFDL schema file or URL we have a few options, all of
> which
> >>> revolve around setting a config variable.  For now, let's just say
> that the
> >>> schema file is contained in the same folder as the data.  (We can make
> this
> >>> more sophisticated later...)
> >>>>
> >>>> It would make life difficult if the schemas and test data must be
> >>> co-resident. Most schema projects have these in entirely separate
> >>> sub-trees. Schema will be under src/main/resources/..../xsd, compiled
> >>> schema would be under target/... and test data under
> >>> src/test/resources/.../data
> >>>>
> >>>> For now I think the easiest thing is just we get two URIs. One is for
> >>> the data, one is for the schema. We access them via
> >>> getClass().getResource().
> >>>>
> >>>> We should not worry about caching or anything for now. Once the above
> >>> works for a decent scope of tests we can worry about making it more
> >>> convenient to have a library of schemas at one's disposal.
> >>>>
> >>>>>
> >>>>> Here's what you have to do.
> >>>>>
> >>>>> 1.  In the formatConfig file, define a String called 'dfdlSchema'.
> >>> Note... config variables must be private and final.  If they aren't it
> can
> >>> cause weird errors that are really difficult to debug.  For some
> reference,
> >>> take a look at the Excel plugin.  (
> >>>
> https://github.com/apache/drill/blob/master/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelFormatConfig.java
> >>> )
> >>>>>
> >>>>> Setting a config variable there will allow a user to set a global
> >>> schema definition.  This can also be configured individually for
> various
> >>> workspaces.  So let's say you had PCAP files in one workspace, you
> could
> >>> globally set the DFDL file for that and then another workspace which
> has
> >>> some other file, you could create another DFDL plugin instance for
> that.
> >>>>
> >>>> Ok, so the above lets me play with Drill and one schema by default. Ok
> >>> for using Drill to explore data, and useful for testing.
> >>>>
> >>>>>
> >>>>> Now, this is all fine and good, but a user might also want to define
> >>> the schema file at query time.  The good news is that Drill allows you
> to
> >>> do that via the table() function.
> >>>>>
> >>>>
> >>>> This would allow real data-integration queries against multiple
> >>> different DFDL-described data sources. Needed for a compelling demo.
> >>>>
> >>>>> So let's say that we want to use a different schema file than the
> >>> default, we could do something like this:
> >>>>>
> >>>>> SELECT ....
> >>>>> FROM table(dfs.dfdl_workspace.`myfile` (type=>'dfdl',
> >>> dfdlSchema=>'path_to_schema')
> >>>>>
> >>>>> Take a look at the Excel docs (
> >>>
> https://github.com/apache/drill/blob/master/contrib/format-excel/README.md
> )
> >>> which demonstrate how to write queries like that.  I believe that the
> >>> parameters in the table function take higher precedence than the
> parameters
> >>> from the config.  That would make sense at least.
> >>>>>
> >>>>
> >>>> Perfect. I'll start with this.
> >>>>
> >>>>>
> >>>>> 2.  Now that we have the schema file, the next thing would be to
> >>> convert that into a Drill schema.  Let's say that we have a function
> called
> >>> dfdlToDrill that handles the conversion.
> >>>>>
> >>>>> What you'd have to do is in the constructor for the BatchReader,
> you'd
> >>> have to set the schema there.  So pseudo code:
> >>>>>
> >>>>> public DFDLBatchReader(DFDLReaderConfig, EasySubScan scan,
> >>> FileSchemaNegotiator negotiator) {
> >>>>>     // Other stuff...
> >>>>>
> >>>>>     // Get Drill schema from DFDL
> >>>>>     TupleMetadata schema = dfdlToDrill(<dfdl schema file>);
> >>>>>
> >>>>>     // Here's the important part
> >>>>>     negotiator.tableSchema(schema, true);
> >>>>> }
> >>>>>
> >>>>> The negotiator.tableSchema() accepts two args, a TupleMetadata and a
> >>> boolean as to whether the schema is final or not.  Once this schema has
> >>> been added to the negotiator object, you can then create the writers.
> >>>>>
> >>>>
> >>>> That negotiator.tableSchema() is ideal. I was hoping that this was
> >>> going to be the only place the metadata had to be given to drill.
> >>> Excellent.
> >>>>
> >>>>>
> >>>>> Take a look here...
> >>>>>
> >>>>>
> >>>>>
> >>>
> drill/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
> >>> at 2ab46a9411a52f12a0f9acb1144a318059439bc4 · apache/drill
> >>>>> github.com
> >>>>> <
> >>>
> https://github.com/apache/drill/blob/2ab46a9411a52f12a0f9acb1144a318059439bc4/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java#L199
> >drill/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
> >>> at 2ab46a9411a52f12a0f9acb1144a318059439bc4 · apache/drill <
> >>>
> https://github.com/apache/drill/blob/2ab46a9411a52f12a0f9acb1144a318059439bc4/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java#L199
> >>>>
> >>>>> github.com <
> >>>
> https://github.com/apache/drill/blob/2ab46a9411a52f12a0f9acb1144a318059439bc4/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java#L199
> >>>>
> >>>>>
> >>>>>
> >>>>> I see Paul just responded so I'll leave you with this.  If you have
> >>> additional questions, send them our way.  Do take a look at the Excel
> >>> plugin as I think it will be helpful.
> >>>>>
> >>>> Yes, I've found the JsonLoaderImpl.readBatch() method, and Daffodil
> can
> >>> work similarly.
> >>>>
> >>>> This will take me a few more days to get to a pull request. The first
> >>> one will be initial review, i.e., not intended to merge without more
> tests.
> >>> Probably it will support only integer data fields, but should support
> lots
> >>> of data shapes including vectors, choices, sequences, nested records,
> etc.
> >>>>
> >>>> Thanks for the help.
> >>>>
> >>>>>
> >>>>>> On Oct 12, 2023, at 2:58 PM, Mike Beckerle <mbeckerle@apache.org
> >>> <ma...@apache.org>> wrote:
> >>>>>>
> >>>>>> So when a data format is described by a DFDL schema, I can generate
> >>>>>> equivalent Drill schema (TupleMetadata). This schema is always
> >>> complete. I
> >>>>>> have unit tests working with this.
> >>>>>>
> >>>>>> To do this for a real SQL query, I need the DFDL schema to be
> >>> identified on
> >>>>>> the SQL query by a file path or URI.
> >>>>>>
> >>>>>> Q: How do I get that DFDL schema File/URI parameter from the SQL
> >>> query?
> >>>>>>
> >>>>>> Next, assuming I have the DFDL schema identified, I generate an
> >>> equivalent
> >>>>>> Drill TupleMetadata from it. (Or, hopefully retrieve it from a
> cache)
> >>>>>>
> >>>>>> What objects do I call, or what classes do I have to create to make
> >>> this
> >>>>>> Drill TupleMetadata available to Drill so it uses it in all the
> ways a
> >>>>>> static Drill schema can be useful?
> >>>>>>
> >>>>>> I just need pointers to the code that illustrate how to do this.
> >>> Thanks
> >>>>>>
> >>>>>> -Mike Beckerle
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Thu, Oct 12, 2023 at 12:13 AM Paul Rogers <par0328@gmail.com
> >>> <ma...@gmail.com>> wrote:
> >>>>>>
> >>>>>>> Mike,
> >>>>>>>
> >>>>>>> This is a complex question and has two answers.
> >>>>>>>
> >>>>>>> First, the standard enhanced vector framework (EVF) used by most
> >>> readers
> >>>>>>> assumes a "pull" model: read each record. This is where the next()
> >>> comes
> >>>>>>> in: readers just implement this to read the next record. But, the
> >>> code
> >>>>>>> under EVF works with a push model: the readers write to vectors,
> and
> >>> signal
> >>>>>>> the next record. EVF translates the lower-level push model to the
> >>>>>>> higher-level, easier-to-use pull model. The best example of this is
> >>> the
> >>>>>>> JSON reader which uses Jackson to parse JSON and responds to the
> >>>>>>> corresponding events.
> >>>>>>>
> >>>>>>> You can thus take over the task of filling a batch of records. I'd
> >>> have to
> >>>>>>> poke around the code to refresh my memory. Or, you can take a look
> >>> at the
> >>>>>>> (quite complex) JSON parser, or the EVF itself to see what it does.
> >>> There
> >>>>>>> are many unit tests that show this at various levels of
> abstraction.
> >>>>>>>
> >>>>>>> Basically, you have to:
> >>>>>>>
> >>>>>>> * Start a batch
> >>>>>>> * Ask if you can start the next record (which might be declined if
> >>> the
> >>>>>>> batch is full)
> >>>>>>> * Write each field. For complex fields, such as records,
> recursively
> >>> do the
> >>>>>>> start/end record work.
> >>>>>>> * Mark the record as complete.
> >>>>>>>
> >>>>>>> You should be able to map event handlers to EVF actions as a
> result.
> >>> Even
> >>>>>>> though DFDL wants to "drive", it still has to give up control once
> >>> the
> >>>>>>> batch is full. EVF will then handle the (surprisingly complex) task
> >>> of
> >>>>>>> finishing up the batch and returning it as the output of the Scan
> >>> operator.
> >>>>>>>
> >>>>>>> - Paul
> >>>>>>>
> >>>>>>> On Wed, Oct 11, 2023 at 6:30 PM Mike Beckerle <
> mbeckerle@apache.org
> >>> <ma...@apache.org>>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Daffodil parsing generates event callbacks to an InfosetOutputter,
> >>> which
> >>>>>>> is
> >>>>>>>> analogous to a SAX event handler.
> >>>>>>>>
> >>>>>>>> Drill is expecting an iterator style of calling next() to advance
> >>> through
> >>>>>>>> the input, i.e., Drill has the control thread and expects to do
> pull
> >>>>>>>> parsing. At least from the code I studied in the format-xml
> contrib.
> >>>>>>>>
> >>>>>>>> Is there any alternative? Before I dig into creating another one
> of
> >>> these
> >>>>>>>> co-routine-style control inversions (which have proven to be
> >>> problematic
> >>>>>>>> for performance.
> >>>
> >>>
>

Re: Drill TupleMetadata created from DFDL Schema - how do I inform Drill about it

Posted by Charles Givre <cg...@gmail.com>.
Got it.  I’ll review today and tomorrow and hopefully we can get you unblocked.  
Sent from my iPhone

> On Oct 18, 2023, at 18:01, Mike Beckerle <mb...@apache.org> wrote:
> 
> I am very much hoping someone will look at my open PR soon.
> https://github.com/apache/drill/pull/2836
> 
> I am basically blocked on this effort until you help me with one key area
> of that.
> 
> I expect the part I am puzzling over is routine to you, so it will save me
> much effort.
> 
> This is the key area in the DaffodilBatchReader.java code:
> 
>  // FIXME: Next, a MIRACLE occurs.
>  //
>  // We get the dfdlSchemaURI filled in from the query, or a default config
> location
>  // We get the rootName (or null if not supplied) from the query, or a
> default config location
>  // We get the rootNamespace (or null if not supplied) from the query, or
> a default config location
>  // We get the validationMode (true/false) filled in from the query or a
> default config location
>  // We get the dataInputURI filled in from the query, or from a default
> config location
>  //
>  // For a first cut, let's just fake it. :-)
>  boolean validationMode = true;
>  URI dfdlSchemaURI = new URI("schema/complexArray1.dfdl.xsd");
>  String rootName = null;
>  String rootNamespace = null;
>  URI dataInputURI = new URI("data/complexArray1.dat");
> 
> 
> I imagine this is just a few lines of code to grab these from the query,
> and i don't even care about config files for now.
> 
> I gave up on trying to figure out how to do this myself. It was actually
> quite unclear from looking at the other format plugins. The way Drill does
> configuration is obviously motivated by the distributed architecture
> combined with pluggability, but all that combined with the negotiation over
> schemas which extends into runtime, and it all became quite muddy to me. I
> think what I need is super straightforward, so i figured I should just
> ask.
> 
> This is just to get enough working (against local files only) that I can be
> unblocked on creating and testing the rest of the Daffodil-to-Drill
> metadata bridge and data bridge.
> 
> My plan is to get all kinds of data and queries working first but just
> against local-only files.  Fixing it to work in distributed Drill can come
> later.
> 
> -mikeb
> 
>> On Wed, Oct 18, 2023 at 2:11 PM Paul Rogers <pa...@gmail.com> wrote:
>> 
>> Hi Charles,
>> 
>> The persistent store is just ZooKeeper, and ZK is known to work poorly as
>> a distributed DB. ZK works great for things like tokens, node registrations
>> and the like. But, ZK scales very poorly for things like schemas (or query
>> profiles or a list of active queries.)
>> 
>> A more scalable approach may be to cache the schemas in each Drillbit,
>> then translate them to Drill's format and include them in each Scan
>> operator definition sent to each execution Drillbit. That solution avoids
>> race conditions when the schemas change while a query is in flight. This
>> is, in fact, the model used for storage plugin definitions. (The storage
>> plugin definitions are, in fact, stored in ZK, but tend to be small and few
>> in number.)
>> 
>> - Paul
>> 
>> 
>>> On Wed, Oct 18, 2023 at 7:51 AM Charles Givre <cg...@gmail.com> wrote:
>>> 
>>> Hi Mike,
>>> I hope all is well.  I remembered one other piece which might be useful
>>> for you.  Drill has an interface called a PersistentStore which is used for
>>> storing artifacts such as tokens etc.  I've used it on two occasions: in
>>> the GoogleSheets plugin and the Http plugin.  In both cases, I used it to
>>> store OAuth user tokens which need to be preserved and shared across
>>> drillbits, and also frequently updated.  I was thinking that this might be
>>> useful for caching the DFDL schemata.  If you take a look here:
>>> https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/oauth/AccessTokenRepository.java,
>>> 
>>> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/oauth.
>>> and here
>>> https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/HttpStoragePlugin.java,
>>> you can see how I used that.
>>> 
>>> Best,
>>> -- C
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>>> On Oct 13, 2023, at 1:25 PM, Mike Beckerle <mb...@apache.org>
>>> wrote:
>>>> 
>>>> Very helpful.
>>>> 
>>>> Answers to your questions, and comments are below:
>>>> 
>>>> On Thu, Oct 12, 2023 at 5:14 PM Charles Givre <cgivre@gmail.com
>>> <ma...@gmail.com>> wrote:
>>>>> HI Mike,
>>>>> I hope all is well.  I'll take a stab at answering your questions.
>>> But I have a few questions as well:
>>>>> 
>>>>> 1.  Are you writing a storage or format plugin for DFDL?  My thinking
>>> was that this would be a format plugin, but let me know if you were
>>> thinking differently
>>>> 
>>>> Format plugin.
>>>> 
>>>>> 2.  In traditional deployments, where do people store the DFDL
>>> schemata files?  Are they local or accessible via URL?
>>>> 
>>>> Schemas are stored in files, or in jar files created when packaging a
>>> schema project. Hence URI is the preferred identifier for them.  They are
>>> not retrieved remotely or anything like that. It's a matter of whether they
>>> are in jars on the classpath, directories on the classpath, or just a file
>>> location.
>>>> 
>>>> The source-code of DFDL schemas are often created using other schemas
>>> as components, so a single "DFDL schema" may have parts that come from 5
>>> jar files on the classpath e.g., 2 different header schemas, a library
>>> schema, and the "main" schema that assembles them all.  Inside schemas they
>>> refer to each other via xs:include or xs:import, and the schemaLocation
>>> attribute takes a URI to the location of the included/imported schema and
>>> those URIs are interpreted this same way we would want Drill to identify
>>> the location of a schema.
>>>> 
>>>> However, really people will want to pre-compile any real non-toy/test
>>> DFDL schemas into binary ".bin" files for faster loading. Otherwise
>>> Daffodil schema compilation time can be excessive (minutes for large DFDL
>>> schemas - for example the DFDL schema for VMF is 180K lines of DFDL).
>>> Compiled schemas live in exactly 1 file (relatively small. The compiled
>>> form of VMF schema is 8Mbytes). So the path given for schema in Drill sql
>>> query, or in the config wants to be allowed to be either a compiled schema
>>> or a source-code schema (.xsd) this latter mostly being for test, training,
>>> and toy examples that we would compile on-the-fly.
>>>> 
>>>>> To get the DFDL schema file or URL we have a few options, all of which
>>> revolve around setting a config variable.  For now, let's just say that the
>>> schema file is contained in the same folder as the data.  (We can make this
>>> more sophisticated later...)
>>>> 
>>>> It would make life difficult if the schemas and test data must be
>>> co-resident. Most schema projects have these in entirely separate
>>> sub-trees. Schema will be under src/main/resources/..../xsd, compiled
>>> schema would be under target/... and test data under
>>> src/test/resources/.../data
>>>> 
>>>> For now I think the easiest thing is just we get two URIs. One is for
>>> the data, one is for the schema. We access them via
>>> getClass().getResource().
>>>> 
>>>> We should not worry about caching or anything for now. Once the above
>>> works for a decent scope of tests we can worry about making it more
>>> convenient to have a library of schemas at one's disposal.
>>>> 
>>>>> 
>>>>> Here's what you have to do.
>>>>> 
>>>>> 1.  In the formatConfig file, define a String called 'dfdlSchema'.
>>> Note... config variables must be private and final.  If they aren't it can
>>> cause weird errors that are really difficult to debug.  For some reference,
>>> take a look at the Excel plugin.  (
>>> https://github.com/apache/drill/blob/master/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelFormatConfig.java
>>> )
>>>>> 
>>>>> Setting a config variable there will allow a user to set a global
>>> schema definition.  This can also be configured individually for various
>>> workspaces.  So let's say you had PCAP files in one workspace, you could
>>> globally set the DFDL file for that and then another workspace which has
>>> some other file, you could create another DFDL plugin instance for that.
>>>> 
>>>> Ok, so the above lets me play with Drill and one schema by default. Ok
>>> for using Drill to explore data, and useful for testing.
>>>> 
>>>>> 
>>>>> Now, this is all fine and good, but a user might also want to define
>>> the schema file at query time.  The good news is that Drill allows you to
>>> do that via the table() function.
>>>>> 
>>>> 
>>>> This would allow real data-integration queries against multiple
>>> different DFDL-described data sources. Needed for a compelling demo.
>>>> 
>>>>> So let's say that we want to use a different schema file than the
>>> default, we could do something like this:
>>>>> 
>>>>> SELECT ....
>>>>> FROM table(dfs.dfdl_workspace.`myfile` (type=>'dfdl',
>>> dfdlSchema=>'path_to_schema')
>>>>> 
>>>>> Take a look at the Excel docs (
>>> https://github.com/apache/drill/blob/master/contrib/format-excel/README.md)
>>> which demonstrate how to write queries like that.  I believe that the
>>> parameters in the table function take higher precedence than the parameters
>>> from the config.  That would make sense at least.
>>>>> 
>>>> 
>>>> Perfect. I'll start with this.
>>>> 
>>>>> 
>>>>> 2.  Now that we have the schema file, the next thing would be to
>>> convert that into a Drill schema.  Let's say that we have a function called
>>> dfdlToDrill that handles the conversion.
>>>>> 
>>>>> What you'd have to do is in the constructor for the BatchReader, you'd
>>> have to set the schema there.  So pseudo code:
>>>>> 
>>>>> public DFDLBatchReader(DFDLReaderConfig, EasySubScan scan,
>>> FileSchemaNegotiator negotiator) {
>>>>>     // Other stuff...
>>>>> 
>>>>>     // Get Drill schema from DFDL
>>>>>     TupleMetadata schema = dfdlToDrill(<dfdl schema file>);
>>>>> 
>>>>>     // Here's the important part
>>>>>     negotiator.tableSchema(schema, true);
>>>>> }
>>>>> 
>>>>> The negotiator.tableSchema() accepts two args, a TupleMetadata and a
>>> boolean as to whether the schema is final or not.  Once this schema has
>>> been added to the negotiator object, you can then create the writers.
>>>>> 
>>>> 
>>>> That negotiator.tableSchema() is ideal. I was hoping that this was
>>> going to be the only place the metadata had to be given to drill.
>>> Excellent.
>>>> 
>>>>> 
>>>>> Take a look here...
>>>>> 
>>>>> 
>>>>> 
>>> drill/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
>>> at 2ab46a9411a52f12a0f9acb1144a318059439bc4 · apache/drill
>>>>> github.com
>>>>> <
>>> https://github.com/apache/drill/blob/2ab46a9411a52f12a0f9acb1144a318059439bc4/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java#L199>drill/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
>>> at 2ab46a9411a52f12a0f9acb1144a318059439bc4 · apache/drill <
>>> https://github.com/apache/drill/blob/2ab46a9411a52f12a0f9acb1144a318059439bc4/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java#L199
>>>> 
>>>>> github.com <
>>> https://github.com/apache/drill/blob/2ab46a9411a52f12a0f9acb1144a318059439bc4/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java#L199
>>>> 
>>>>> 
>>>>> 
>>>>> I see Paul just responded so I'll leave you with this.  If you have
>>> additional questions, send them our way.  Do take a look at the Excel
>>> plugin as I think it will be helpful.
>>>>> 
>>>> Yes, I've found the JsonLoaderImpl.readBatch() method, and Daffodil can
>>> work similarly.
>>>> 
>>>> This will take me a few more days to get to a pull request. The first
>>> one will be initial review, i.e., not intended to merge without more tests.
>>> Probably it will support only integer data fields, but should support lots
>>> of data shapes including vectors, choices, sequences, nested records, etc.
>>>> 
>>>> Thanks for the help.
>>>> 
>>>>> 
>>>>>> On Oct 12, 2023, at 2:58 PM, Mike Beckerle <mbeckerle@apache.org
>>> <ma...@apache.org>> wrote:
>>>>>> 
>>>>>> So when a data format is described by a DFDL schema, I can generate
>>>>>> equivalent Drill schema (TupleMetadata). This schema is always
>>> complete. I
>>>>>> have unit tests working with this.
>>>>>> 
>>>>>> To do this for a real SQL query, I need the DFDL schema to be
>>> identified on
>>>>>> the SQL query by a file path or URI.
>>>>>> 
>>>>>> Q: How do I get that DFDL schema File/URI parameter from the SQL
>>> query?
>>>>>> 
>>>>>> Next, assuming I have the DFDL schema identified, I generate an
>>> equivalent
>>>>>> Drill TupleMetadata from it. (Or, hopefully retrieve it from a cache)
>>>>>> 
>>>>>> What objects do I call, or what classes do I have to create to make
>>> this
>>>>>> Drill TupleMetadata available to Drill so it uses it in all the ways a
>>>>>> static Drill schema can be useful?
>>>>>> 
>>>>>> I just need pointers to the code that illustrate how to do this.
>>> Thanks
>>>>>> 
>>>>>> -Mike Beckerle
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Thu, Oct 12, 2023 at 12:13 AM Paul Rogers <par0328@gmail.com
>>> <ma...@gmail.com>> wrote:
>>>>>> 
>>>>>>> Mike,
>>>>>>> 
>>>>>>> This is a complex question and has two answers.
>>>>>>> 
>>>>>>> First, the standard enhanced vector framework (EVF) used by most
>>> readers
>>>>>>> assumes a "pull" model: read each record. This is where the next()
>>> comes
>>>>>>> in: readers just implement this to read the next record. But, the
>>> code
>>>>>>> under EVF works with a push model: the readers write to vectors, and
>>> signal
>>>>>>> the next record. EVF translates the lower-level push model to the
>>>>>>> higher-level, easier-to-use pull model. The best example of this is
>>> the
>>>>>>> JSON reader which uses Jackson to parse JSON and responds to the
>>>>>>> corresponding events.
>>>>>>> 
>>>>>>> You can thus take over the task of filling a batch of records. I'd
>>> have to
>>>>>>> poke around the code to refresh my memory. Or, you can take a look
>>> at the
>>>>>>> (quite complex) JSON parser, or the EVF itself to see what it does.
>>> There
>>>>>>> are many unit tests that show this at various levels of abstraction.
>>>>>>> 
>>>>>>> Basically, you have to:
>>>>>>> 
>>>>>>> * Start a batch
>>>>>>> * Ask if you can start the next record (which might be declined if
>>> the
>>>>>>> batch is full)
>>>>>>> * Write each field. For complex fields, such as records, recursively
>>> do the
>>>>>>> start/end record work.
>>>>>>> * Mark the record as complete.
>>>>>>> 
>>>>>>> You should be able to map event handlers to EVF actions as a result.
>>> Even
>>>>>>> though DFDL wants to "drive", it still has to give up control once
>>> the
>>>>>>> batch is full. EVF will then handle the (surprisingly complex) task
>>> of
>>>>>>> finishing up the batch and returning it as the output of the Scan
>>> operator.
>>>>>>> 
>>>>>>> - Paul
>>>>>>> 
>>>>>>> On Wed, Oct 11, 2023 at 6:30 PM Mike Beckerle <mbeckerle@apache.org
>>> <ma...@apache.org>>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Daffodil parsing generates event callbacks to an InfosetOutputter,
>>> which
>>>>>>> is
>>>>>>>> analogous to a SAX event handler.
>>>>>>>> 
>>>>>>>> Drill is expecting an iterator style of calling next() to advance
>>> through
>>>>>>>> the input, i.e., Drill has the control thread and expects to do pull
>>>>>>>> parsing. At least from the code I studied in the format-xml contrib.
>>>>>>>> 
>>>>>>>> Is there any alternative? Before I dig into creating another one of
>>> these
>>>>>>>> co-routine-style control inversions (which have proven to be
>>> problematic
>>>>>>>> for performance.
>>> 
>>> 

Re: Drill TupleMetadata created from DFDL Schema - how do I inform Drill about it

Posted by Mike Beckerle <mb...@apache.org>.
I am very much hoping someone will look at my open PR soon.
https://github.com/apache/drill/pull/2836

I am basically blocked on this effort until you help me with one key area
of that.

I expect the part I am puzzling over is routine to you, so it will save me
much effort.

This is the key area in the DaffodilBatchReader.java code:

  // FIXME: Next, a MIRACLE occurs.
  //
  // We get the dfdlSchemaURI filled in from the query, or a default config
location
  // We get the rootName (or null if not supplied) from the query, or a
default config location
  // We get the rootNamespace (or null if not supplied) from the query, or
a default config location
  // We get the validationMode (true/false) filled in from the query or a
default config location
  // We get the dataInputURI filled in from the query, or from a default
config location
  //
  // For a first cut, let's just fake it. :-)
  boolean validationMode = true;
  URI dfdlSchemaURI = new URI("schema/complexArray1.dfdl.xsd");
  String rootName = null;
  String rootNamespace = null;
  URI dataInputURI = new URI("data/complexArray1.dat");


I imagine this is just a few lines of code to grab these from the query,
and I don't even care about config files for now.
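
Based on Charles' earlier suggestion to hang these off the format config (with
the table() function overriding them per query), I'm guessing the shape of the
answer is roughly the following, where every name is a placeholder:

  // Guess, not working code: pull the values from the format plugin config that
  // the reader already has access to. The planner rebuilds that config when a
  // table() function overrides properties, so per-query values arrive the same way.
  DaffodilFormatConfig config = readerConfig.plugin.getConfig();

  boolean validationMode = config.getValidationMode();
  URI dfdlSchemaURI = new URI(config.getDfdlSchemaURI());
  String rootName = config.getRootName();            // may be null
  String rootNamespace = config.getRootNamespace();  // may be null

  // dataInputURI would instead come from the file split EVF hands this reader,
  // so it would not need to be a config property at all.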

I gave up on trying to figure out how to do this myself. It was actually
quite unclear from looking at the other format plugins. The way Drill does
configuration is obviously motivated by the distributed architecture
combined with pluggability, but all that, combined with the negotiation over
schemas that extends into runtime, became quite muddy to me. I
think what I need is super straightforward, so I figured I should just
ask.

This is just to get enough working (against local files only) that I can be
unblocked on creating and testing the rest of the Daffodil-to-Drill
metadata bridge and data bridge.

My plan is to get all kinds of data and queries working first but just
against local-only files.  Fixing it to work in distributed Drill can come
later.

-mikeb

On Wed, Oct 18, 2023 at 2:11 PM Paul Rogers <pa...@gmail.com> wrote:

> Hi Charles,
>
> The persistent store is just ZooKeeper, and ZK is known to work poorly as
> a distributed DB. ZK works great for things like tokens, node registrations
> and the like. But, ZK scales very poorly for things like schemas (or query
> profiles or a list of active queries.)
>
> A more scalable approach may be to cache the schemas in each Drillbit,
> then translate them to Drill's format and include them in each Scan
> operator definition sent to each execution Drillbit. That solution avoids
> race conditions when the schemas change while a query is in flight. This
> is, in fact, the model used for storage plugin definitions. (The storage
> plugin definitions are, in fact, stored in ZK, but tend to be small and few
> in number.)
>
> - Paul
>
>
> On Wed, Oct 18, 2023 at 7:51 AM Charles Givre <cg...@gmail.com> wrote:
>
>> Hi Mike,
>> I hope all is well.  I remembered one other piece which might be useful
>> for you.  Drill has an interface called a PersistentStore which is used for
>> storing artifacts such as tokens etc.  I've used it on two occasions: in
>> the GoogleSheets plugin and the Http plugin.  In both cases, I used it to
>> store OAuth user tokens which need to be preserved and shared across
>> drillbits, and also frequently updated.  I was thinking that this might be
>> useful for caching the DFDL schemata.  If you take a look here:
>> https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/oauth/AccessTokenRepository.java,
>>
>> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/oauth.
>> and here
>> https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/HttpStoragePlugin.java,
>> you can see how I used that.
>>
>> Best,
>> -- C
>>
>>
>>
>>
>>
>>
>> > On Oct 13, 2023, at 1:25 PM, Mike Beckerle <mb...@apache.org>
>> wrote:
>> >
>> > Very helpful.
>> >
>> > Answers to your questions, and comments are below:
>> >
>> > On Thu, Oct 12, 2023 at 5:14 PM Charles Givre <cgivre@gmail.com
>> <ma...@gmail.com>> wrote:
>> >> HI Mike,
>> >> I hope all is well.  I'll take a stab at answering your questions.
>> But I have a few questions as well:
>> >>
>> >> 1.  Are you writing a storage or format plugin for DFDL?  My thinking
>> was that this would be a format plugin, but let me know if you were
>> thinking differently
>> >
>> > Format plugin.
>> >
>> >> 2.  In traditional deployments, where do people store the DFDL
>> schemata files?  Are they local or accessible via URL?
>> >
>> > Schemas are stored in files, or in jar files created when packaging a
>> schema project. Hence URI is the preferred identifier for them.  They are
>> not retrieved remotely or anything like that. It's a matter of whether they
>> are in jars on the classpath, directories on the classpath, or just a file
>> location.
>> >
>> > The source-code of DFDL schemas are often created using other schemas
>> as components, so a single "DFDL schema" may have parts that come from 5
>> jar files on the classpath e.g., 2 different header schemas, a library
>> schema, and the "main" schema that assembles them all.  Inside schemas they
>> refer to each other via xs:include or xs:import, and the schemaLocation
>> attribute takes a URI to the location of the included/imported schema and
>> those URIs are interpreted this same way we would want Drill to identify
>> the location of a schema.
>> >
>> > However, really people will want to pre-compile any real non-toy/test
>> DFDL schemas into binary ".bin" files for faster loading. Otherwise
>> Daffodil schema compilation time can be excessive (minutes for large DFDL
>> schemas - for example the DFDL schema for VMF is 180K lines of DFDL).
>> Compiled schemas live in exactly 1 file (relatively small. The compiled
>> form of VMF schema is 8Mbytes). So the path given for schema in Drill sql
>> query, or in the config wants to be allowed to be either a compiled schema
>> or a source-code schema (.xsd) this latter mostly being for test, training,
>> and toy examples that we would compile on-the-fly.
>> >
>> >> To get the DFDL schema file or URL we have a few options, all of which
>> revolve around setting a config variable.  For now, let's just say that the
>> schema file is contained in the same folder as the data.  (We can make this
>> more sophisticated later...)
>> >
>> > It would make life difficult if the schemas and test data must be
>> co-resident. Most schema projects have these in entirely separate
>> sub-trees. Schema will be under src/main/resources/..../xsd, compiled
>> schema would be under target/... and test data under
>> src/test/resources/.../data
>> >
>> > For now I think the easiest thing is just we get two URIs. One is for
>> the data, one is for the schema. We access them via
>> getClass().getResource().
>> >
>> > We should not worry about caching or anything for now. Once the above
>> works for a decent scope of tests we can worry about making it more
>> convenient to have a library of schemas at one's disposal.
>> >
>> >>
>> >> Here's what you have to do.
>> >>
>> >> 1.  In the formatConfig file, define a String called 'dfdlSchema'.
>>  Note... config variables must be private and final.  If they aren't it can
>> cause weird errors that are really difficult to debug.  For some reference,
>> take a look at the Excel plugin.  (
>> https://github.com/apache/drill/blob/master/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelFormatConfig.java
>> )
>> >>
>> >> Setting a config variable there will allow a user to set a global
>> schema definition.  This can also be configured individually for various
>> workspaces.  So let's say you had PCAP files in one workspace, you could
>> globally set the DFDL file for that and then another workspace which has
>> some other file, you could create another DFDL plugin instance for that.
>> >
>> > Ok, so the above lets me play with Drill and one schema by default. Ok
>> for using Drill to explore data, and useful for testing.
>> >
>> >>
>> >> Now, this is all fine and good, but a user might also want to define
>> the schema file at query time.  The good news is that Drill allows you to
>> do that via the table() function.
>> >>
>> >
>> > This would allow real data-integration queries against multiple
>> different DFDL-described data sources. Needed for a compelling demo.
>> >
>> >> So let's say that we want to use a different schema file than the
>> default, we could do something like this:
>> >>
>> >> SELECT ....
>> >> FROM table(dfs.dfdl_workspace.`myfile` (type=>'dfdl',
>> dfdlSchema=>'path_to_schema')
>> >>
>> >> Take a look at the Excel docs (
>> https://github.com/apache/drill/blob/master/contrib/format-excel/README.md)
>> which demonstrate how to write queries like that.  I believe that the
>> parameters in the table function take higher precedence than the parameters
>> from the config.  That would make sense at least.
>> >>
>> >
>> > Perfect. I'll start with this.
>> >
>> >>
>> >> 2.  Now that we have the schema file, the next thing would be to
>> convert that into a Drill schema.  Let's say that we have a function called
>> dfdlToDrill that handles the conversion.
>> >>
>> >> What you'd have to do is in the constructor for the BatchReader, you'd
>> have to set the schema there.  So pseudo code:
>> >>
>> >> public DFDLBatchReader(DFDLReaderConfig, EasySubScan scan,
>> FileSchemaNegotiator negotiator) {
>> >>      // Other stuff...
>> >>
>> >>      // Get Drill schema from DFDL
>> >>      TupleMetadata schema = dfldToDrill(<dfdl schema file);
>> >>
>> >>      // Here's the important part
>> >>      negotiator.tableSchema(schema, true);
>> >> }
>> >>
>> >> The negotiator.tableSchema() accepts two args, a TupleMetadata and a
>> boolean as to whether the schema is final or not.  Once this schema has
>> been added to the negotiator object, you can then create the writers.
>> >>
>> >
>> > That negotiator.tableSchema() is ideal. I was hoping that this was
>> going to be the only place the metadata had to be given to drill.
>> Excellent.
>> >
>> >>
>> >> Take a look here...
>> >>
>> >>
>> >> https://github.com/apache/drill/blob/2ab46a9411a52f12a0f9acb1144a318059439bc4/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java#L199
>> >>
>> >>
>> >> I see Paul just responded so I'll leave you with this.  If you have
>> additional questions, send them our way.  Do take a look at the Excel
>> plugin as I think it will be helpful.
>> >>
>> > Yes, I've found the JsonLoaderImpl.readBatch() method, and Daffodil can
>> work similarly.
>> >
>> > This will take me a few more days to get to a pull request. The first
>> one will be initial review, i.e., not intended to merge without more tests.
>> Probably it will support only integer data fields, but should support lots
>> of data shapes including vectors, choices, sequences, nested records, etc.
>> >
>> > Thanks for the help.
>> >
>> >>
>> >>> On Oct 12, 2023, at 2:58 PM, Mike Beckerle <mbeckerle@apache.org
>> <ma...@apache.org>> wrote:
>> >>>
>> >>> So when a data format is described by a DFDL schema, I can generate
>> >>> equivalent Drill schema (TupleMetadata). This schema is always
>> complete. I
>> >>> have unit tests working with this.
>> >>>
>> >>> To do this for a real SQL query, I need the DFDL schema to be
>> identified on
>> >>> the SQL query by a file path or URI.
>> >>>
>> >>> Q: How do I get that DFDL schema File/URI parameter from the SQL
>> query?
>> >>>
>> >>> Next, assuming I have the DFDL schema identified, I generate an
>> equivalent
>> >>> Drill TupleMetadata from it. (Or, hopefully retrieve it from a cache)
>> >>>
>> >>> What objects do I call, or what classes do I have to create to make
>> this
>> >>> Drill TupleMetadata available to Drill so it uses it in all the ways a
>> >>> static Drill schema can be useful?
>> >>>
>> >>> I just need pointers to the code that illustrate how to do this.
>> Thanks
>> >>>
>> >>> -Mike Beckerle
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Thu, Oct 12, 2023 at 12:13 AM Paul Rogers <par0328@gmail.com
>> <ma...@gmail.com>> wrote:
>> >>>
>> >>>> Mike,
>> >>>>
>> >>>> This is a complex question and has two answers.
>> >>>>
>> >>>> First, the standard enhanced vector framework (EVF) used by most
>> readers
>> >>>> assumes a "pull" model: read each record. This is where the next()
>> comes
>> >>>> in: readers just implement this to read the next record. But, the
>> code
>> >>>> under EVF works with a push model: the readers write to vectors, and
>> signal
>> >>>> the next record. EVF translates the lower-level push model to the
>> >>>> higher-level, easier-to-use pull model. The best example of this is
>> the
>> >>>> JSON reader which uses Jackson to parse JSON and responds to the
>> >>>> corresponding events.
>> >>>>
>> >>>> You can thus take over the task of filling a batch of records. I'd
>> have to
>> >>>> poke around the code to refresh my memory. Or, you can take a look
>> at the
>> >>>> (quite complex) JSON parser, or the EVF itself to see what it does.
>> There
>> >>>> are many unit tests that show this at various levels of abstraction.
>> >>>>
>> >>>> Basically, you have to:
>> >>>>
>> >>>> * Start a batch
>> >>>> * Ask if you can start the next record (which might be declined if
>> the
>> >>>> batch is full)
>> >>>> * Write each field. For complex fields, such as records, recursively
>> do the
>> >>>> start/end record work.
>> >>>> * Mark the record as complete.
>> >>>>
>> >>>> You should be able to map event handlers to EVF actions as a result.
>> Even
>> >>>> though DFDL wants to "drive", it still has to give up control once
>> the
>> >>>> batch is full. EVF will then handle the (surprisingly complex) task
>> of
>> >>>> finishing up the batch and returning it as the output of the Scan
>> operator.
>> >>>>
>> >>>> - Paul
>> >>>>
>> >>>> On Wed, Oct 11, 2023 at 6:30 PM Mike Beckerle <mbeckerle@apache.org
>> <ma...@apache.org>>
>> >>>> wrote:
>> >>>>
>> >>>>> Daffodil parsing generates event callbacks to an InfosetOutputter,
>> which
>> >>>> is
>> >>>>> analogous to a SAX event handler.
>> >>>>>
>> >>>>> Drill is expecting an iterator style of calling next() to advance
>> through
>> >>>>> the input, i.e., Drill has the control thread and expects to do pull
>> >>>>> parsing. At least from the code I studied in the format-xml contrib.
>> >>>>>
>> >>>>> Is there any alternative? Before I dig into creating another one of
>> these
>> >>>>> co-routine-style control inversions (which have proven to be
>> problematic
>> >>>>> for performance.
>>
>>

Re: Drill TupleMetadata created from DFDL Schema - how do I inform Drill about it

Posted by Paul Rogers <pa...@gmail.com>.
Hi Charles,

The persistent store is just ZooKeeper, and ZK is known to work poorly as a
distributed DB. ZK works great for things like tokens, node registrations
and the like. But ZK scales very poorly for things like schemas (or query
profiles, or a list of active queries).

A more scalable approach may be to cache the schemas in each Drillbit, then
translate them to Drill's format and include them in each Scan operator
definition sent to each execution Drillbit. That solution avoids race
conditions when the schemas change while a query is in flight. This is, in
fact, the model used for storage plugin definitions. (Storage plugin
definitions are themselves stored in ZK, but they tend to be small and few in
number.)
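
A minimal sketch of that per-Drillbit idea (all the names below are invented,
just to illustrate a local cache keyed by schema URI; they are not existing
Drill classes):

import java.net.URI;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.drill.exec.record.metadata.TupleMetadata;

public class DrillbitLocalSchemaCache {
  private final ConcurrentHashMap<URI, TupleMetadata> cache = new ConcurrentHashMap<>();

  // Translate a DFDL schema once per Drillbit, then reuse the result for
  // every fragment of every query that references the same schema URI.
  public TupleMetadata get(URI dfdlSchemaUri) {
    return cache.computeIfAbsent(dfdlSchemaUri, this::dfdlToDrill);
  }

  private TupleMetadata dfdlToDrill(URI uri) {
    // Placeholder for the DFDL-to-Drill metadata conversion.
    throw new UnsupportedOperationException("not part of this sketch");
  }
}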

- Paul


On Wed, Oct 18, 2023 at 7:51 AM Charles Givre <cg...@gmail.com> wrote:

> Hi Mike,
> I hope all is well.  I remembered one other piece which might be useful
> for you.  Drill has an interface called a PersistentStore which is used for
> storing artifacts such as tokens etc.  I've uesd it on two occasions: in
> the GoogleSheets plugin and the Http plugin.  In both cases, I used it to
> store OAuth user tokens which need to be preserved and shared across
> drillbits, and also frequently updated.  I was thinking that this might be
> useful for caching the DFDL schemata.  If you take a look here:
> https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/oauth/AccessTokenRepository.java,
>
> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/oauth.
> and here
> https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/HttpStoragePlugin.java,
> you can see how I used that.
>
> Best,
> -- C
>
>
>
>
>
>
> > On Oct 13, 2023, at 1:25 PM, Mike Beckerle <mb...@apache.org> wrote:
> >
> > Very helpful.
> >
> > Answers to your questions, and comments are below:
> >
> > On Thu, Oct 12, 2023 at 5:14 PM Charles Givre <cgivre@gmail.com <mailto:
> cgivre@gmail.com>> wrote:
> >> HI Mike,
> >> I hope all is well.  I'll take a stab at answering your questions.  But
> I have a few questions as well:
> >>
> >> 1.  Are you writing a storage or format plugin for DFDL?  My thinking
> was that this would be a format plugin, but let me know if you were
> thinking differently
> >
> > Format plugin.
> >
> >> 2.  In traditional deployments, where do people store the DFDL schemata
> files?  Are they local or accessible via URL?
> >
> > Schemas are stored in files, or in jar files created when packaging a
> schema project. Hence URI is the preferred identifier for them.  They are
> not retrieved remotely or anything like that. It's a matter of whether they
> are in jars on the classpath, directories on the classpath, or just a file
> location.
> >
> > The source-code of DFDL schemas are often created using other schemas as
> components, so a single "DFDL schema" may have parts that come from 5 jar
> files on the classpath e.g., 2 different header schemas, a library schema,
> and the "main" schema that assembles them all.  Inside schemas they refer
> to each other via xs:include or xs:import, and the schemaLocation attribute
> takes a URI to the location of the included/imported schema and those URIs
> are interpreted this same way we would want Drill to identify the location
> of a schema.
> >
> > However, really people will want to pre-compile any real non-toy/test
> DFDL schemas into binary ".bin" files for faster loading. Otherwise
> Daffodil schema compilation time can be excessive (minutes for large DFDL
> schemas - for example the DFDL schema for VMF is 180K lines of DFDL).
> Compiled schemas live in exactly 1 file (relatively small. The compiled
> form of VMF schema is 8Mbytes). So the path given for schema in Drill sql
> query, or in the config wants to be allowed to be either a compiled schema
> or a source-code schema (.xsd) this latter mostly being for test, training,
> and toy examples that we would compile on-the-fly.
> >
> >> To get the DFDL schema file or URL we have a few options, all of which
> revolve around setting a config variable.  For now, let's just say that the
> schema file is contained in the same folder as the data.  (We can make this
> more sophisticated later...)
> >
> > It would make life difficult if the schemas and test data must be
> co-resident. Most schema projects have these in entirely separate
> sub-trees. Schema will be under src/main/resources/..../xsd, compiled
> schema would be under target/... and test data under
> src/test/resources/.../data
> >
> > For now I think the easiest thing is just we get two URIs. One is for
> the data, one is for the schema. We access them via
> getClass().getResource().
> >
> > We should not worry about caching or anything for now. Once the above
> works for a decent scope of tests we can worry about making it more
> convenient to have a library of schemas at one's disposal.
> >
> >>
> >> Here's what you have to do.
> >>
> >> 1.  In the formatConfig file, define a String called 'dfdlSchema'.
>  Note... config variables must be private and final.  If they aren't it can
> cause weird errors that are really difficult to debug.  For some reference,
> take a look at the Excel plugin.  (
> https://github.com/apache/drill/blob/master/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelFormatConfig.java
> )
> >>
> >> Setting a config variable there will allow a user to set a global
> schema definition.  This can also be configured individually for various
> workspaces.  So let's say you had PCAP files in one workspace, you could
> globally set the DFDL file for that and then another workspace which has
> some other file, you could create another DFDL plugin instance for that.
> >
> > Ok, so the above lets me play with Drill and one schema by default. Ok
> for using Drill to explore data, and useful for testing.
> >
> >>
> >> Now, this is all fine and good, but a user might also want to define
> the schema file at query time.  The good news is that Drill allows you to
> do that via the table() function.
> >>
> >
> > This would allow real data-integration queries against multiple
> different DFDL-described data sources. Needed for a compelling demo.
> >
> >> So let's say that we want to use a different schema file than the
> default, we could do something like this:
> >>
> >> SELECT ....
> >> FROM table(dfs.dfdl_workspace.`myfile` (type=>'dfdl',
> dfdlSchema=>'path_to_schema')
> >>
> >> Take a look at the Excel docs (
> https://github.com/apache/drill/blob/master/contrib/format-excel/README.md)
> which demonstrate how to write queries like that.  I believe that the
> parameters in the table function take higher precedence than the parameters
> from the config.  That would make sense at least.
> >>
> >
> > Perfect. I'll start with this.
> >
> >>
> >> 2.  Now that we have the schema file, the next thing would be to
> convert that into a Drill schema.  Let's say that we have a function called
> dfdlToDrill that handles the conversion.
> >>
> >> What you'd have to do is in the constructor for the BatchReader, you'd
> have to set the schema there.  So pseudo code:
> >>
> >> public DFDLBatchReader(DFDLReaderConfig, EasySubScan scan,
> FileSchemaNegotiator negotiator) {
> >>      // Other stuff...
> >>
> >>      // Get Drill schema from DFDL
> >>      TupleMetadata schema = dfldToDrill(<dfdl schema file);
> >>
> >>      // Here's the important part
> >>      negotiator.tableSchema(schema, true);
> >> }
> >>
> >> The negotiator.tableSchema() accepts two args, a TupleMetadata and a
> boolean as to whether the schema is final or not.  Once this schema has
> been added to the negotiator object, you can then create the writers.
> >>
> >
> > That negotiator.tableSchema() is ideal. I was hoping that this was going
> to be the only place the metadata had to be given to drill. Excellent.
> >
> >>
> >> Take a look here...
> >>
> >>
> >> https://github.com/apache/drill/blob/2ab46a9411a52f12a0f9acb1144a318059439bc4/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java#L199
> >>
> >>
> >>
> >> I see Paul just responded so I'll leave you with this.  If you have
> additional questions, send them our way.  Do take a look at the Excel
> plugin as I think it will be helpful.
> >>
> > Yes, I've found the JsonLoaderImpl.readBatch() method, and Daffodil can
> work similarly.
> >
> > This will take me a few more days to get to a pull request. The first
> one will be initial review, i.e., not intended to merge without more tests.
> Probably it will support only integer data fields, but should support lots
> of data shapes including vectors, choices, sequences, nested records, etc.
> >
> > Thanks for the help.
> >
> >>
> >>> On Oct 12, 2023, at 2:58 PM, Mike Beckerle <mbeckerle@apache.org
> <ma...@apache.org>> wrote:
> >>>
> >>> So when a data format is described by a DFDL schema, I can generate
> >>> equivalent Drill schema (TupleMetadata). This schema is always
> complete. I
> >>> have unit tests working with this.
> >>>
> >>> To do this for a real SQL query, I need the DFDL schema to be
> identified on
> >>> the SQL query by a file path or URI.
> >>>
> >>> Q: How do I get that DFDL schema File/URI parameter from the SQL query?
> >>>
> >>> Next, assuming I have the DFDL schema identified, I generate an
> equivalent
> >>> Drill TupleMetadata from it. (Or, hopefully retrieve it from a cache)
> >>>
> >>> What objects do I call, or what classes do I have to create to make
> this
> >>> Drill TupleMetadata available to Drill so it uses it in all the ways a
> >>> static Drill schema can be useful?
> >>>
> >>> I just need pointers to the code that illustrate how to do this. Thanks
> >>>
> >>> -Mike Beckerle
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Thu, Oct 12, 2023 at 12:13 AM Paul Rogers <par0328@gmail.com
> <ma...@gmail.com>> wrote:
> >>>
> >>>> Mike,
> >>>>
> >>>> This is a complex question and has two answers.
> >>>>
> >>>> First, the standard enhanced vector framework (EVF) used by most
> readers
> >>>> assumes a "pull" model: read each record. This is where the next()
> comes
> >>>> in: readers just implement this to read the next record. But, the code
> >>>> under EVF works with a push model: the readers write to vectors, and
> signal
> >>>> the next record. EVF translates the lower-level push model to the
> >>>> higher-level, easier-to-use pull model. The best example of this is
> the
> >>>> JSON reader which uses Jackson to parse JSON and responds to the
> >>>> corresponding events.
> >>>>
> >>>> You can thus take over the task of filling a batch of records. I'd
> have to
> >>>> poke around the code to refresh my memory. Or, you can take a look at
> the
> >>>> (quite complex) JSON parser, or the EVF itself to see what it does.
> There
> >>>> are many unit tests that show this at various levels of abstraction.
> >>>>
> >>>> Basically, you have to:
> >>>>
> >>>> * Start a batch
> >>>> * Ask if you can start the next record (which might be declined if the
> >>>> batch is full)
> >>>> * Write each field. For complex fields, such as records, recursively
> do the
> >>>> start/end record work.
> >>>> * Mark the record as complete.
> >>>>
> >>>> You should be able to map event handlers to EVF actions as a result.
> Even
> >>>> though DFDL wants to "drive", it still has to give up control once the
> >>>> batch is full. EVF will then handle the (surprisingly complex) task of
> >>>> finishing up the batch and returning it as the output of the Scan
> operator.
> >>>>
> >>>> - Paul
> >>>>
> >>>> On Wed, Oct 11, 2023 at 6:30 PM Mike Beckerle <mbeckerle@apache.org
> <ma...@apache.org>>
> >>>> wrote:
> >>>>
> >>>>> Daffodil parsing generates event callbacks to an InfosetOutputter,
> which
> >>>> is
> >>>>> analogous to a SAX event handler.
> >>>>>
> >>>>> Drill is expecting an iterator style of calling next() to advance
> through
> >>>>> the input, i.e., Drill has the control thread and expects to do pull
> >>>>> parsing. At least from the code I studied in the format-xml contrib.
> >>>>>
> >>>>> Is there any alternative? Before I dig into creating another one of
> these
> >>>>> co-routine-style control inversions (which have proven to be
> problematic
> >>>>> for performance.
>
>

Re: Drill TupleMetadata created from DFDL Schema - how do I inform Drill about it

Posted by Charles Givre <cg...@gmail.com>.
Hi Mike, 
I hope all is well.  I remembered one other piece which might be useful for you.  Drill has an interface called a PersistentStore which is used for storing artifacts such as tokens etc.  I've used it on two occasions: in the GoogleSheets plugin and the Http plugin.  In both cases, I used it to store OAuth user tokens which need to be preserved and shared across drillbits, and also frequently updated.  I was thinking that this might be useful for caching the DFDL schemata.  If you take a look here: https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/oauth/AccessTokenRepository.java, https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/oauth, and here https://github.com/apache/drill/blob/master/contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/HttpStoragePlugin.java, you can see how I used that.

Best,
-- C

  




> On Oct 13, 2023, at 1:25 PM, Mike Beckerle <mb...@apache.org> wrote:
> 
> Very helpful.
> 
> Answers to your questions, and comments are below:
> 
> On Thu, Oct 12, 2023 at 5:14 PM Charles Givre <cgivre@gmail.com <ma...@gmail.com>> wrote:
>> HI Mike, 
>> I hope all is well.  I'll take a stab at answering your questions.  But I have a few questions as well:
>>  
>> 1.  Are you writing a storage or format plugin for DFDL?  My thinking was that this would be a format plugin, but let me know if you were thinking differently
> 
> Format plugin.
>  
>> 2.  In traditional deployments, where do people store the DFDL schemata files?  Are they local or accessible via URL?
> 
> Schemas are stored in files, or in jar files created when packaging a schema project. Hence URI is the preferred identifier for them.  They are not retrieved remotely or anything like that. It's a matter of whether they are in jars on the classpath, directories on the classpath, or just a file location. 
> 
> The source-code of DFDL schemas are often created using other schemas as components, so a single "DFDL schema" may have parts that come from 5 jar files on the classpath e.g., 2 different header schemas, a library schema, and the "main" schema that assembles them all.  Inside schemas they refer to each other via xs:include or xs:import, and the schemaLocation attribute takes a URI to the location of the included/imported schema and those URIs are interpreted this same way we would want Drill to identify the location of a schema. 
> 
> However, really people will want to pre-compile any real non-toy/test DFDL schemas into binary ".bin" files for faster loading. Otherwise Daffodil schema compilation time can be excessive (minutes for large DFDL schemas - for example the DFDL schema for VMF is 180K lines of DFDL). Compiled schemas live in exactly 1 file (relatively small. The compiled form of VMF schema is 8Mbytes). So the path given for schema in Drill sql query, or in the config wants to be allowed to be either a compiled schema or a source-code schema (.xsd) this latter mostly being for test, training, and toy examples that we would compile on-the-fly.  
>  
>> To get the DFDL schema file or URL we have a few options, all of which revolve around setting a config variable.  For now, let's just say that the schema file is contained in the same folder as the data.  (We can make this more sophisticated later...)
> 
> It would make life difficult if the schemas and test data must be co-resident. Most schema projects have these in entirely separate sub-trees. Schema will be under src/main/resources/..../xsd, compiled schema would be under target/... and test data under src/test/resources/.../data
> 
> For now I think the easiest thing is just we get two URIs. One is for the data, one is for the schema. We access them via getClass().getResource(). 
> 
> We should not worry about caching or anything for now. Once the above works for a decent scope of tests we can worry about making it more convenient to have a library of schemas at one's disposal. 
>  
>> 
>> Here's what you have to do.
>> 
>> 1.  In the formatConfig file, define a String called 'dfdlSchema'.   Note... config variables must be private and final.  If they aren't it can cause weird errors that are really difficult to debug.  For some reference, take a look at the Excel plugin.  (https://github.com/apache/drill/blob/master/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelFormatConfig.java)
>> 
>> Setting a config variable there will allow a user to set a global schema definition.  This can also be configured individually for various workspaces.  So let's say you had PCAP files in one workspace, you could globally set the DFDL file for that and then another workspace which has some other file, you could create another DFDL plugin instance for that. 
> 
> Ok, so the above lets me play with Drill and one schema by default. Ok for using Drill to explore data, and useful for testing. 
>  
>> 
>> Now, this is all fine and good, but a user might also want to define the schema file at query time.  The good news is that Drill allows you to do that via the table() function. 
>> 
> 
> This would allow real data-integration queries against multiple different DFDL-described data sources. Needed for a compelling demo. 
>  
>> So let's say that we want to use a different schema file than the default, we could do something like this:
>> 
>> SELECT ....
>> FROM table(dfs.dfdl_workspace.`myfile` (type=>'dfdl', dfdlSchema=>'path_to_schema')
>> 
>> Take a look at the Excel docs (https://github.com/apache/drill/blob/master/contrib/format-excel/README.md) which demonstrate how to write queries like that.  I believe that the parameters in the table function take higher precedence than the parameters from the config.  That would make sense at least.
>> 
> 
> Perfect. I'll start with this. 
>  
>> 
>> 2.  Now that we have the schema file, the next thing would be to convert that into a Drill schema.  Let's say that we have a function called dfdlToDrill that handles the conversion.
>> 
>> What you'd have to do is in the constructor for the BatchReader, you'd have to set the schema there.  So pseudo code:
>> 
>> public DFDLBatchReader(DFDLReaderConfig, EasySubScan scan, FileSchemaNegotiator negotiator) {
>>    	// Other stuff...
>>  	
>> 	// Get Drill schema from DFDL
>> 	TupleMetadata schema = dfldToDrill(<dfdl schema file);
>> 	
>> 	// Here's the important part
>>   	negotiator.tableSchema(schema, true);
>> }
>> 
>> The negotiator.tableSchema() accepts two args, a TupleMetadata and a boolean as to whether the schema is final or not.  Once this schema has been added to the negotiator object, you can then create the writers. 
>> 
> 
> That negotiator.tableSchema() is ideal. I was hoping that this was going to be the only place the metadata had to be given to drill. Excellent. 
>  
>> 
>> Take a look here... 
>> 
>> 
>> https://github.com/apache/drill/blob/2ab46a9411a52f12a0f9acb1144a318059439bc4/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java#L199
>> 
>> 
>> I see Paul just responded so I'll leave you with this.  If you have additional questions, send them our way.  Do take a look at the Excel plugin as I think it will be helpful.
>> 
> Yes, I've found the JsonLoaderImpl.readBatch() method, and Daffodil can work similarly.
> 
> This will take me a few more days to get to a pull request. The first one will be initial review, i.e., not intended to merge without more tests. Probably it will support only integer data fields, but should support lots of data shapes including vectors, choices, sequences, nested records, etc. 
> 
> Thanks for the help. 
>  
>> 
>>> On Oct 12, 2023, at 2:58 PM, Mike Beckerle <mbeckerle@apache.org <ma...@apache.org>> wrote:
>>> 
>>> So when a data format is described by a DFDL schema, I can generate
>>> equivalent Drill schema (TupleMetadata). This schema is always complete. I
>>> have unit tests working with this.
>>> 
>>> To do this for a real SQL query, I need the DFDL schema to be identified on
>>> the SQL query by a file path or URI.
>>> 
>>> Q: How do I get that DFDL schema File/URI parameter from the SQL query?
>>> 
>>> Next, assuming I have the DFDL schema identified, I generate an equivalent
>>> Drill TupleMetadata from it. (Or, hopefully retrieve it from a cache)
>>> 
>>> What objects do I call, or what classes do I have to create to make this
>>> Drill TupleMetadata available to Drill so it uses it in all the ways a
>>> static Drill schema can be useful?
>>> 
>>> I just need pointers to the code that illustrate how to do this. Thanks
>>> 
>>> -Mike Beckerle
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Thu, Oct 12, 2023 at 12:13 AM Paul Rogers <par0328@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>>> Mike,
>>>> 
>>>> This is a complex question and has two answers.
>>>> 
>>>> First, the standard enhanced vector framework (EVF) used by most readers
>>>> assumes a "pull" model: read each record. This is where the next() comes
>>>> in: readers just implement this to read the next record. But, the code
>>>> under EVF works with a push model: the readers write to vectors, and signal
>>>> the next record. EVF translates the lower-level push model to the
>>>> higher-level, easier-to-use pull model. The best example of this is the
>>>> JSON reader which uses Jackson to parse JSON and responds to the
>>>> corresponding events.
>>>> 
>>>> You can thus take over the task of filling a batch of records. I'd have to
>>>> poke around the code to refresh my memory. Or, you can take a look at the
>>>> (quite complex) JSON parser, or the EVF itself to see what it does. There
>>>> are many unit tests that show this at various levels of abstraction.
>>>> 
>>>> Basically, you have to:
>>>> 
>>>> * Start a batch
>>>> * Ask if you can start the next record (which might be declined if the
>>>> batch is full)
>>>> * Write each field. For complex fields, such as records, recursively do the
>>>> start/end record work.
>>>> * Mark the record as complete.
>>>> 
>>>> You should be able to map event handlers to EVF actions as a result. Even
>>>> though DFDL wants to "drive", it still has to give up control once the
>>>> batch is full. EVF will then handle the (surprisingly complex) task of
>>>> finishing up the batch and returning it as the output of the Scan operator.
>>>> 
>>>> - Paul
>>>> 
>>>> On Wed, Oct 11, 2023 at 6:30 PM Mike Beckerle <mbeckerle@apache.org <ma...@apache.org>>
>>>> wrote:
>>>> 
>>>>> Daffodil parsing generates event callbacks to an InfosetOutputter, which
>>>> is
>>>>> analogous to a SAX event handler.
>>>>> 
>>>>> Drill is expecting an iterator style of calling next() to advance through
>>>>> the input, i.e., Drill has the control thread and expects to do pull
>>>>> parsing. At least from the code I studied in the format-xml contrib.
>>>>> 
>>>>> Is there any alternative? Before I dig into creating another one of these
>>>>> co-routine-style control inversions (which have proven to be problematic
>>>>> for performance.


Re: Drill TupleMetadata created from DFDL Schema - how do I inform Drill about it

Posted by Mike Beckerle <mb...@apache.org>.
Very helpful.

Answers to your questions, and comments are below:

On Thu, Oct 12, 2023 at 5:14 PM Charles Givre <cg...@gmail.com> wrote:

> HI Mike,
> I hope all is well.  I'll take a stab at answering your questions.  But I
> have a few questions as well:
>
>
> 1.  Are you writing a storage or format plugin for DFDL?  My thinking was
> that this would be a format plugin, but let me know if you were thinking
> differently
>

Format plugin.


> 2.  In traditional deployments, where do people store the DFDL schemata
> files?  Are they local or accessible via URL?
>

Schemas are stored in files, or in jar files created when packaging a
schema project. Hence URI is the preferred identifier for them.  They are
not retrieved remotely or anything like that. It's a matter of whether they
are in jars on the classpath, directories on the classpath, or just a file
location.

The source-code of DFDL schemas are often created using other schemas as
components, so a single "DFDL schema" may have parts that come from 5 jar
files on the classpath e.g., 2 different header schemas, a library schema,
and the "main" schema that assembles them all.  Inside schemas they refer
to each other via xs:include or xs:import, and the schemaLocation attribute
takes a URI to the location of the included/imported schema and those URIs
are interpreted this same way we would want Drill to identify the location
of a schema.

However, in practice people will want to pre-compile any real (non-toy/test)
DFDL schemas into binary ".bin" files for faster loading. Otherwise Daffodil
schema compilation time can be excessive (minutes for large DFDL schemas -
for example, the DFDL schema for VMF is 180K lines of DFDL). Compiled
schemas live in exactly one, relatively small, file (the compiled form of the
VMF schema is 8 MB). So the schema path given in the Drill SQL query, or in
the config, should be allowed to be either a compiled schema or a
source-code schema (.xsd), the latter mostly being for test, training, and
toy examples that we would compile on-the-fly.
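
For reference, loading a schema through Daffodil's Java API looks roughly
like this (a sketch only: the wrapper method is for illustration and
diagnostic handling is abbreviated):

import java.io.File;
import java.net.URI;
import org.apache.daffodil.japi.Compiler;
import org.apache.daffodil.japi.Daffodil;
import org.apache.daffodil.japi.DataProcessor;
import org.apache.daffodil.japi.ProcessorFactory;

DataProcessor loadSchema(URI schemaUri, boolean precompiled) throws Exception {
  Compiler c = Daffodil.compiler();
  if (precompiled) {
    // Reload a pre-compiled binary schema (.bin) -- fast.
    return c.reload(new File(schemaUri));
  }
  // Compile a source schema (.xsd) on the fly -- fine for tests and toy
  // examples, but can take minutes for large schemas.
  ProcessorFactory pf = c.compileSource(schemaUri);
  if (pf.isError()) {
    throw new IllegalArgumentException(pf.getDiagnostics().toString());
  }
  return pf.onPath("/");
}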


> To get the DFDL schema file or URL we have a few options, all of which
> revolve around setting a config variable.  For now, let's just say that the
> schema file is contained in the same folder as the data.  (We can make this
> more sophisticated later...)
>

It would make life difficult if the schemas and test data must be
co-resident. Most schema projects have these in entirely separate
sub-trees. Schema will be under src/main/resources/..../xsd, compiled
schema would be under target/... and test data under
src/test/resources/.../data

For now I think the easiest thing is just that we get two URIs: one for the
data, one for the schema. We access them via getClass().getResource().
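
For example, something along these lines (the resource path below is made up
for illustration):

private java.net.URI resolveSchema(String ref) throws java.net.URISyntaxException {
  // Try the classpath first (e.g. "/xsd/pcap.dfdl.xsd" inside a schema jar),
  // otherwise treat the string as an ordinary file path.
  java.net.URL onClasspath = getClass().getResource(ref);
  return onClasspath != null
      ? onClasspath.toURI()
      : java.nio.file.Paths.get(ref).toUri();
}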

We should not worry about caching or anything for now. Once the above works
for a decent scope of tests we can worry about making it more convenient to
have a library of schemas at one's disposal.


>
> Here's what you have to do.
>
> 1.  In the formatConfig file, define a String called 'dfdlSchema'.
> Note... config variables must be private and final.  If they aren't it can
> cause weird errors that are really difficult to debug.  For some reference,
> take a look at the Excel plugin.  (
> https://github.com/apache/drill/blob/master/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelFormatConfig.java
> )
>
> Setting a config variable there will allow a user to set a global schema
> definition.  This can also be configured individually for various
> workspaces.  So let's say you had PCAP files in one workspace, you could
> globally set the DFDL file for that and then another workspace which has
> some other file, you could create another DFDL plugin instance for that.
>

Ok, so the above lets me play with Drill and one schema by default. Ok for
using Drill to explore data, and useful for testing.


>
> Now, this is all fine and good, but a user might also want to define the
> schema file at query time.  The good news is that Drill allows you to do
> that via the table() function.
>
>
This would allow real data-integration queries against multiple different
DFDL-described data sources. Needed for a compelling demo.


> So let's say that we want to use a different schema file than the default,
> we could do something like this:
>
> SELECT ....
> FROM table(dfs.dfdl_workspace.`myfile` (type=>'dfdl',
> dfdlSchema=>'path_to_schema')
>
> Take a look at the Excel docs (
> https://github.com/apache/drill/blob/master/contrib/format-excel/README.md)
> which demonstrate how to write queries like that.  I believe that the
> parameters in the table function take higher precedence than the parameters
> from the config.  That would make sense at least.
>
>
Perfect. I'll start with this.


>
> 2.  Now that we have the schema file, the next thing would be to convert
> that into a Drill schema.  Let's say that we have a function called
> dfdlToDrill that handles the conversion.
>
> What you'd have to do is in the constructor for the BatchReader, you'd
> have to set the schema there.  So pseudo code:
>
> public DFDLBatchReader(DFDLReaderConfig, EasySubScan scan,
> FileSchemaNegotiator negotiator) {
>     // Other stuff...
>
> // Get Drill schema from DFDL
> TupleMetadata schema = dfldToDrill(<dfdl schema file);
> // Here's the important part
>    negotiator.tableSchema(schema, true);
> }
>
> The negotiator.tableSchema() accepts two args, a TupleMetadata and a
> boolean as to whether the schema is final or not.  Once this schema has
> been added to the negotiator object, you can then create the writers.
>
>
That negotiator.tableSchema() is ideal. I was hoping that this was going to
be the only place the metadata had to be given to drill. Excellent.


>
> Take a look here...
>
>
> https://github.com/apache/drill/blob/2ab46a9411a52f12a0f9acb1144a318059439bc4/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java#L199
>
>
> I see Paul just responded so I'll leave you with this.  If you have
> additional questions, send them our way.  Do take a look at the Excel
> plugin as I think it will be helpful.
>
Yes, I've found the JsonLoaderImpl.readBatch() method, and Daffodil can
work similarly.

This will take me a few more days to get to a pull request. The first one
will be initial review, i.e., not intended to merge without more tests.
Probably it will support only integer data fields, but should support lots
of data shapes including vectors, choices, sequences, nested records, etc.

Thanks for the help.


>
> On Oct 12, 2023, at 2:58 PM, Mike Beckerle <mb...@apache.org> wrote:
>
> So when a data format is described by a DFDL schema, I can generate
> equivalent Drill schema (TupleMetadata). This schema is always complete. I
> have unit tests working with this.
>
> To do this for a real SQL query, I need the DFDL schema to be identified on
> the SQL query by a file path or URI.
>
> Q: How do I get that DFDL schema File/URI parameter from the SQL query?
>
> Next, assuming I have the DFDL schema identified, I generate an equivalent
> Drill TupleMetadata from it. (Or, hopefully retrieve it from a cache)
>
> What objects do I call, or what classes do I have to create to make this
> Drill TupleMetadata available to Drill so it uses it in all the ways a
> static Drill schema can be useful?
>
> I just need pointers to the code that illustrate how to do this. Thanks
>
> -Mike Beckerle
>
>
>
>
>
>
>
>
>
>
> On Thu, Oct 12, 2023 at 12:13 AM Paul Rogers <pa...@gmail.com> wrote:
>
> Mike,
>
> This is a complex question and has two answers.
>
> First, the standard enhanced vector framework (EVF) used by most readers
> assumes a "pull" model: read each record. This is where the next() comes
> in: readers just implement this to read the next record. But, the code
> under EVF works with a push model: the readers write to vectors, and signal
> the next record. EVF translates the lower-level push model to the
> higher-level, easier-to-use pull model. The best example of this is the
> JSON reader which uses Jackson to parse JSON and responds to the
> corresponding events.
>
> You can thus take over the task of filling a batch of records. I'd have to
> poke around the code to refresh my memory. Or, you can take a look at the
> (quite complex) JSON parser, or the EVF itself to see what it does. There
> are many unit tests that show this at various levels of abstraction.
>
> Basically, you have to:
>
> * Start a batch
> * Ask if you can start the next record (which might be declined if the
> batch is full)
> * Write each field. For complex fields, such as records, recursively do the
> start/end record work.
> * Mark the record as complete.
>
> You should be able to map event handlers to EVF actions as a result. Even
> though DFDL wants to "drive", it still has to give up control once the
> batch is full. EVF will then handle the (surprisingly complex) task of
> finishing up the batch and returning it as the output of the Scan operator.
>
> - Paul
>
> On Wed, Oct 11, 2023 at 6:30 PM Mike Beckerle <mb...@apache.org>
> wrote:
>
> Daffodil parsing generates event callbacks to an InfosetOutputter, which
>
> is
>
> analogous to a SAX event handler.
>
> Drill is expecting an iterator style of calling next() to advance through
> the input, i.e., Drill has the control thread and expects to do pull
> parsing. At least from the code I studied in the format-xml contrib.
>
> Is there any alternative? Before I dig into creating another one of these
> co-routine-style control inversions (which have proven to be problematic
> for performance.
>
>
>
>

Re: Drill TupleMetadata created from DFDL Schema - how do I inform Drill about it

Posted by Charles Givre <cg...@gmail.com>.
One more thought... As a suggestion, I'd recommend first getting the batch reader working with the DFDL schema file.  Once that's done, Paul and I can assist with caching, metastores, etc.
-- C



> On Oct 12, 2023, at 5:13 PM, Charles Givre <cg...@gmail.com> wrote:
> 
> HI Mike, 
> I hope all is well.  I'll take a stab at answering your questions.  But I have a few questions as well:
> 
> 1.  Are you writing a storage or format plugin for DFDL?  My thinking was that this would be a format plugin, but let me know if you were thinking differently
> 2.  In traditional deployments, where do people store the DFDL schemata files?  Are they local or accessible via URL?
> 
> To get the DFDL schema file or URL we have a few options, all of which revolve around setting a config variable.  For now, let's just say that the schema file is contained in the same folder as the data.  (We can make this more sophisticated later...)
> 
> Here's what you have to do.
> 
> 1.  In the formatConfig file, define a String called 'dfdlSchema'.   Note... config variables must be private and final.  If they aren't it can cause weird errors that are really difficult to debug.  For some reference, take a look at the Excel plugin.  (https://github.com/apache/drill/blob/master/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelFormatConfig.java)
> 
> Setting a config variable there will allow a user to set a global schema definition.  This can also be configured individually for various workspaces.  So let's say you had PCAP files in one workspace, you could globally set the DFDL file for that and then another workspace which has some other file, you could create another DFDL plugin instance for that. 
> 
> Now, this is all fine and good, but a user might also want to define the schema file at query time.  The good news is that Drill allows you to do that via the table() function. 
> 
> So let's say that we want to use a different schema file than the default, we could do something like this:
> 
> SELECT ....
> FROM table(dfs.dfdl_workspace.`myfile` (type=>'dfdl', dfdlSchema=>'path_to_schema')
> 
> Take a look at the Excel docs (https://github.com/apache/drill/blob/master/contrib/format-excel/README.md) which demonstrate how to write queries like that.  I believe that the parameters in the table function take higher precedence than the parameters from the config.  That would make sense at least.
> 
> 
> 2.  Now that we have the schema file, the next thing would be to convert that into a Drill schema.  Let's say that we have a function called dfdlToDrill that handles the conversion.
> 
> What you'd have to do is in the constructor for the BatchReader, you'd have to set the schema there.  So pseudo code:
> 
> public DFDLBatchReader(DFDLReaderConfig, EasySubScan scan, FileSchemaNegotiator negotiator) {
>    	// Other stuff...
>  	
> 	// Get Drill schema from DFDL
> 	TupleMetadata schema = dfldToDrill(<dfdl schema file);
> 	
> 	// Here's the important part
>   	negotiator.tableSchema(schema, true);
> }
> 
> The negotiator.tableSchema() accepts two args, a TupleMetadata and a boolean as to whether the schema is final or not.  Once this schema has been added to the negotiator object, you can then create the writers. 
> 
> 
> Take a look here... 
> 
> https://github.com/apache/drill/blob/2ab46a9411a52f12a0f9acb1144a318059439bc4/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java#L199
> 
> 
> I see Paul just responded so I'll leave you with this.  If you have additional questions, send them our way.  Do take a look at the Excel plugin as I think it will be helpful.
> 
> Best,
> --C
> 
> 
>> On Oct 12, 2023, at 2:58 PM, Mike Beckerle <mb...@apache.org> wrote:
>> 
>> So when a data format is described by a DFDL schema, I can generate
>> equivalent Drill schema (TupleMetadata). This schema is always complete. I
>> have unit tests working with this.
>> 
>> To do this for a real SQL query, I need the DFDL schema to be identified on
>> the SQL query by a file path or URI.
>> 
>> Q: How do I get that DFDL schema File/URI parameter from the SQL query?
>> 
>> Next, assuming I have the DFDL schema identified, I generate an equivalent
>> Drill TupleMetadata from it. (Or, hopefully retrieve it from a cache)
>> 
>> What objects do I call, or what classes do I have to create to make this
>> Drill TupleMetadata available to Drill so it uses it in all the ways a
>> static Drill schema can be useful?
>> 
>> I just need pointers to the code that illustrate how to do this. Thanks
>> 
>> -Mike Beckerle
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> On Thu, Oct 12, 2023 at 12:13 AM Paul Rogers <pa...@gmail.com> wrote:
>> 
>>> Mike,
>>> 
>>> This is a complex question and has two answers.
>>> 
>>> First, the standard enhanced vector framework (EVF) used by most readers
>>> assumes a "pull" model: read each record. This is where the next() comes
>>> in: readers just implement this to read the next record. But, the code
>>> under EVF works with a push model: the readers write to vectors, and signal
>>> the next record. EVF translates the lower-level push model to the
>>> higher-level, easier-to-use pull model. The best example of this is the
>>> JSON reader which uses Jackson to parse JSON and responds to the
>>> corresponding events.
>>> 
>>> You can thus take over the task of filling a batch of records. I'd have to
>>> poke around the code to refresh my memory. Or, you can take a look at the
>>> (quite complex) JSON parser, or the EVF itself to see what it does. There
>>> are many unit tests that show this at various levels of abstraction.
>>> 
>>> Basically, you have to:
>>> 
>>> * Start a batch
>>> * Ask if you can start the next record (which might be declined if the
>>> batch is full)
>>> * Write each field. For complex fields, such as records, recursively do the
>>> start/end record work.
>>> * Mark the record as complete.
>>> 
>>> You should be able to map event handlers to EVF actions as a result. Even
>>> though DFDL wants to "drive", it still has to give up control once the
>>> batch is full. EVF will then handle the (surprisingly complex) task of
>>> finishing up the batch and returning it as the output of the Scan operator.
>>> 
>>> - Paul
>>> 
>>> On Wed, Oct 11, 2023 at 6:30 PM Mike Beckerle <mb...@apache.org>
>>> wrote:
>>> 
>>>> Daffodil parsing generates event callbacks to an InfosetOutputter, which
>>> is
>>>> analogous to a SAX event handler.
>>>> 
>>>> Drill is expecting an iterator style of calling next() to advance through
>>>> the input, i.e., Drill has the control thread and expects to do pull
>>>> parsing. At least from the code I studied in the format-xml contrib.
>>>> 
>>>> Is there any alternative? Before I dig into creating another one of these
>>>> co-routine-style control inversions (which have proven to be problematic
>>>> for performance.
>>>> 
>>> 
> 


Re: Drill TupleMetadata created from DFDL Schema - how do I inform Drill about it

Posted by Charles Givre <cg...@gmail.com>.
HI Mike, 
I hope all is well.  I'll take a stab at answering your questions.  But I have a few questions as well:

1.  Are you writing a storage or format plugin for DFDL?  My thinking was that this would be a format plugin, but let me know if you were thinking differently
2.  In traditional deployments, where do people store the DFDL schemata files?  Are they local or accessible via URL?

To get the DFDL schema file or URL we have a few options, all of which revolve around setting a config variable.  For now, let's just say that the schema file is contained in the same folder as the data.  (We can make this more sophisticated later...)

Here's what you have to do.

1.  In the formatConfig file, define a String called 'dfdlSchema'.   Note... config variables must be private and final.  If they aren't it can cause weird errors that are really difficult to debug.  For some reference, take a look at the Excel plugin.  (https://github.com/apache/drill/blob/master/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelFormatConfig.java)

Setting a config variable there will allow a user to set a global schema definition.  This can also be configured individually for various workspaces.  So let's say you had PCAP files in one workspace, you could globally set the DFDL file for that and then another workspace which has some other file, you could create another DFDL plugin instance for that. 
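
As a rough sketch, the config class might look something like the following
(modeled loosely on ExcelFormatConfig; the class name, type name and default
extension are placeholders, and the real thing also needs equals() and
hashCode() -- check the Excel plugin for the exact pattern):

import java.util.Collections;
import java.util.List;
import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonInclude;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.annotation.JsonTypeName;
import org.apache.drill.common.logical.FormatPluginConfig;

@JsonTypeName("dfdl")
@JsonInclude(JsonInclude.Include.NON_DEFAULT)
public class DFDLFormatConfig implements FormatPluginConfig {
  // Config fields must be private and final.
  private final String dfdlSchema;
  private final List<String> extensions;

  @JsonCreator
  public DFDLFormatConfig(@JsonProperty("dfdlSchema") String dfdlSchema,
                          @JsonProperty("extensions") List<String> extensions) {
    this.dfdlSchema = dfdlSchema;
    this.extensions = extensions == null ? Collections.singletonList("dat") : extensions;
  }

  public String getDfdlSchema() { return dfdlSchema; }
  public List<String> getExtensions() { return extensions; }
}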

Now, this is all fine and good, but a user might also want to define the schema file at query time.  The good news is that Drill allows you to do that via the table() function. 

So let's say that we want to use a different schema file than the default, we could do something like this:

SELECT ....
FROM table(dfs.dfdl_workspace.`myfile` (type=>'dfdl', dfdlSchema=>'path_to_schema'))

Take a look at the Excel docs (https://github.com/apache/drill/blob/master/contrib/format-excel/README.md) which demonstrate how to write queries like that.  I believe that the parameters in the table function take higher precedence than the parameters from the config.  That would make sense at least.


2.  Now that we have the schema file, the next thing would be to convert that into a Drill schema.  Let's say that we have a function called dfdlToDrill that handles the conversion.

What you'd have to do is in the constructor for the BatchReader, you'd have to set the schema there.  So pseudo code:

public DFDLBatchReader(DFDLReaderConfig readerConfig, EasySubScan scan, FileSchemaNegotiator negotiator) {
	// Other stuff...

	// Get Drill schema from DFDL
	TupleMetadata schema = dfdlToDrill(dfdlSchemaFile);

	// Here's the important part
	negotiator.tableSchema(schema, true);
}

The negotiator.tableSchema() accepts two args, a TupleMetadata and a boolean as to whether the schema is final or not.  Once this schema has been added to the negotiator object, you can then create the writers. 
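
In rough terms, the reader then looks something like this (a sketch only: the
rowWriter field, the end-of-input check, the schema-file variable and the
column name are placeholders -- check ExcelBatchReader and the EVF unit tests
for the exact shape):

private RowSetLoader rowWriter;

public DFDLBatchReader(DFDLReaderConfig readerConfig, EasySubScan scan,
                       FileSchemaNegotiator negotiator) {
  TupleMetadata schema = dfdlToDrill(dfdlSchemaFile);
  negotiator.tableSchema(schema, true);
  ResultSetLoader loader = negotiator.build();    // writers come from this
  rowWriter = loader.writer();
}

@Override
public boolean next() {
  while (!rowWriter.isFull()) {
    if (noMoreData()) {                           // hypothetical EOF check
      return false;                               // no more batches after this
    }
    rowWriter.start();                            // begin one row
    rowWriter.scalar("someField").setInt(42);     // write each column
    rowWriter.save();                             // commit the row
  }
  return true;                                    // batch full; more data remains
}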


Take a look here... 

https://github.com/apache/drill/blob/2ab46a9411a52f12a0f9acb1144a318059439bc4/contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java#L199


I see Paul just responded so I'll leave you with this.  If you have additional questions, send them our way.  Do take a look at the Excel plugin as I think it will be helpful.

Best,
--C


> On Oct 12, 2023, at 2:58 PM, Mike Beckerle <mb...@apache.org> wrote:
> 
> So when a data format is described by a DFDL schema, I can generate
> equivalent Drill schema (TupleMetadata). This schema is always complete. I
> have unit tests working with this.
> 
> To do this for a real SQL query, I need the DFDL schema to be identified on
> the SQL query by a file path or URI.
> 
> Q: How do I get that DFDL schema File/URI parameter from the SQL query?
> 
> Next, assuming I have the DFDL schema identified, I generate an equivalent
> Drill TupleMetadata from it. (Or, hopefully retrieve it from a cache)
> 
> What objects do I call, or what classes do I have to create to make this
> Drill TupleMetadata available to Drill so it uses it in all the ways a
> static Drill schema can be useful?
> 
> I just need pointers to the code that illustrate how to do this. Thanks
> 
> -Mike Beckerle
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> On Thu, Oct 12, 2023 at 12:13 AM Paul Rogers <pa...@gmail.com> wrote:
> 
>> Mike,
>> 
>> This is a complex question and has two answers.
>> 
>> First, the standard enhanced vector framework (EVF) used by most readers
>> assumes a "pull" model: read each record. This is where the next() comes
>> in: readers just implement this to read the next record. But, the code
>> under EVF works with a push model: the readers write to vectors, and signal
>> the next record. EVF translates the lower-level push model to the
>> higher-level, easier-to-use pull model. The best example of this is the
>> JSON reader which uses Jackson to parse JSON and responds to the
>> corresponding events.
>> 
>> You can thus take over the task of filling a batch of records. I'd have to
>> poke around the code to refresh my memory. Or, you can take a look at the
>> (quite complex) JSON parser, or the EVF itself to see what it does. There
>> are many unit tests that show this at various levels of abstraction.
>> 
>> Basically, you have to:
>> 
>> * Start a batch
>> * Ask if you can start the next record (which might be declined if the
>> batch is full)
>> * Write each field. For complex fields, such as records, recursively do the
>> start/end record work.
>> * Mark the record as complete.
>> 
>> You should be able to map event handlers to EVF actions as a result. Even
>> though DFDL wants to "drive", it still has to give up control once the
>> batch is full. EVF will then handle the (surprisingly complex) task of
>> finishing up the batch and returning it as the output of the Scan operator.
>> 
>> - Paul
>> 
>> On Wed, Oct 11, 2023 at 6:30 PM Mike Beckerle <mb...@apache.org>
>> wrote:
>> 
>>> Daffodil parsing generates event callbacks to an InfosetOutputter, which
>> is
>>> analogous to a SAX event handler.
>>> 
>>> Drill is expecting an iterator style of calling next() to advance through
>>> the input, i.e., Drill has the control thread and expects to do pull
>>> parsing. At least from the code I studied in the format-xml contrib.
>>> 
>>> Is there any alternative? Before I dig into creating another one of these
>>> co-routine-style control inversions (which have proven to be problematic
>>> for performance.
>>> 
>>