Posted to user@drill.apache.org by John Omernik <jo...@omernik.com> on 2015/09/30 17:07:57 UTC

Protobuf Files

I am looking at trying to make use of a large collection of Protobuf
files. We have the schema definition, and at this time I understand that
Drill does not have a reader for Protobuf files.


*Disclosure: I am not a strong developer, thus me asking the questions.

1. What is the difficulty in creating a plugin for Drill that could read
these files "natively", as is done with Parquet? From the little
information I've been able to grok, it would require specifying a file with
schema information, but beyond getting the schema, what other challenges
are inherent to Protobufs?

2. Is there a turnkey way to convert Protobufs to Avro or Parquet in a
performant way? Without the ability to write a storage plugin, this could
work for me. I'd like to limit ETL, but at the same time I'd like to make
use of these files at scale.

3. Any other thoughts, projects, examples, that may help me in my quest
here?

I wish I had a better grasp of the data challenges between formats and how
Drill works, but alas, I will just post out my ignorance with the goals of
solving my problem, and hopefully getting smarter in the process.

John

Re: Protobuf Files

Posted by Jacques Nadeau <ja...@dremio.com>.
It could be done at the format plugin level today. We were just about to
propose a SQL "FROM t WITH options" or similar syntax. You could look at the
httpd plugin example to see how the first part could work until the new
syntax is supported.
On Oct 2, 2015 9:19 PM, "Ted Dunning" <te...@gmail.com> wrote:

> Protobuf is a bit different from parquet and other formats because you
> would need some way to associate which proto format to associate with each
> file.
>
> In 30 seconds, I don't see an easy way to do that in the current SQL
> syntax.
>
> Does somebody else have a good idea for this?

Re: Protobuf Files

Posted by Ted Dunning <te...@gmail.com>.
This is difficult given the sheer number of protos that are likely to be required. 

Jacques' "from t with ..." suggestion works better, I think.


> On Oct 3, 2015, at 9:39, Jim Scott <js...@maprtech.com> wrote:
> 
> Logistically the same as the custom delimiter capability. .psv is for
> pipes, .csv is for commas... .mydata.proto could use a proto definition.

Re: Protobuf Files

Posted by Jim Scott <js...@maprtech.com>.
You could create a file format with an extension, which points to a proto
definition file.

Logistically the same as the custom delimiter capability. .psv is for
pipes, .csv is for commas... .mydata.proto could use a proto definition.
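
A minimal sketch of that lookup, written in Python rather than as Drill
plugin code. The registry contents, the schemas/mydata.proto path, and the
MyData message name are all hypothetical; nothing here is an existing Drill
API.

```python
from pathlib import PurePosixPath

# Hypothetical registry: file suffix -> how to read the file.
# The protobuf entry points at a proto definition, as suggested above.
FORMATS = {
    ".psv": {"type": "text", "delimiter": "|"},
    ".csv": {"type": "text", "delimiter": ","},
    ".mydata.proto": {
        "type": "protobuf",
        "proto": "schemas/mydata.proto",  # assumed location of the definition
        "message": "MyData",              # assumed top-level message name
    },
}


def resolve_format(path):
    """Return the format entry whose suffix matches the file name, or None."""
    name = PurePosixPath(path).name.lower()
    # Try longer suffixes first so ".mydata.proto" is matched as a whole.
    for suffix in sorted(FORMATS, key=len, reverse=True):
        if name.endswith(suffix):
            return FORMATS[suffix]
    return None
```

With this, resolve_format("/data/2015/events.mydata.proto") returns the
protobuf entry, and a file with an unknown suffix returns None. The cost is
one registry entry per proto definition in use.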


On Fri, Oct 2, 2015 at 11:18 PM, Ted Dunning <te...@gmail.com> wrote:

> Protobuf is a bit different from parquet and other formats because you
> would need some way to associate which proto format to associate with each
> file.
>
> In 30 seconds, I don't see an easy way to do that in the current SQL
> syntax.
>
> Does somebody else have a good idea for this?

Re: Protobuf Files

Posted by Ted Dunning <te...@gmail.com>.
Protobuf is a bit different from Parquet and other formats because you
would need some way to specify which proto definition goes with each
file.
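
To illustrate why the definition is needed: the protobuf wire format stores
only field numbers and wire types, not field names or declared types. The
rough Python sketch below scans a message without any schema. It is an
illustration only, handles just the varint, fixed-width, and
length-delimited wire types, and the function names are made up.

```python
def read_varint(buf, pos):
    """Decode one base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of the varint
            return result, pos
        shift += 7


def scan_fields(buf):
    """Yield (field_number, wire_type, raw_value) using no schema at all."""
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:    # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:  # fixed 64-bit, left as raw bytes
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:  # length-delimited: string, bytes, nested message
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:  # fixed 32-bit, left as raw bytes
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        yield field_number, wire_type, value
```

For the bytes 08 96 01 12 07 74 65 73 74 69 6e 67 this yields (1, 0, 150)
and (2, 2, b"testing"). What field 1 is called, or whether field 2 is a
string or a nested message, cannot be told without the definition.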

In 30 seconds, I don't see an easy way to do that in the current SQL syntax.

Does somebody else have a good idea for this?


On Fri, Oct 2, 2015 at 7:17 AM, Jim Scott <js...@maprtech.com> wrote:

> John,
>
> You may want to ask this question on the dev list as well.
>
> I think, logically, this could be accomplished similar to the httpd log
> parsing plugin that has recently been worked on. That plugin works by
> specifying the apache log format pattern. While a proto definition is much
> more complicated, it is a fundamentally similar approach.
>
> I've not really seen much in the way of discussion around protobuf data
> files being added to Drill, so, not sure about the general interest level.
>
> Regarding a way to batch convert: I found this project in a quick search
> for converting protobuf to json, as that would be where I would go with
> it... https://github.com/dpp-name/protobuf-json
>
> Looks like that would do the trick for you in short order.
>
> Jim

Re: Protobuf Files

Posted by Jim Scott <js...@maprtech.com>.
John,

You may want to ask this question on the dev list as well.

I think, logically, this could be accomplished similarly to the httpd log
parsing plugin that has recently been worked on. That plugin works by
specifying the Apache log format pattern. While a proto definition is much
more complicated, it is a fundamentally similar approach.

I've not really seen much in the way of discussion around protobuf data
files being added to Drill, so I'm not sure about the general interest level.

Regarding a way to batch convert: I found this project in a quick search
for converting protobuf to json, as that would be where I would go with
it... https://github.com/dpp-name/protobuf-json

Looks like that would do the trick for you in short order.
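
A rough Python sketch of the batch side of such a conversion, under one big
assumption: that each file holds a sequence of messages, each prefixed with
its size as a varint (the framing written by protobuf's Java
writeDelimitedTo). Your files may be framed differently. The decode argument
is a placeholder for whatever turns one message's bytes into a dict, for
example code generated from your proto definition; it is not part of the
project linked above.

```python
import io
import json


def read_length_prefix(stream):
    """Read one varint size prefix; return None on a clean end of stream."""
    result = 0
    shift = 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            if shift == 0:
                return None  # no more records
            raise EOFError("stream ended inside a varint")
        byte = chunk[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7


def iter_delimited(stream):
    """Yield the raw bytes of each length-prefixed record in the stream."""
    while True:
        size = read_length_prefix(stream)
        if size is None:
            return
        payload = stream.read(size)
        if len(payload) != size:
            raise EOFError("stream ended inside a record")
        yield payload


def to_json_lines(stream, decode, out):
    """Write one JSON object per record to out; return the record count."""
    count = 0
    for payload in iter_delimited(stream):
        out.write(json.dumps(decode(payload)) + "\n")
        count += 1
    return count
```

The output is one JSON object per line, which Drill's JSON reader can query
directly.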

Jim

On Wed, Sep 30, 2015 at 10:07 AM, John Omernik <jo...@omernik.com> wrote:

> I am looking at trying to make use of a large collection of Protobuf
> files. We have the schema definition, and at this time I understand that
> Drill does not have a reader for Protobuf files.
>
>
> *Disclosure: I am not a strong developer, thus me asking the questions.
>
> 1. What is the difficulty in creating a plugin for Drill that could read
> these files "natively" like is done with Parquet. From the little
> information I've been able to grok, it would require specifying a file with
> Schema information, but beyond getting the schema, what other challenges
> are inherent to Protobufs?
>
> 2. Is there a turn key way to convert Protobufs to Avro or Parquet in a
> performant way?  Without the ability to write a storage plugin, this could
> work for me, I'd like to "limit" ETL, but at the same time, I'd like to "at
> scale" make use of these files.
>
> 3. Any other thoughts, projects, examples, that may help me in my quest
> here?
>
> I wish I had a better grasp of the data challenges between formats and how
> Drill works, but alas, I will just post out my ignorance with the goals of
> solving my problem, and hopefully getting smarter in the process.
>
> John
>



-- 
*Jim Scott*
Director, Enterprise Strategy & Architecture
+1 (347) 746-9281
@kingmesal <https://twitter.com/kingmesal>
