Posted to dev@druid.apache.org by Furkan KAMACI <fu...@gmail.com> on 2019/01/10 08:43:28 UTC

Druid Auto Field Type Detection

Hi All,

I can define auto type detection for timestamp as follows:

"timestampSpec" : {
     "format" : "auto",
     "column" : "ts"
}

However, I cannot detect field types in a similar way via parseSpec. For example:

{"ts":"2018-01-01T03:35:45Z","app_token":"guid1","eventName":"app-x","properties-key1":"123"}

{"ts":"2018-01-01T03:35:45Z","app_token":"guid2","eventName":"app-x","properties-key2":123}

Both properties-key1 and properties-key2 are indexed as String. I would
expect properties-key2 to be indexed as Integer in Druid.
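
As a rough illustration of why detection looks feasible, the sketch below parses the two sample events with Python's standard json module and reports the type of each value. It only shows that the string/number distinction survives JSON parsing; it says nothing about how Druid's own parser handles the rows, and the helper name field_types is made up for this example.

```python
import json

# The two sample events from the message above.
rows = [
    '{"ts":"2018-01-01T03:35:45Z","app_token":"guid1","eventName":"app-x","properties-key1":"123"}',
    '{"ts":"2018-01-01T03:35:45Z","app_token":"guid2","eventName":"app-x","properties-key2":123}',
]


def field_types(line):
    """Map each field name to the Python type name of its parsed JSON value."""
    return {name: type(value).__name__ for name, value in json.loads(line).items()}


types_row1 = field_types(rows[0])
types_row2 = field_types(rows[1])
print(types_row1["properties-key1"])  # str
print(types_row2["properties-key2"])  # int
```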

So, is there any mechanism in Druid for automatic field type detection
on a newly created field? If not, I would like to implement such
a feature.

Kind Regards,
Furkan KAMACI

Re: Druid Auto Field Type Detection

Posted by Furkan KAMACI <fu...@gmail.com>.
Hi Gian,

I've created an issue for it:
https://github.com/apache/incubator-druid/issues/7027

Could you add a comment on where I can start implementing such a feature?

Kind Regards,
Furkan KAMACI

On Tue, Feb 12, 2019 at 1:43 AM Gian Merlino <gi...@gmail.com> wrote:

> Yeah that's a good point. Maybe we should store some extra information
> about what the type was in the original input.

Re: Druid Auto Field Type Detection

Posted by Gian Merlino <gi...@gmail.com>.
Yeah that's a good point. Maybe we should store some extra information
about what the type was in the original input.
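
One way to picture "extra information about the original type" is a per-column tally of the JSON types seen in the input, kept next to the string values. This is only a sketch under that assumption: the class TypeTrackingColumns and the helper json_type are hypothetical and do not correspond to anything in Druid's code base.

```python
import json
from collections import Counter, defaultdict


def json_type(value):
    """Name the JSON type of a parsed value (bool is checked before int)."""
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    return "complex"  # arrays and objects


class TypeTrackingColumns:
    """Hypothetical store: string values plus per-column original-type counts."""

    def __init__(self):
        self.values = defaultdict(list)
        self.original_types = defaultdict(Counter)

    def add(self, line):
        for name, value in json.loads(line).items():
            # Values are kept as strings; the original type is tallied separately.
            self.values[name].append(None if value is None else str(value))
            self.original_types[name][json_type(value)] += 1


cols = TypeTrackingColumns()
cols.add('{"field": 3}')
cols.add('{"field": "3"}')
cols.add('{"field": null}')
print(cols.values["field"])                # ['3', '3', None]
print(dict(cols.original_types["field"]))  # {'long': 1, 'string': 1, 'null': 1}
```

With the tally available, the values 3 and "3" stay distinguishable even though both are stored as the string "3".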

On Sat, Jan 26, 2019 at 4:04 AM Furkan KAMACI <fu...@gmail.com>
wrote:

> Hi Gian,
>
> The same problem applies to null fields too. When the first record is
> null, it will not be possible to detect such a field's type.
>
> However, the problem is different in my case. You may have an ad-hoc field
> which is not defined at the beginning. Such a field should have a strict
> type, but that type is not known at the beginning. In your example, we may
> define such a field as Integer and throw an error or skip an entry which has
> a value of "foo", because the field was initialized as Integer. On the other
> hand, sending a datum as:
>
> field: 3
>
> and
>
> field: "3"
>
> may be treated differently. The second one could be String but the first
> one should be Integer.
>
> I think that Solr could be an example for us of such a schemaless mode.
> What do you think?
>
> Kind Regards,
> Furkan KAMACI

Re: Druid Auto Field Type Detection

Posted by Furkan KAMACI <fu...@gmail.com>.
Hi Gian,

The same problem applies to null fields too. When the first record is
null, it will not be possible to detect such a field's type.

However, the problem is different in my case. You may have an ad-hoc field
which is not defined at the beginning. Such a field should have a strict
type, but that type is not known at the beginning. In your example, we may
define such a field as Integer and throw an error or skip an entry which has
a value of "foo", because the field was initialized as Integer. On the other
hand, sending a datum as:

field: 3

and

field: "3"

may be treated differently. The second one could be String but the first
one should be Integer.
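
The strict policy described above can be sketched roughly as follows: a field's type is fixed by its first non-null value, and any later row whose value has a different type is skipped. The class StrictSchema is hypothetical, and Python types stand in for Druid column types purely for illustration.

```python
class StrictSchema:
    """Hypothetical policy: a field's type is fixed by its first non-null value."""

    def __init__(self):
        self.types = {}

    def accept(self, row):
        """Return True and record new field types, or False if the row conflicts."""
        pending = {}
        for name, value in row.items():
            if value is None:
                continue  # a null says nothing about the type; decide later
            seen = self.types.get(name)
            if seen is None:
                pending[name] = type(value)
            elif type(value) is not seen:
                return False  # e.g. "foo" arriving for an Integer field
        # Only commit newly seen types once the whole row has been accepted.
        self.types.update(pending)
        return True


schema = StrictSchema()
results = [schema.accept(row) for row in (
    {"field": None},   # type still unknown
    {"field": 3},      # fixes the type as int
    {"field": "foo"},  # conflicts with int: skipped
    {"field": "3"},    # a string, so also skipped under this policy
    {"field": 4},
)]
print(results)  # [True, True, False, False, True]
```

Throwing an error instead of skipping would only change the `return False` branch.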

I think that Solr could be an example for us of such a schemaless mode.
What do you think?

Kind Regards,
Furkan KAMACI

On Fri, Jan 25, 2019 at 8:56 PM Gian Merlino <gi...@apache.org> wrote:

> Hey Furkan,
>
> Right now when Druid detects dimensions (so-called "schemaless" mode, what
> you get when you have an empty dimensions list at ingestion time), it
> assumes they are all strings. It would definitely be better if it did some
> analysis on the incoming data and chose the most appropriate type. I think
> the main consideration here is that Druid has to pick a type as soon as it
> sees a new column, but it might not get it right just by looking at the
> first record. Imagine some JSON data where you have a field that is the
> number 3 for the first row Druid sees, but the string "foo" in the second.
> The right type would be string, but Druid wouldn't know that when it gets
> the first row.
>
> Maybe it would work to have some mechanism where auto-detected fields are
> ingested as strings initially into IncrementalIndex, and then potentially
> converted to a different type when written to disk.

Re: Druid Auto Field Type Detection

Posted by Gian Merlino <gi...@apache.org>.
Hey Furkan,

Right now when Druid detects dimensions (so-called "schemaless" mode, what
you get when you have an empty dimensions list at ingestion time), it
assumes they are all strings. It would definitely be better if it did some
analysis on the incoming data and chose the most appropriate type. I think
the main consideration here is that Druid has to pick a type as soon as it
sees a new column, but it might not get it right just by looking at the
first record. Imagine some JSON data where you have a field that is the
number 3 for the first row Druid sees, but the string "foo" in the second.
The right type would be string, but Druid wouldn't know that when it gets
the first row.

Maybe it would work to have some mechanism where auto-detected fields are
ingested as strings initially into IncrementalIndex, and then potentially
converted to a different type when written to disk.
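
A minimal sketch of that deferred approach, assuming a plain dict of string lists as a stand-in for the in-memory index: values are buffered as strings, and the narrowest type that fits every buffered value is chosen only at flush time. The functions resolve_type and flush are hypothetical names, and Python's int() and float() are more permissive than a real column parser would be.

```python
def resolve_type(strings):
    """Pick the narrowest type that fits every buffered non-null string."""
    present = [s for s in strings if s is not None]
    if not present:
        return "string"  # nothing to analyse, so keep the default
    for name, parse in (("long", int), ("double", float)):
        try:
            for s in present:
                parse(s)
            return name
        except ValueError:
            continue  # at least one value does not fit; try the next type
    return "string"


def flush(buffer):
    """Convert each buffered string column to its resolved type."""
    casts = {"long": int, "double": float, "string": str}
    out = {}
    for column, strings in buffer.items():
        kind = resolve_type(strings)
        out[column] = (kind, [None if s is None else casts[kind](s) for s in strings])
    return out


buffer = {
    "a": ["3", "foo"],    # the 3-then-"foo" case: must stay a string
    "b": ["1", "2", None],
    "c": ["1.5", "2"],
}
segment = flush(buffer)
print(segment["a"])  # ('string', ['3', 'foo'])
print(segment["b"])  # ('long', [1, 2, None])
print(segment["c"])  # ('double', [1.5, 2.0])
```

Because everything is buffered as a string, this sketch cannot tell an input of 3 from an input of "3"; both would resolve to long.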

On Thu, Jan 10, 2019 at 12:43 AM Furkan KAMACI <fu...@gmail.com>
wrote:

> Hi All,
>
> I can define auto type detection for timestamp as follows:
>
> "timestampSpec" : {
>      "format" : "auto",
>      "column" : "ts"
> }
>
> However, I cannot detect field types in a similar way via parseSpec. For example:
>
>
> {"ts":"2018-01-01T03:35:45Z","app_token":"guid1","eventName":"app-x","properties-key1":"123"}
>
>
> {"ts":"2018-01-01T03:35:45Z","app_token":"guid2","eventName":"app-x","properties-key2":123}
>
> Both properties-key1 and properties-key2 are indexed as String. I would
> expect properties-key2 to be indexed as Integer in Druid.
>
> So, is there any mechanism in Druid for automatic field type detection
> on a newly created field? If not, I would like to implement such
> a feature.
>
> Kind Regards,
> Furkan KAMACI
>