Posted to dev@druid.apache.org by Gian Merlino <gi...@gmail.com> on 2019/02/11 22:43:29 UTC

Re: Druid Auto Field Type Detection

Yeah, that's a good point. Maybe we should store some extra information
about what the type was in the original input.
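
A minimal sketch of that idea in Python, not Druid code: it records which
JSON types were seen for each auto-detected column and picks a column type
only after the rows have been observed. The names ColumnTypeTracker, observe,
and resolve are hypothetical, and the resolution rules (nulls ignored, any
string forces string, mixed integer and float becomes double) are assumptions
made for illustration.

```python
import json
from collections import defaultdict


class ColumnTypeTracker:
    """Hypothetical helper: remembers the original JSON type of each value."""

    def __init__(self):
        # column name -> set of type names observed so far
        self.seen = defaultdict(set)

    def observe(self, row):
        for name, value in row.items():
            if value is None:
                continue  # a null carries no type information
            if isinstance(value, bool):
                # assumption: booleans are kept as strings
                self.seen[name].add("string")
            elif isinstance(value, int):
                self.seen[name].add("long")
            elif isinstance(value, float):
                self.seen[name].add("double")
            else:
                self.seen[name].add("string")

    def resolve(self, name):
        types = self.seen.get(name, set())
        # no typed value seen, or at least one string: fall back to string
        if not types or "string" in types:
            return "string"
        if "double" in types:
            return "double"
        return "long"


tracker = ColumnTypeTracker()
for line in ['{"field": 3}', '{"field": "foo"}',
             '{"count": null}', '{"count": 3}', '{"quoted": "3"}']:
    tracker.observe(json.loads(line))
```

With these rows, "field" resolves to string (3 followed by "foo"), "count"
resolves to long even though its first value was null, and "quoted" stays a
string because the input sent "3" rather than 3.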

On Sat, Jan 26, 2019 at 4:04 AM Furkan KAMACI <fu...@gmail.com>
wrote:

> Hi Gian,
>
> The same problem applies to null fields too. When the first record is null,
> it will not be possible to detect such a field's type.
>
> However, the problem is different in my case. You may have an ad-hoc field
> which is not defined at the beginning. Such a field should have a strict
> type, but that type is not known at the beginning. In your example, we may
> define such a field as Integer and throw an error or skip an entry which has
> the value "foo", because the field was initialized as Integer. On the other
> hand, sending a datum as:
>
> field: 3
>
> and
>
> field: "3"
>
> may be treated differently. The second one could be String but the first
> one should be Integer.
>
> I think that Solr could be an example for us of such a schemaless mode.
> What do you think?
>
> Kind Regards,
> Furkan KAMACI
>
> On Fri, Jan 25, 2019 at 8:56 PM Gian Merlino <gi...@apache.org> wrote:
>
> > Hey Furkan,
> >
> > Right now when Druid detects dimensions (so-called "schemaless" mode, what
> > you get when you have an empty dimensions list at ingestion time), it
> > assumes they are all strings. It would definitely be better if it did some
> > analysis on the incoming data and chose the most appropriate type. I think
> > the main consideration here is that Druid has to pick a type as soon as it
> > sees a new column, but it might not get it right just by looking at the
> > first record. Imagine some JSON data where you have a field that is the
> > number 3 for the first row Druid sees, but the string "foo" in the second.
> > The right type would be string, but Druid wouldn't know that when it gets
> > the first row.
> >
> > Maybe it would work to do some mechanism where auto-detected fields are
> > ingested as strings initially into IncrementalIndex, and then potentially
> > converted to a different type when written to disk.
> >
> > On Thu, Jan 10, 2019 at 12:43 AM Furkan KAMACI <fu...@gmail.com>
> > wrote:
> >
> > > Hi All,
> > >
> > > I can define auto type detection for timestamp as follows:
> > >
> > > "timestampSpec" : {
> > >      "format" : "auto",
> > >      "column" : "ts"
> > > }
> > >
> > > In a similar manner, I cannot detect field types via parseSpec. I mean:
> > >
> > > {"ts":"2018-01-01T03:35:45Z","app_token":"guid1","eventName":"app-x","properties-key1":"123"}
> > >
> > > {"ts":"2018-01-01T03:35:45Z","app_token":"guid2","eventName":"app-x","properties-key2":123}
> > >
> > > Both properties-key1 and properties-key2 are indexed as String. I expect
> > > properties-key2 to be indexed as Integer in Druid.
> > >
> > > So, is there any mechanism in Druid for automatic field type detection
> > > on a newly created field? If not, I would like to implement such
> > > a feature.
> > >
> > > Kind Regards,
> > > Furkan KAMACI
> > >
> >
>
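
The two-phase idea from the Jan 25 reply (ingest auto-detected fields as
strings, then possibly convert them when writing to disk) could be sketched
roughly as follows. This is plain Python for illustration only, not Druid's
IncrementalIndex; the function name persist_column and the rule "convert to
long only if every non-null value is an integer literal" are assumptions.

```python
import re

# optional minus sign followed by digits only
LONG_RE = re.compile(r"-?\d+")


def persist_column(buffered):
    """Pick a type for a column whose values were buffered as strings or None."""
    present = [v for v in buffered if v is not None]
    if present and all(LONG_RE.fullmatch(v) for v in present):
        # every non-null value is an integer literal: convert the column
        return "long", [None if v is None else int(v) for v in buffered]
    # otherwise keep the column as strings, unchanged
    return "string", list(buffered)
```

For example, persist_column(["3", None, "42"]) yields a long column, while
persist_column(["3", "foo"]) stays a string column. Because every value is
already a string by the time it is buffered, this sketch cannot tell an input
of 3 from an input of "3", which is the limitation raised in the Jan 26
reply.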

Re: Druid Auto Field Type Detection

Posted by Furkan KAMACI <fu...@gmail.com>.
Hi Gian,

I've created an issue for it:
https://github.com/apache/incubator-druid/issues/7027

Could you add a comment on where I can start implementing such a feature?

Kind Regards,
Furkan KAMACI
