You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@flume.apache.org by Edward Sargisson <es...@pobox.com> on 2013/04/18 20:57:30 UTC

ElasticSearchSink: should we combined the proposed interfaces for event serialization and id assignment

Hi all,
(I read this list in digest mode; would you mind ccing me on any replies?)

I've got two patches progressing through Jira (FLUME-1782, FLUME-1972).
-1782 fixes a defect where the wrong timestamp field and elasticsearch
index name are used. -1972 adds an interface which users can implement to
assign an id instead of letting elasticsearch randomly assign on.

The question to discuss: should we(I) combine those interfaces and just
have a single interface.

Mike Percy has kindly reviewed FLUME-1782 and the knock-on effect of his
comments require that the ElasticSearchEventSerializer interface be changed
- and thus this becomes a breaking change. I had been attempting to avoid
that and this is why -1972 has a new interface.

If we're going to break the interface then maybe we should go all the way
and put the new id provider functionality on to it as well? We could also
rename it to ElasticsearchEventSerializer (lower case s on search) to match
the way the maintainer of elasticsearch spells it.

The new interface would be:
public interface ElasticSearchEventSerializer extends Configurable,
    ConfigurableComponent {

  static final Charset charset = Charset.defaultCharset();
  XContentBuilder getContentBuilder(Event event, Long timestampOverride)
throws IOException;

  String getId(Event event)
}

The timestampOverride would only be non-null if there was no timestamp
header.

Thoughts?

Cheers,
Edward

Re: ElasticSearchSink: should we combined the proposed interfaces for event serialization and id assignment

Posted by Israel Ekpo <is...@gmail.com>.
Edward,

I would recommend directly subscribing to the developer and user mailing
lists

dev-subscribe@flume.apache.org
user-subscribe@flume.apache.org

A lot of times, folks just click reply and it goes directly to
dev@flume.apache.org since that is what is set in the email headers.

All you have to do is send an email to the email addresses above and then
reply to the response and you are good.


On Thu, Apr 18, 2013 at 2:57 PM, Edward Sargisson <es...@pobox.com> wrote:

> Hi all,
> (I read this list in digest mode; would you mind ccing me on any replies?)
>
> I've got two patches progressing through Jira (FLUME-1782, FLUME-1972).
> -1782 fixes a defect where the wrong timestamp field and elasticsearch
> index name are used. -1972 adds an interface which users can implement to
> assign an id instead of letting elasticsearch randomly assign on.
>
> The question to discuss: should we(I) combine those interfaces and just
> have a single interface.
>
> Mike Percy has kindly reviewed FLUME-1782 and the knock-on effect of his
> comments require that the ElasticSearchEventSerializer interface be changed
> - and thus this becomes a breaking change. I had been attempting to avoid
> that and this is why -1972 has a new interface.
>
> If we're going to break the interface then maybe we should go all the way
> and put the new id provider functionality on to it as well? We could also
> rename it to ElasticsearchEventSerializer (lower case s on search) to match
> the way the maintainer of elasticsearch spells it.
>
> The new interface would be:
> public interface ElasticSearchEventSerializer extends Configurable,
>     ConfigurableComponent {
>
>   static final Charset charset = Charset.defaultCharset();
>   XContentBuilder getContentBuilder(Event event, Long timestampOverride)
> throws IOException;
>
>   String getId(Event event)
> }
>
> The timestampOverride would only be non-null if there was no timestamp
> header.
>
> Thoughts?
>
> Cheers,
> Edward
>



-- 
°O°
"Good Enough" is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.
http://www.israelekpo.com/

Re: ElasticSearchSink: should we combined the proposed interfaces for event serialization and id assignment

Posted by Edward Sargisson <es...@pobox.com>.
Hi Mike,
Sure - then I think I'll combine the event serialization and id assignment
interfaces for the new interface.
I will probably also fix the elasticsearch optional dependency issue
(FLUME-1992) while I'm there.

Cheers,
Edward


On Thu, Apr 18, 2013 at 5:19 PM, Mike Percy <mp...@apache.org> wrote:

> Thanks for the additional info, Edward. Let's avoid breaking API backwards
> compatibility. I've added a comment on the review board, let's see if we
> can just use instanceof to support an additional serializer API.
>
> Mike
>
>
>
> On Thu, Apr 18, 2013 at 4:18 PM, Edward Sargisson <es...@pobox.com>wrote:
>
>> Hi Mike,
>> To get more detailed, the problem is that writing the event to
>> elasticsearch so that Kibana can read it requires that it be written to
>> the
>> correct index and have the correct @timestamp header. If the timestamps
>> used for both of those things is not the same then there is a risk that
>> Kibana won't be able to find the event. This is because Kibana searches
>> for
>> events by taking the time interval you specify in the interface,
>> determining which indexes include that time interval and then querying
>> those indexes by the @timestamp field.
>>
>> So, if the sink decides that it needs to provide the timestamp, and it
>> can't put the timestamp into the event itself then the
>> ElasticSearchEventSerializer implementation has no way to get that
>> timestamp. If we relied on the serializer doing System.currentTimeMillis()
>> itself then we run the risk of the event being written at midnight. It
>> might decide to put the event in yesterday's index with today's date.
>>
>> Could we keep the old interface and only use the new one if it's there?
>> The problem with that is that the sink may decide to provide a new
>> timestamp - but the interface implementation has no way of being told what
>> it is.
>>
>> I've just written the code to do it this way and will be adding it to the
>> ticket sometime soon.
>>
>> Cheers,
>> Edward
>>
>>
>> On Thu, Apr 18, 2013 at 4:04 PM, Mike Percy <mp...@apache.org> wrote:
>>
>> > Edward,
>> > I added you to CC but I would also recommend subscribing to the dev list
>> > due to how the list headers are configured.
>> >
>> > This is a rough situation. I am loathe to break API compatibility but at
>> > the same time I don't know much about ES and feel I need to find some
>> time
>> > to invest in understanding the ES concepts and how Flume is interacting
>> > with them in this sink.
>> >
>> > Without thinking about it very hard yet, I'd ask if we can just add a
>> new
>> > interface that doesn't suck and maybe is more extensible without
>> breaking
>> > the old one.
>> >
>> > Also wondering if other devs have thoughts on this.
>> >
>> > Mike
>> >
>> >
>> > On Thu, Apr 18, 2013 at 11:57 AM, Edward Sargisson <esarge@pobox.com
>> >wrote:
>> >
>> >> Hi all,
>> >> (I read this list in digest mode; would you mind ccing me on any
>> replies?)
>> >>
>> >> I've got two patches progressing through Jira (FLUME-1782, FLUME-1972).
>> >> -1782 fixes a defect where the wrong timestamp field and elasticsearch
>> >> index name are used. -1972 adds an interface which users can implement
>> to
>> >> assign an id instead of letting elasticsearch randomly assign on.
>> >>
>> >> The question to discuss: should we(I) combine those interfaces and just
>> >> have a single interface.
>> >>
>> >> Mike Percy has kindly reviewed FLUME-1782 and the knock-on effect of
>> his
>> >> comments require that the ElasticSearchEventSerializer interface be
>> >> changed
>> >> - and thus this becomes a breaking change. I had been attempting to
>> avoid
>> >> that and this is why -1972 has a new interface.
>> >>
>> >> If we're going to break the interface then maybe we should go all the
>> way
>> >> and put the new id provider functionality on to it as well? We could
>> also
>> >> rename it to ElasticsearchEventSerializer (lower case s on search) to
>> >> match
>> >> the way the maintainer of elasticsearch spells it.
>> >>
>> >> The new interface would be:
>> >> public interface ElasticSearchEventSerializer extends Configurable,
>> >>     ConfigurableComponent {
>> >>
>> >>   static final Charset charset = Charset.defaultCharset();
>> >>   XContentBuilder getContentBuilder(Event event, Long
>> timestampOverride)
>> >> throws IOException;
>> >>
>> >>   String getId(Event event)
>> >> }
>> >>
>> >> The timestampOverride would only be non-null if there was no timestamp
>> >> header.
>> >>
>> >> Thoughts?
>> >>
>> >> Cheers,
>> >> Edward
>> >>
>> >
>> >
>>
>
>

Re: ElasticSearchSink: should we combined the proposed interfaces for event serialization and id assignment

Posted by Mike Percy <mp...@apache.org>.
Thanks for the additional info, Edward. Let's avoid breaking API backwards
compatibility. I've added a comment on the review board, let's see if we
can just use instanceof to support an additional serializer API.

Mike



On Thu, Apr 18, 2013 at 4:18 PM, Edward Sargisson <es...@pobox.com> wrote:

> Hi Mike,
> To get more detailed, the problem is that writing the event to
> elasticsearch so that Kibana can read it requires that it be written to the
> correct index and have the correct @timestamp header. If the timestamps
> used for both of those things is not the same then there is a risk that
> Kibana won't be able to find the event. This is because Kibana searches for
> events by taking the time interval you specify in the interface,
> determining which indexes include that time interval and then querying
> those indexes by the @timestamp field.
>
> So, if the sink decides that it needs to provide the timestamp, and it
> can't put the timestamp into the event itself then the
> ElasticSearchEventSerializer implementation has no way to get that
> timestamp. If we relied on the serializer doing System.currentTimeMillis()
> itself then we run the risk of the event being written at midnight. It
> might decide to put the event in yesterday's index with today's date.
>
> Could we keep the old interface and only use the new one if it's there?
> The problem with that is that the sink may decide to provide a new
> timestamp - but the interface implementation has no way of being told what
> it is.
>
> I've just written the code to do it this way and will be adding it to the
> ticket sometime soon.
>
> Cheers,
> Edward
>
>
> On Thu, Apr 18, 2013 at 4:04 PM, Mike Percy <mp...@apache.org> wrote:
>
> > Edward,
> > I added you to CC but I would also recommend subscribing to the dev list
> > due to how the list headers are configured.
> >
> > This is a rough situation. I am loathe to break API compatibility but at
> > the same time I don't know much about ES and feel I need to find some
> time
> > to invest in understanding the ES concepts and how Flume is interacting
> > with them in this sink.
> >
> > Without thinking about it very hard yet, I'd ask if we can just add a new
> > interface that doesn't suck and maybe is more extensible without breaking
> > the old one.
> >
> > Also wondering if other devs have thoughts on this.
> >
> > Mike
> >
> >
> > On Thu, Apr 18, 2013 at 11:57 AM, Edward Sargisson <esarge@pobox.com
> >wrote:
> >
> >> Hi all,
> >> (I read this list in digest mode; would you mind ccing me on any
> replies?)
> >>
> >> I've got two patches progressing through Jira (FLUME-1782, FLUME-1972).
> >> -1782 fixes a defect where the wrong timestamp field and elasticsearch
> >> index name are used. -1972 adds an interface which users can implement
> to
> >> assign an id instead of letting elasticsearch randomly assign on.
> >>
> >> The question to discuss: should we(I) combine those interfaces and just
> >> have a single interface.
> >>
> >> Mike Percy has kindly reviewed FLUME-1782 and the knock-on effect of his
> >> comments require that the ElasticSearchEventSerializer interface be
> >> changed
> >> - and thus this becomes a breaking change. I had been attempting to
> avoid
> >> that and this is why -1972 has a new interface.
> >>
> >> If we're going to break the interface then maybe we should go all the
> way
> >> and put the new id provider functionality on to it as well? We could
> also
> >> rename it to ElasticsearchEventSerializer (lower case s on search) to
> >> match
> >> the way the maintainer of elasticsearch spells it.
> >>
> >> The new interface would be:
> >> public interface ElasticSearchEventSerializer extends Configurable,
> >>     ConfigurableComponent {
> >>
> >>   static final Charset charset = Charset.defaultCharset();
> >>   XContentBuilder getContentBuilder(Event event, Long timestampOverride)
> >> throws IOException;
> >>
> >>   String getId(Event event)
> >> }
> >>
> >> The timestampOverride would only be non-null if there was no timestamp
> >> header.
> >>
> >> Thoughts?
> >>
> >> Cheers,
> >> Edward
> >>
> >
> >
>

Re: ElasticSearchSink: should we combined the proposed interfaces for event serialization and id assignment

Posted by Edward Sargisson <es...@pobox.com>.
Hi Mike,
To get more detailed, the problem is that writing the event to
elasticsearch so that Kibana can read it requires that it be written to the
correct index and have the correct @timestamp header. If the timestamps
used for both of those things is not the same then there is a risk that
Kibana won't be able to find the event. This is because Kibana searches for
events by taking the time interval you specify in the interface,
determining which indexes include that time interval and then querying
those indexes by the @timestamp field.

So, if the sink decides that it needs to provide the timestamp, and it
can't put the timestamp into the event itself then the
ElasticSearchEventSerializer implementation has no way to get that
timestamp. If we relied on the serializer doing System.currentTimeMillis()
itself then we run the risk of the event being written at midnight. It
might decide to put the event in yesterday's index with today's date.

Could we keep the old interface and only use the new one if it's there?
The problem with that is that the sink may decide to provide a new
timestamp - but the interface implementation has no way of being told what
it is.

I've just written the code to do it this way and will be adding it to the
ticket sometime soon.

Cheers,
Edward


On Thu, Apr 18, 2013 at 4:04 PM, Mike Percy <mp...@apache.org> wrote:

> Edward,
> I added you to CC but I would also recommend subscribing to the dev list
> due to how the list headers are configured.
>
> This is a rough situation. I am loathe to break API compatibility but at
> the same time I don't know much about ES and feel I need to find some time
> to invest in understanding the ES concepts and how Flume is interacting
> with them in this sink.
>
> Without thinking about it very hard yet, I'd ask if we can just add a new
> interface that doesn't suck and maybe is more extensible without breaking
> the old one.
>
> Also wondering if other devs have thoughts on this.
>
> Mike
>
>
> On Thu, Apr 18, 2013 at 11:57 AM, Edward Sargisson <es...@pobox.com>wrote:
>
>> Hi all,
>> (I read this list in digest mode; would you mind ccing me on any replies?)
>>
>> I've got two patches progressing through Jira (FLUME-1782, FLUME-1972).
>> -1782 fixes a defect where the wrong timestamp field and elasticsearch
>> index name are used. -1972 adds an interface which users can implement to
>> assign an id instead of letting elasticsearch randomly assign on.
>>
>> The question to discuss: should we(I) combine those interfaces and just
>> have a single interface.
>>
>> Mike Percy has kindly reviewed FLUME-1782 and the knock-on effect of his
>> comments require that the ElasticSearchEventSerializer interface be
>> changed
>> - and thus this becomes a breaking change. I had been attempting to avoid
>> that and this is why -1972 has a new interface.
>>
>> If we're going to break the interface then maybe we should go all the way
>> and put the new id provider functionality on to it as well? We could also
>> rename it to ElasticsearchEventSerializer (lower case s on search) to
>> match
>> the way the maintainer of elasticsearch spells it.
>>
>> The new interface would be:
>> public interface ElasticSearchEventSerializer extends Configurable,
>>     ConfigurableComponent {
>>
>>   static final Charset charset = Charset.defaultCharset();
>>   XContentBuilder getContentBuilder(Event event, Long timestampOverride)
>> throws IOException;
>>
>>   String getId(Event event)
>> }
>>
>> The timestampOverride would only be non-null if there was no timestamp
>> header.
>>
>> Thoughts?
>>
>> Cheers,
>> Edward
>>
>
>

Re: ElasticSearchSink: should we combined the proposed interfaces for event serialization and id assignment

Posted by Mike Percy <mp...@apache.org>.
Edward,
I added you to CC but I would also recommend subscribing to the dev list
due to how the list headers are configured.

This is a rough situation. I am loathe to break API compatibility but at
the same time I don't know much about ES and feel I need to find some time
to invest in understanding the ES concepts and how Flume is interacting
with them in this sink.

Without thinking about it very hard yet, I'd ask if we can just add a new
interface that doesn't suck and maybe is more extensible without breaking
the old one.

Also wondering if other devs have thoughts on this.

Mike


On Thu, Apr 18, 2013 at 11:57 AM, Edward Sargisson <es...@pobox.com> wrote:

> Hi all,
> (I read this list in digest mode; would you mind ccing me on any replies?)
>
> I've got two patches progressing through Jira (FLUME-1782, FLUME-1972).
> -1782 fixes a defect where the wrong timestamp field and elasticsearch
> index name are used. -1972 adds an interface which users can implement to
> assign an id instead of letting elasticsearch randomly assign on.
>
> The question to discuss: should we(I) combine those interfaces and just
> have a single interface.
>
> Mike Percy has kindly reviewed FLUME-1782 and the knock-on effect of his
> comments require that the ElasticSearchEventSerializer interface be changed
> - and thus this becomes a breaking change. I had been attempting to avoid
> that and this is why -1972 has a new interface.
>
> If we're going to break the interface then maybe we should go all the way
> and put the new id provider functionality on to it as well? We could also
> rename it to ElasticsearchEventSerializer (lower case s on search) to match
> the way the maintainer of elasticsearch spells it.
>
> The new interface would be:
> public interface ElasticSearchEventSerializer extends Configurable,
>     ConfigurableComponent {
>
>   static final Charset charset = Charset.defaultCharset();
>   XContentBuilder getContentBuilder(Event event, Long timestampOverride)
> throws IOException;
>
>   String getId(Event event)
> }
>
> The timestampOverride would only be non-null if there was no timestamp
> header.
>
> Thoughts?
>
> Cheers,
> Edward
>