Posted to user@flume.apache.org by Otis Gospodnetic <ot...@gmail.com> on 2014/08/01 01:58:03 UTC

Re: AWS S3 flume source

+1 for seeing S3Source, starting with a JIRA issue.

But being able to dynamically add/remove S3 buckets from which to pull data
seems important.

Any suggestions for how to approach that?

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Thu, Jul 31, 2014 at 9:14 PM, Hari Shreedharan <hshreedharan@cloudera.com
> wrote:

> Please go ahead and file a jira. If you are willing to submit a patch, you
> can post it on the jira.
>
> Viral Bajaria wrote:
>
>
> I have a similar use case that cropped up yesterday. I saw the archive
> and found that there was a recommendation to build it as Sharninder
> suggested.
>
> For now, I went down the route of writing a Python script which
> downloads from S3 and puts the files in a directory which is
> configured to be picked up via a spooldir.
>
> I would prefer to get a direct S3 source, and maybe we could
> collaborate on it and open-source it. Let me know if you prefer that
> and we can work directly on it by creating a JIRA.
>
> Thanks,
> Viral
>
>
>
> On Thu, Jul 31, 2014 at 10:26 AM, Hari Shreedharan
> <hshreedharan@cloudera.com> wrote:
>
>     In both cases, Sharninder is right :)
>
>     Sharninder wrote:
>
>
>
>     As far as I know, there is no (open source) implementation of an S3
>     source, so yes, you'll have to implement your own. You'll have to
>     implement a PollableSource, and the dev documentation has an outline
>     that you can use. You can also look at the existing ExecSource and
>     work your way up.
>
>     As far as I know, there is no way to configure Flume without using
>     the configuration file.
>
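A minimal sketch of the PollableSource route described above (the class,
the "bucket" property name, and the polling logic are illustrative
assumptions; only the Flume interfaces and AWS SDK calls are real APIs):

    import org.apache.flume.Context;
    import org.apache.flume.EventDeliveryException;
    import org.apache.flume.PollableSource;
    import org.apache.flume.conf.Configurable;
    import org.apache.flume.event.EventBuilder;
    import org.apache.flume.source.AbstractSource;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.S3ObjectSummary;

    // Hypothetical polling source: lists a bucket on every poll and turns
    // new objects into events. State tracking is omitted for brevity.
    public class S3Source extends AbstractSource
        implements Configurable, PollableSource {

      private AmazonS3 s3;
      private String bucket;

      @Override
      public void configure(Context context) {
        bucket = context.getString("bucket");   // assumed property name
        s3 = AmazonS3ClientBuilder.defaultClient();
      }

      @Override
      public Status process() throws EventDeliveryException {
        boolean sawData = false;
        for (S3ObjectSummary summary :
            s3.listObjects(bucket).getObjectSummaries()) {
          // A real source must skip already-processed objects and read
          // the object body; here we only emit the key as a placeholder.
          getChannelProcessor().processEvent(
              EventBuilder.withBody(summary.getKey().getBytes()));
          sawData = true;
        }
        return sawData ? Status.READY : Status.BACKOFF;
      }
    }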
>
>
>     On Thu, Jul 31, 2014 at 7:57 PM, Paweł <prog88@gmail.com> wrote:
>
>         Hi,
>         I'm wondering if Flume is able to read directly from S3.
>
>         I'll describe my case. I have log files stored in AWS S3. I have
>         to periodically fetch new S3 objects and read log lines from them.
>         Then these log lines (events) are processed in the standard Flume
>         way (as with other sources).
>
>         *1) Is there any way to fetch S3 objects, or do I have to write
>         my own Source?*
>
>
>         There is also a second case. I want the Flume configuration to be
>         dynamic. Flume sources can change over time. New AWS keys and S3
>         buckets can be added or deleted.
>
>         *2) Is there any other way to configure Flume than by a static
>         configuration file?*
>
>         --
>         Paweł Róg
>
>
>

Re: AWS S3 flume source

Posted by Otis Gospodnetic <ot...@gmail.com>.
I was thinking the same.  I think the store (DB, FS, ZK, something else)
used to track state (what's been read from S3, what's been processed, etc.)
would ideally be abstract/extensible.
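
A minimal sketch of such a pluggable state store (the interface and all
method names are hypothetical, not an existing Flume API):

    // Hypothetical abstraction so an S3 source can keep its progress in
    // ZooKeeper, a database, the local filesystem, or anything else.
    public interface S3SourceStateStore {

      /** Has this object already been fully read and processed? */
      boolean isProcessed(String bucket, String key);

      /** Durably mark an object as processed. */
      void markProcessed(String bucket, String key);

      /** Last committed offset within an object, for restart recovery. */
      long getOffset(String bucket, String key);

      /** Record how far into an object we have read. */
      void saveOffset(String bucket, String key, long offset);
    }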

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Mon, Aug 11, 2014 at 9:33 AM, Ashish <pa...@gmail.com> wrote:

> Maybe it's best not to depend on ZK directly. Create some sort of
> abstraction which can use ZK, a DB, or some other mechanism to share the
> distributed state. How about keeping the distributed state out of the
> picture till we have a working S3 source, and plugging the meta-data
> piece into it later? It can store the state locally, like the
> SpoolingDirectorySource does.
>
> wdyt?
>
>
> On Mon, Aug 11, 2014 at 12:53 PM, Jonathan Natkins <na...@streamsets.com>
> wrote:
>
>> Yeah, I realize that. The reason I think it should be somewhat dependent
>> upon FLUME-1491 is that ZooKeeper seems to me to be a pretty heavy-weight
>> requirement just to use a particular source. FLUME-1491 would make Flume
>> generally dependent upon ZooKeeper, which is a good transition point to
>> start using ZK for other state that would be necessary for Flume
>> components. Would you agree?
>>
>>
>> On Sun, Aug 10, 2014 at 11:35 PM, Ashish <pa...@gmail.com> wrote:
>>
>>> Seems like a bit of confusion here. FLUME-1491 only deals with the
>>> configuration part, nothing else. Even if it gets integrated, you would
>>> still need to write/expose an API to store meta-data info in ZK
>>> (FLUME-1491 doesn't bring that in).
>>>
>>> HTH !
>>>
>>>
>>> On Mon, Aug 11, 2014 at 11:39 AM, Jonathan Natkins <natty@streamsets.com
>>> > wrote:
>>>
>>>> Given that FLUME-1491 hasn't been committed yet, and may still be a
>>>> ways away, does it seem reasonable to punt on having multiple sources
>>>> working off of a single bucket until ZK is integrated into Flume? The
>>>> alternative probably requires write access to the S3 bucket to record some
>>>> shared state, and would likely have to get rewritten once ZK integration
>>>> happens anyway.
>>>>
>>>>
>>>> On Tue, Aug 5, 2014 at 10:07 PM, Paweł Róg <pr...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I think that it is not possible to simply use SpoolDirectorySource.
>>>>> Maybe it will be possible to reuse some elements of SpoolDirectory, but
>>>>> without touching its code I think SpoolDirectory is not a good base. At
>>>>> the very beginning SpoolDirectorySource does this:
>>>>>
>>>>> File directory = new File(spoolDirectory);
>>>>>
>>>>> ReliableSpoolingFileEventReader also instantiates the File class.
>>>>> There is also a question: how does ReliableSpoolingFileEventReader store
>>>>> information about files that have already been processed in non-deleting
>>>>> mode? What happens after a Flume restart?
>>>>>
>>>>> I agree with Jonathan that the S3 source should be able to store the
>>>>> last processed file, e.g. in ZooKeeper.
>>>>> Another thing, Jonathan: I think you shouldn't worry about multiple
>>>>> buckets being handled by a single S3Source. As you wrote, multiple
>>>>> sources is the solution here. I thought it was already discussed, but
>>>>> maybe I'm wrong.
>>>>>
>>>>>
>>>>> >> 2. Is it fair to assume that we're dealing with character files,
>>>>> rather than binary objects?
>>>>>
>>>>> In my opinion the S3 source can by default read a file as simple text,
>>>>> but also take a configuration parameter with the class name of an
>>>>> "InputStream processor". This processor would be able to e.g. unzip,
>>>>> deserialize Avro, or read JSON and convert it into log events. What do
>>>>> you think?
>>>>>
>>>>> --
>>>>> Paweł Róg
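
A minimal sketch of such a pluggable "InputStream processor" (the
interface and class names here are hypothetical):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.zip.GZIPInputStream;
    import org.apache.flume.Event;
    import org.apache.flume.event.EventBuilder;

    // Hypothetical plugin contract: turn a raw S3 object stream into events.
    public interface InputStreamProcessor {
      List<Event> process(InputStream in) throws IOException;
    }

    // Example implementation for gzipped, line-oriented logs.
    class GzipLineProcessor implements InputStreamProcessor {
      @Override
      public List<Event> process(InputStream in) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(
            new GZIPInputStream(in), StandardCharsets.UTF_8));
        List<Event> events = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
          events.add(EventBuilder.withBody(line, StandardCharsets.UTF_8));
        }
        return events;
      }
    }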
>>>>>
>>>>> 2014-08-06 5:12 GMT+02:00 Viral Bajaria <vi...@gmail.com>:
>>>>>
>>>>> I agree with the feedback provided by Ashish.
>>>>>>
>>>>>> I have started writing one which is similar to the ExecSource, but I
>>>>>> like the idea of doing something where spooldir takes over most of the
>>>>>> hard work of spitting out events to sinks. Let me think more about how
>>>>>> to structure that.
>>>>>>
>>>>>> Thinking out loud: I could create a source which extends the spooldir
>>>>>> source and just spins off a thread to manage moving things from S3 to
>>>>>> the spooldir via a temporary directory, as sketched below.
>>>>>>
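A rough sketch of that downloader thread (the class and directory names
are illustrative; only the AWS SDK calls are real):

    import java.io.File;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.model.GetObjectRequest;
    import com.amazonaws.services.s3.model.S3ObjectSummary;

    // Hypothetical helper: copies new S3 objects into the spool directory
    // so the stock SpoolDirectorySource can do the event-emitting work.
    public class S3ToSpoolDirWorker implements Runnable {
      private final AmazonS3 s3;
      private final String bucket;
      private final File tmpDir;    // download target
      private final File spoolDir;  // directory watched by the spooldir source

      public S3ToSpoolDirWorker(AmazonS3 s3, String bucket,
                                File tmpDir, File spoolDir) {
        this.s3 = s3;
        this.bucket = bucket;
        this.tmpDir = tmpDir;
        this.spoolDir = spoolDir;
      }

      @Override
      public void run() {
        for (S3ObjectSummary summary :
            s3.listObjects(bucket).getObjectSummaries()) {
          // Download into the temp dir first so the spool dir never sees
          // a partially written file, then move the finished file in.
          File tmp = new File(tmpDir, summary.getKey().replace('/', '_'));
          s3.getObject(new GetObjectRequest(bucket, summary.getKey()), tmp);
          tmp.renameTo(new File(spoolDir, tmp.getName()));
          // A real implementation must remember which keys were already
          // copied, e.g. via the metadata store discussed in this thread.
        }
      }
    }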
>>>>>> Regarding maintaining metadata, there are two ways:
>>>>>> 1) DB: I currently maintain it in a database because there are a lot
>>>>>> of other tools built around it.
>>>>>> 2) File: keep the info in memory and in a file, to help with crash
>>>>>> recovery and to guard against high memory usage.
>>>>>>
>>>>>> Thanks,
>>>>>> Viral
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 5, 2014 at 8:04 PM, Ashish <pa...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Sharing some random thoughts:
>>>>>>>
>>>>>>> 1. Download the file using the S3 SDK and let the SpoolDirectory
>>>>>>> implementation take care of the rest. Like a Decorator in front of
>>>>>>> SpoolDirectory.
>>>>>>>
>>>>>>> 2. Use the S3 SDK to create an InputStream over S3 objects directly
>>>>>>> in code and create events out of it.
>>>>>>>
>>>>>>> It would be great to reuse an existing implementation which is based
>>>>>>> on InputStream and feed it the S3 object input stream; the concern of
>>>>>>> metadata storage still remains. Most often S3 objects are stored in
>>>>>>> compressed form, so this source would need to take care of
>>>>>>> compression (gz/Avro/others).
>>>>>>>
>>>>>>> Best is to start with something that works and then start adding
>>>>>>> more features to it.
>>>>>>>
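A short sketch of option 2, obtaining an InputStream over an S3 object
(assuming the AWS SDK v1 client; the gzip handling and line loop are
illustrative):

    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPInputStream;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.model.S3Object;

    public class S3StreamReader {
      // Reads one S3 object line by line; each line would become an event.
      public static void readLines(AmazonS3 s3, String bucket, String key)
          throws Exception {
        S3Object object = s3.getObject(bucket, key);
        InputStream raw = object.getObjectContent();
        InputStream in = key.endsWith(".gz") ? new GZIPInputStream(raw) : raw;
        try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(in, StandardCharsets.UTF_8))) {
          String line;
          while ((line = reader.readLine()) != null) {
            // hand the line to the channel as a Flume event here
          }
        }
      }
    }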
>>>>>>>
>>>>>>> On Wed, Aug 6, 2014 at 2:27 AM, Jonathan Natkins <
>>>>>>> natty@streamsets.com> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I started trying to write some code on this, and realized there are
>>>>>>>> a number of issues that need to be discussed in order to really design this
>>>>>>>> feature effectively. The requirements that have been discussed thus far are:
>>>>>>>>
>>>>>>>> 1. Fetching data from S3 periodically
>>>>>>>> 2. Fetching data from multiple S3 buckets -- This may be something
>>>>>>>> that should be punted on until later. For a first implementation, this
>>>>>>>> could be solved just by having multiple sources, each with a single S3
>>>>>>>> bucket
>>>>>>>> 3. Associating an S3 bucket with a user/token/key -- *Otis - can
>>>>>>>> you clarify what you mean by this?*
>>>>>>>> 4. Dynamically reconfigure the source -- This is blocked by
>>>>>>>> FLUME-1491, so I think this is out-of-scope for discussions at the moment
>>>>>>>>
>>>>>>>> Some questions I want to try to answer:
>>>>>>>>
>>>>>>>> 1. How do we identify and track objects that need to be processed
>>>>>>>> versus objects that have been processed already?
>>>>>>>> 1a. What if we want to have multiple sources working against
>>>>>>>> the same bucket to speed up processing?
>>>>>>>> 2. Is it fair to assume that we're dealing with character files,
>>>>>>>> rather than binary objects?
>>>>>>>>
>>>>>>>> For the first question, if we ignore the multiple-source extension
>>>>>>>> of the question, I think the simplest answer is to do something on
>>>>>>>> the local filesystem, like have a tracking directory that contains a
>>>>>>>> list of to-be-processed objects and a list of already-processed
>>>>>>>> objects (a sketch follows below). However, if the source goes down,
>>>>>>>> what should the restart semantics be? It seems that the ideal
>>>>>>>> situation is to store this state in a system like ZooKeeper, which
>>>>>>>> would ensure that a number of sources could operate off of the same
>>>>>>>> bucket, but this probably requires FLUME-1491 first.
>>>>>>>>
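A minimal sketch of the local tracking-directory idea (the file name and
layout are assumptions):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical tracker: an append-only file of already-processed S3
    // keys, reloaded on restart so the source does not re-emit old objects.
    public class LocalObjectTracker {
      private final Path doneList;
      private final Set<String> processed = new HashSet<>();

      public LocalObjectTracker(Path trackingDir) throws IOException {
        Files.createDirectories(trackingDir);
        doneList = trackingDir.resolve("processed-keys.txt");
        if (Files.exists(doneList)) {
          processed.addAll(Files.readAllLines(doneList, StandardCharsets.UTF_8));
        }
      }

      public boolean isProcessed(String key) {
        return processed.contains(key);
      }

      public void markProcessed(String key) throws IOException {
        // Append to disk before updating memory: a crash then errs toward
        // re-processing an object (at-least-once) rather than losing data.
        Files.write(doneList, (key + "\n").getBytes(StandardCharsets.UTF_8),
            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        processed.add(key);
      }
    }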
>>>>>>>> For the second question, my feeling was just that we should work
>>>>>>>> with similar assumptions to how the SpoolingDirectorySource works, where
>>>>>>>> each line is a separate event. Does that seem reasonable?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Natty
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Aug 1, 2014 at 11:31 AM, Paweł <pr...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> Thanks for the explanation, Jonathan. I think I will also start
>>>>>>>>> working on it. When you have any patch (even a draft) I'd be glad if
>>>>>>>>> you could attach it to the JIRA. I'll do the same.
>>>>>>>>> What do you think?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Paweł Róg
>>>>>>>>>
>>>>>>>>> 2014-08-01 20:19 GMT+02:00 Hari Shreedharan <
>>>>>>>>> hshreedharan@cloudera.com>:
>>>>>>>>>
>>>>>>>>> +1 on an S3 Source. I would gladly review.
>>>>>>>>>>
>>>>>>>>>> Jonathan Natkins wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hey Pawel,
>>>>>>>>>>
>>>>>>>>>> My intention is to start working on it, but I don't know exactly
>>>>>>>>>> how long it will take, and I'm not a committer, so time estimates
>>>>>>>>>> would have to be taken with a grain of salt regardless. If this is
>>>>>>>>>> something that you need urgently, it may not be ideal to wait for
>>>>>>>>>> me; you may want to start building something for yourself.
>>>>>>>>>>
>>>>>>>>>> That said, as mentioned in the other thread, dynamic configuration
>>>>>>>>>> can be done by refreshing the configuration files across the set of
>>>>>>>>>> Flume agents. It's certainly not as great as having a single place
>>>>>>>>>> to change it (e.g. ZooKeeper), but it's a way to get the job done.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Natty
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Aug 1, 2014 at 1:33 AM, Paweł <prog88@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>     Hi,
>>>>>>>>>>     Jonathan, how should we interpret your last e-mail? You opened
>>>>>>>>>>     a JIRA issue and want to start implementing this; do you have
>>>>>>>>>>     any estimate of how long it will take?
>>>>>>>>>>
>>>>>>>>>>     I think the biggest challenge here is to have dynamic
>>>>>>>>>>     configuration of Flume. It doesn't seem to be part of the
>>>>>>>>>>     FLUME-2437 issue. Am I right?
>>>>>>>>>>
>>>>>>>>>>     > Would you need to be able to pull files from multiple S3
>>>>>>>>>>     directories with the same source?
>>>>>>>>>>
>>>>>>>>>>     I think we don't need to track multiple S3 buckets with a
>>>>>>>>>>     single source. I just imagine an approach where each S3 source
>>>>>>>>>>     can be added or deleted on demand and attached to any Channel.
>>>>>>>>>>     I'm only worried about this dynamic configuration. I'll open a
>>>>>>>>>>     new thread about it. It seems we have two totally separate
>>>>>>>>>>     things:
>>>>>>>>>>     * build an S3 source
>>>>>>>>>>     * make Flume configurable dynamically
>>>>>>>>>>
>>>>>>>>>>     --
>>>>>>>>>>     Paweł
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>     2014-08-01 9:51 GMT+02:00 Otis Gospodnetic
>>>>>>>>>>     <otis.gospodnetic@gmail.com>:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>         Hi,
>>>>>>>>>>
>>>>>>>>>>         On Fri, Aug 1, 2014 at 4:52 AM, Jonathan Natkins
>>>>>>>>>>         <natty@streamsets.com> wrote:
>>>>>>>>>>
>>>>>>>>>>             Hey all,
>>>>>>>>>>
>>>>>>>>>>             I created a JIRA for this:
>>>>>>>>>>             https://issues.apache.org/jira/browse/FLUME-2437
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>         Thanks!  Should Fix Version be set to the next Flume
>>>>>>>>>> release
>>>>>>>>>>         version?
>>>>>>>>>>
>>>>>>>>>>             I thought I'd start working on one myself, which can
>>>>>>>>>>             hopefully be contributed back. I'm curious: do you
>>>>>>>>>> have
>>>>>>>>>>             particular requirements? Based on the emails in this
>>>>>>>>>>             thread, it sounds like the original goal was to have
>>>>>>>>>>             something that's like a SpoolDirectorySource that just
>>>>>>>>>>             picks up new files from S3. Is that accurate?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>         Yes, I think so.  We need to be able to:
>>>>>>>>>>         * fetch data (logs, for pulling them into Logsene
>>>>>>>>>>         <http://sematext.com/logsene/>) from S3 periodically
>>>>>>>>>>         (e.g. every 1 min, every 5 min, etc.)
>>>>>>>>>>         * fetch data from multiple S3 buckets
>>>>>>>>>>         * associate an S3 bucket with a user/token/key
>>>>>>>>>>         * dynamically (i.e. without editing/writing config files
>>>>>>>>>>         stored on disk) add new S3 buckets from which data should
>>>>>>>>>>         be fetched
>>>>>>>>>>         * dynamically (i.e. without editing/writing config files
>>>>>>>>>>         stored on disk) stop fetching data from some S3 buckets
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>             Would you need to be able to pull files from multiple
>>>>>>>>>> S3
>>>>>>>>>>             directories with the same source?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>         I think the above addresses this question.
>>>>>>>>>>
>>>>>>>>>>             Thanks,
>>>>>>>>>>             Natty
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>         Thanks!
>>>>>>>>>>
>>>>>>>>>>         Otis
>>>>>>>>>>         --
>>>>>>>>>>         Performance Monitoring * Log Analytics * Search Analytics
>>>>>>>>>>         Solr & Elasticsearch Support * http://sematext.com/

Re: AWS S3 flume source

Posted by Ashish <pa...@gmail.com>.
Maybe it's best not to depend on ZK directly. Create some sort of
abstraction which can use ZK, a DB, or some other mechanism to share the
distributed state. How about keeping the distributed state out of the
picture till we have a working S3 source, and plugging the meta-data
piece into it later? It can store the state locally, like the
SpoolingDirectorySource does.

wdyt?


-- 
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal

Re: AWS S3 flume source

Posted by Jonathan Natkins <na...@streamsets.com>.
Yeah, I realize that. The reason I think it should be somewhat dependent
upon FLUME-1491 is that ZooKeeper seems to me to be a pretty heavy-weight
requirement just to use a particular source. FLUME-1491 would make Flume
generally dependent upon ZooKeeper, which is a good transition point to
start using ZK for other state that would be necessary for Flume
components. Would you agree?



Re: AWS S3 flume source

Posted by Ashish <pa...@gmail.com>.
Seems like a bit of confusion here. FLUME-1491 only deals with the
configuration part, nothing else. Even if it gets integrated, you would
still need to write/expose an API to store meta-data info in ZK
(FLUME-1491 doesn't bring that in).
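
To make this concrete, here is a minimal sketch of the kind of API I mean
(names are illustrative, this is not an existing Flume or FLUME-1491
interface); a ZK-backed implementation could keep, say, one znode per
processed object:

// Hypothetical state-store API; the backing store (ZK, a DB, a local
// file) stays an implementation detail behind this interface.
public interface S3SourceStateStore {

  /** True if this S3 object has already been fully consumed. */
  boolean isProcessed(String bucket, String key);

  /** Record that this object is done, e.g. by creating a znode like
      /flume/s3source/<bucket>/<key>. */
  void markProcessed(String bucket, String key);

  /** Last committed byte offset within an object (0 if unknown), so a
      restarted source can resume mid-object. */
  long getOffset(String bucket, String key);

  void commitOffset(String bucket, String key, long offset);
}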

HTH !




-- 
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal

Re: AWS S3 flume source

Posted by Jonathan Natkins <na...@streamsets.com>.
Given that FLUME-1491 hasn't been committed yet, and may still be a ways
away, does it seem reasonable to punt on having multiple sources working
off of a single bucket until ZK is integrated into Flume? The alternative
probably requires write access to the S3 bucket to record some shared
state, and would likely have to get rewritten once ZK integration happens
anyway.
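
For reference, the bucket-based alternative could look roughly like this
(AWS SDK for Java; the class and key names are made up for illustration).
Note that the check-then-write is not atomic, which is exactly why it would
get rewritten once ZK is available:

import java.io.ByteArrayInputStream;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.AmazonS3Exception;
import com.amazonaws.services.s3.model.ObjectMetadata;

// Records "object processed" as a zero-byte marker object in the bucket.
public class S3MarkerState {
  private final AmazonS3 s3 = new AmazonS3Client(); // default credential chain
  private static final String PREFIX = "_flume/processed/";

  public boolean isProcessed(String bucket, String key) {
    try {
      s3.getObjectMetadata(bucket, PREFIX + key);
      return true;
    } catch (AmazonS3Exception e) {
      if (e.getStatusCode() == 404) return false; // no marker yet
      throw e;
    }
  }

  public void markProcessed(String bucket, String key) {
    // Not atomic with isProcessed(): two sources can still race on the
    // same object, hence the preference for ZK as the long-term answer.
    ObjectMetadata md = new ObjectMetadata();
    md.setContentLength(0);
    s3.putObject(bucket, PREFIX + key, new ByteArrayInputStream(new byte[0]), md);
  }
}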



Re: AWS S3 flume source

Posted by Paweł Róg <pr...@gmail.com>.
Hi,

I think it is not possible to simply use SpoolDirectorySource. It may be
possible to reuse some elements of SpoolDirectory, but without touching its
code I think SpoolDirectory is not a good base. At the very beginning,
SpoolDirectorySource does this:

File directory = new File(spoolDirectory);

ReliableSpoolingFileEventReader also instantiates the File class.
There is also a question: how does ReliableSpoolingFileEventReader store
information about files that have already been processed in non-deleting
mode? What happens after a Flume restart?

I agree with Jonathan that the S3 source should be able to store the last
processed file, e.g. in ZooKeeper.
Another thing, Jonathan: I think you shouldn't worry about multiple buckets
being handled by a single S3Source. As you wrote, multiple sources are
the solution here. I thought this was already discussed, but maybe I'm wrong.


>> 2. Is it fair to assume that we're dealing with character files, rather
>> than binary objects?

In my opinion the S3 source can by default read a file as simple text, but
also take a configuration parameter with the class name of an "InputStream
processor". This processor would be able to e.g. unzip, deserialize Avro,
or read JSON and convert it into log events. What do you think?
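
For concreteness, a minimal sketch of what I have in mind (the interface
and class names are made up; a gzip variant would just wrap the stream in
a GZIPInputStream first):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

// Pluggable per-object decoder; the S3 source instantiates the class
// named in its configuration.
public interface S3ObjectProcessor {
  List<Event> process(InputStream in) throws IOException;
}

// Default behaviour: treat the object as plain text, one event per line.
class PlainTextProcessor implements S3ObjectProcessor {
  private static final Charset UTF8 = Charset.forName("UTF-8");

  public List<Event> process(InputStream in) throws IOException {
    List<Event> events = new ArrayList<Event>();
    BufferedReader reader = new BufferedReader(new InputStreamReader(in, UTF8));
    String line;
    while ((line = reader.readLine()) != null) {
      events.add(EventBuilder.withBody(line, UTF8));
    }
    return events;
  }
}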

--
Paweł Róg


Re: AWS S3 flume source

Posted by Viral Bajaria <vi...@gmail.com>.
I agree with the feedback provided by Ashish.

I have started writing one which is similar to the ExecSource, but I like
the idea of doing something where spooldir takes over most of the hard work
of spitting out events to sinks. Let me think more on how to structure
that.

Thinking out loud: I could create a source which extends the spooldir
source and just spins off a thread that moves things from S3 to the
spooldir via a temporary directory.
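
Roughly what I'm picturing for that thread (AWS SDK for Java; names are
illustrative, and listing pagination plus already-downloaded bookkeeping
are left out):

import java.io.File;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3ObjectSummary;

// Downloads S3 objects into a temp dir, then renames them into the
// spooldir so the spooling source never sees a half-written file.
public class S3ToSpoolDir implements Runnable {
  private final AmazonS3 s3 = new AmazonS3Client(); // default credential chain
  private final String bucket;
  private final File tmpDir;   // must be on the same filesystem as spoolDir
  private final File spoolDir; // so that renameTo() is effectively atomic

  public S3ToSpoolDir(String bucket, File tmpDir, File spoolDir) {
    this.bucket = bucket;
    this.tmpDir = tmpDir;
    this.spoolDir = spoolDir;
  }

  public void run() {
    for (S3ObjectSummary obj : s3.listObjects(bucket).getObjectSummaries()) {
      String key = obj.getKey();
      File tmp = new File(tmpDir, key.replace('/', '_'));
      s3.getObject(new GetObjectRequest(bucket, key), tmp); // full download first
      tmp.renameTo(new File(spoolDir, tmp.getName()));      // then publish
    }
  }
}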

Regarding maintaining metadata, there are 2 ways:
1) DB: I currently maintain it in a database because there are a lot of
other tools built around it.
2) File: Just keep the info in memory and in a file to help with crash
recovery and/or high memory usage.

Thanks,
Viral




On Tue, Aug 5, 2014 at 8:04 PM, Ashish <pa...@gmail.com> wrote:

> Sharing some random thoughts
>
> 1. Download the file using the S3 SDK and let the SpoolDirectory
> implementation take care of the rest, like a Decorator in front of
> SpoolDirectory.
>
> 2. Use the S3 SDK to create InputStreams of S3 objects directly in code
> and create events out of them.
>
> It would be great to reuse an existing implementation which is based on
> InputStream and feed it with the S3 object input stream; the concern of
> metadata storage still remains. Most often S3 objects are stored in
> compressed form, so this source would need to take care of compression
> (gz/avro/others).
>
> Best is to start with something that works and then start adding more
> features to it.
>
>
> On Wed, Aug 6, 2014 at 2:27 AM, Jonathan Natkins <na...@streamsets.com>
> wrote:
>
>> Hi all,
>>
>> I started trying to write some code on this, and realized there are a
>> number of issues that need to be discussed in order to really design this
>> feature effectively. The requirements that have been discussed thus far are:
>>
>> 1. Fetching data from S3 periodically
>> 2. Fetching data from multiple S3 buckets -- This may be something that
>> should be punted on until later. For a first implementation, this could be
>> solved just by having multiple sources, each with a single S3 bucket
>> 3. Associating an S3 bucket with a user/token/key -- *Otis - can you
>> clarify what you mean by this?*
>> 4. Dynamically reconfigure the source -- This is blocked by FLUME-1491,
>> so I think this is out-of-scope for discussions at the moment
>>
>> Some questions I want to try to answer:
>>
>> 1. How do we identify and track objects that need to be processed versus
>> objects that have been processed already?
>> 1a. What about if we want to have multiple sources working against the
>> same bucket to speed processing?
>> 2. Is it fair to assume that we're dealing with character files, rather
>> than binary objects?
>>
>>  For the first question, if we ignore the multiple source extension of
>> the question, I think the simplest answer is to do something on the local
>> filesystem, like have a tracking directory that contains a list of
>> to-be-processed objects and a list of already-processed objects. However,
>> if the source goes down, what should the restart semantics be? It seems
>> that the ideal situation is to store this state in a system like ZooKeeper,
>> which would ensure that a number of sources could operate off of the same
>> bucket, but this probably requires FLUME-1491 first.
>>
>> For the second question, my feeling was just that we should work with
>> similar assumptions to how the SpoolingDirectorySource works, where each
>> line is a separate event. Does that seem reasonable?
>>
>> Thanks,
>> Natty
>>
>>
>> On Fri, Aug 1, 2014 at 11:31 AM, Paweł <pr...@gmail.com> wrote:
>>
>>> Hi,
>>> Thanks for the explanation, Jonathan. I think I will also start working on
>>> it. When you have any patch (even a draft) I'd be glad if you could attach
>>> it in JIRA. I'll do the same.
>>> What do you think?
>>>
>>> --
>>> Paweł Róg
>>>
>>> 2014-08-01 20:19 GMT+02:00 Hari Shreedharan <hs...@cloudera.com>:
>>>
>>> +1 on an S3 Source. I would gladly review.
>>>>
>>>> Jonathan Natkins wrote:
>>>>
>>>>
>>>> Hey Pawel,
>>>>
>>>> My intention is to start working on it, but I don't know exactly how
>>>> long it will take, and I'm not a committer, so time estimates would
>>>> have to be taken with a grain of salt regardless. If this is something
>>>> that you need urgently, it may not be ideal to wait for me; you may want
>>>> to start building something yourself.
>>>>
>>>> That said, as mentioned in the other thread, dynamic configuration can
>>>> be done by refreshing the configuration files across the set of Flume
>>>> agents. It's certainly not as great as having a single place to change
>>>> it (e.g. ZooKeeper), but it's a way to get the job done.
>>>>
>>>> Thanks,
>>>> Natty
>>>>
>>>>
>>>> On Fri, Aug 1, 2014 at 1:33 AM, Paweł <prog88@gmail.com
>>>> <ma...@gmail.com>> wrote:
>>>>
>>>>     Hi,
>>>>     Jonathan, how should we interpret your last e-mail? You opened a
>>>>     JIRA issue and want to start implementing this; do you have any
>>>>     estimate of how long it will take?
>>>>
>>>>     I think the biggest challenge here is to have dynamic
>>>>     configuration of Flume. It doesn't seem to be part of FLUME-2437
>>>>     issue. Am I right?
>>>>
>>>>     > Would you need to be able to pull files from multiple S3
>>>>     directories with the same source?
>>>>
>>>>     I think we don't need to track multiple S3 buckets with a single
>>>>     source. I just imagine an approach where each S3 source can be
>>>>     added or deleted on demand and attached to any Channel. I'm only
>>>>     worried about this dynamic configuration. I'll open a new thread
>>>>     about this. It seems we have two totally separate things:
>>>>     * build S3 source
>>>>     * make flume configurable dynamically
>>>>
>>>>     --
>>>>     Paweł
>>>>
>>>>
>>>>     2014-08-01 9:51 GMT+02:00 Otis Gospodnetic
>>>>     <otis.gospodnetic@gmail.com <ma...@gmail.com>>:
>>>>
>>>>
>>>>         Hi,
>>>>
>>>>         On Fri, Aug 1, 2014 at 4:52 AM, Jonathan Natkins
>>>>         <natty@streamsets.com <ma...@streamsets.com>> wrote:
>>>>
>>>>             Hey all,
>>>>
>>>>             I created a JIRA for this:
>>>>             https://issues.apache.org/jira/browse/FLUME-2437
>>>>
>>>>
>>>>         Thanks!  Should Fix Version be set to the next Flume release
>>>>         version?
>>>>
>>>>             I thought I'd start working on one myself, which can
>>>>             hopefully be contributed back. I'm curious: do you have
>>>>             particular requirements? Based on the emails in this
>>>>             thread, it sounds like the original goal was to have
>>>>             something that's like a SpoolDirectorySource that just
>>>>             picks up new files from S3. Is that accurate?
>>>>
>>>>
>>>>         Yes, I think so.  We need to be able to:
>>>>         * fetch data (logs, for pulling them into Logsene
>>>>         <http://sematext.com/logsene/>) from S3 periodically (e.g.
>>>>
>>>>         every 1 min, every 5 min, etc.)
>>>>         * fetch data from multiple S3 buckets
>>>>         * associate an S3 bucket with a user/token/key
>>>>         * dynamically (i.e. without editing/writing config files
>>>>         stored on disk) add new S3 buckets from which data should be
>>>> fetched
>>>>         * dynamically (i.e. without editing/writing config files
>>>>         stored on disk) stop fetching data from some S3 buckets
>>>>
>>>>
>>>>             Would you need to be able to pull files from multiple S3
>>>>             directories with the same source?
>>>>
>>>>
>>>>         I think the above addresses this question.
>>>>
>>>>             Thanks,
>>>>             Natty
>>>>
>>>>
>>>>         Thanks!
>>>>
>>>>         Otis
>>>>         --
>>>>         Performance Monitoring * Log Analytics * Search Analytics
>>>>         Solr & Elasticsearch Support * http://sematext.com/
>>>>
>>>>
>>>>
>>>>             On Thu, Jul 31, 2014 at 4:58 PM, Otis Gospodnetic
>>>>             <otis.gospodnetic@gmail.com
>>>>             <ma...@gmail.com>> wrote:
>>>>
>>>>                 +1 for seeing S3Source, starting with a JIRA issue.
>>>>
>>>>                 But being able to dynamically add/remove S3 buckets
>>>>                 from which to pull data seems important.
>>>>
>>>>                 Any suggestions for how to approach that?
>>>>
>>>>                 Otis
>>>>                 --
>>>>                 Performance Monitoring * Log Analytics * Search
>>>> Analytics
>>>>                 Solr & Elasticsearch Support * http://sematext.com/
>>>>
>>>>
>>>>                 On Thu, Jul 31, 2014 at 9:14 PM, Hari Shreedharan
>>>>                 <hshreedharan@cloudera.com
>>>>                 <ma...@cloudera.com>> wrote:
>>>>
>>>>                     Please go ahead and file a jira. If you are
>>>>                     willing to submit a patch, you can post it on the
>>>>                     jira.
>>>>
>>>>                     Viral Bajaria wrote:
>>>>
>>>>
>>>>
>>>>                     I have a similar use case that cropped up
>>>>                     yesterday. I saw the archive
>>>>                     and found that there was a recommendation to
>>>>                     build it as Sharninder
>>>>                     suggested.
>>>>
>>>>                     For now, I went down the route of writing a
>>>>                     python script which
>>>>                     downloads from S3 and puts the files in a
>>>>                     directory which is
>>>>                     configured to be picked up via a spooldir.
>>>>
>>>>                     I would prefer to get a direct S3 source, and
>>>>                     maybe we could
>>>>                     collaborate on it and open-source it. Let me know
>>>>                     if you prefer that
>>>>                     and we can work directly on it by creating a JIRA.
>>>>
>>>>                     Thanks,
>>>>                     Viral
>>>>
>>>>
>>>>
>>>>                     On Thu, Jul 31, 2014 at 10:26 AM, Hari Shreedharan
>>>>                     <hshreedharan@cloudera.com
>>>>                     <ma...@cloudera.com>
>>>>                     <mailto:hshreedharan@cloudera.com
>>>>
>>>>                     <ma...@cloudera.com>>> wrote:
>>>>
>>>>                         In both cases, Sharninder is right :)
>>>>
>>>>                         Sharninder wrote:
>>>>
>>>>
>>>>
>>>>
>>>>                         As far as I know, there is no (open source)
>>>>                     implementation of an S3
>>>>                         source, so yes, you'll have to implement
>>>>                     your own. You'll have to
>>>>                         implement a Pollable source and the dev
>>>>                     documentation has an outline
>>>>                         that you can use. You can also look at the
>>>>                     existing Execsource and
>>>>                         work your way up.
>>>>
>>>>                         As far as I know, there is no way to
>>>>                     configure flume without
>>>>                         using the
>>>>                         configuration file.
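
A very rough sketch of the PollableSource approach Sharninder describes,
modeled loosely on how existing sources like ExecSource are structured
(Flume 1.x API; the class name, the "bucket" property, and the
pollNextLine() stub are illustrative only, not from any real implementation):

    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.EventDeliveryException;
    import org.apache.flume.PollableSource;
    import org.apache.flume.conf.Configurable;
    import org.apache.flume.event.EventBuilder;
    import org.apache.flume.source.AbstractSource;

    public class S3Source extends AbstractSource
        implements Configurable, PollableSource {

      private String bucket;

      @Override
      public void configure(Context context) {
        // picked up from flume.conf, e.g. a1.sources.s3.bucket = my-logs
        bucket = context.getString("bucket");
      }

      @Override
      public Status process() throws EventDeliveryException {
        // poll S3 for new data here; BACKOFF tells Flume to sleep a bit
        byte[] line = pollNextLine();
        if (line == null) {
          return Status.BACKOFF;
        }
        Event event = EventBuilder.withBody(line);
        getChannelProcessor().processEvent(event);
        return Status.READY;
      }

      // stub: the actual S3 listing/reading logic would live here
      private byte[] pollNextLine() {
        return null;
      }
    }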
>>>>
>>>>
>>>>
>>>>                         On Thu, Jul 31, 2014 at 7:57 PM, Paweł
>>>>                     <prog88@gmail.com <ma...@gmail.com>
>>>>                     <mailto:prog88@gmail.com <ma...@gmail.com>>
>>>>                     <mailto:prog88@gmail.com
>>>>                     <ma...@gmail.com>
>>>>                     <mailto:prog88@gmail.com
>>>>                     <ma...@gmail.com>>>> wrote:
>>>>
>>>>                             Hi,
>>>>                             I'm wondering if Flume is able to read
>>>>                     directly from S3.
>>>>
>>>>                             I'll describe my case. I have log files
>>>>                     stored in AWS S3. I have
>>>>                             to fetch periodically new S3 objects and
>>>>                     read log lines from it.
>>>>                             Than use log lines (events) are
>>>>                     processed in standard flume's way
>>>>                             (as with other sources).
>>>>
>>>>                             *1) Is there any way to fetch S3 objects
>>>>                     or I have to write
>>>>                         my own
>>>>                             Source?*
>>>>
>>>>
>>>>                             There is also second case. I want to
>>>>                     have flume configuration
>>>>                             dynamic. Flume sources can change in
>>>>                     time. New AWS key and S3
>>>>                             bucket can be added or deleted.
>>>>
>>>>                             *2) Is there any other way to configure
>>>>                     Flume than by static
>>>>                             configuration file?*
>>>>
>>>>                             --
>>>>                             Paweł Róg
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
>
> --
> thanks
> ashish
>
> Blog: http://www.ashishpaliwal.com/blog
> My Photo Galleries: http://www.pbase.com/ashishpaliwal
>

Re: AWS S3 flume source

Posted by Ashish <pa...@gmail.com>.
On Mon, Aug 11, 2014 at 4:04 PM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:

> Hi,
>
> On Wed, Aug 6, 2014 at 5:04 AM, Ashish <pa...@gmail.com> wrote:
>
>> Sharing some random thoughts
>>
>> 1. Download the file using S3 SDK and let the SpoolDirectory
>> implementation take care of rest. Like a Decorator in front of
>> SpoolDirectory
>>
>
> My worry is that using SpoolDirectory requires temporary writes to the FS
> and if you are using Flume to process a lot of data, then any large amounts
> of data to disk will slow things down quite a bit.
>

True, it does add its own ecosystem of troubles, like moving files
around. It's a quick solution that can work, not necessarily the best one :)
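
For anyone who wants that stopgap in Java rather than Python, the
download-into-spooldir step might look roughly like this (a sketch assuming
the AWS SDK for Java; bucket name and paths are placeholders, it only
handles the first page of listings, and the spooldir source's ignorePattern
should be set to skip *.tmp files):

    import java.io.File;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.GetObjectRequest;
    import com.amazonaws.services.s3.model.S3ObjectSummary;

    public class S3ToSpoolDir {
      public static void main(String[] args) {
        AmazonS3 s3 = new AmazonS3Client();   // credentials from default chain
        String bucket = "my-log-bucket";      // placeholder
        File spoolDir = new File("/var/flume/spool");

        for (S3ObjectSummary obj : s3.listObjects(bucket).getObjectSummaries()) {
          String name = obj.getKey().replace('/', '_');
          File done = new File(spoolDir, name);
          if (done.exists()) {
            continue;                         // crude "already fetched" check
          }
          // download under a temp name first; the spooling source expects
          // files to be complete and immutable once they appear
          File tmp = new File(spoolDir, name + ".tmp");
          s3.getObject(new GetObjectRequest(bucket, obj.getKey()), tmp);
          tmp.renameTo(done);
        }
      }
    }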


>
> But maybe there is no way of avoiding disk anyway because of Flume's
> checkpointing and other parts that write to disk already?
>
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
>
>
> 2. Use S3 SDK to create InputStream of S3 objects directly in code and
>> create events out of it.
>>
>> Would be great to reuse an existing implementation which is based on
>> InputStream and feed it with S3 object input stream, concern of metadata
>> storage still remains. Most often S3 objects are stored in compressed form,
>> so this source would need to take care of compression gz/avro/others.
>>
>> Best is to start with something that works and then start adding more
>> features to it.
>>
>>
>> On Wed, Aug 6, 2014 at 2:27 AM, Jonathan Natkins <na...@streamsets.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I started trying to write some code on this, and realized there are a
>>> number of issues that need to be discussed in order to really design this
>>> feature effectively. The requirements that have been discussed thus far are:
>>>
>>> 1. Fetching data from S3 periodically
>>> 2. Fetching data from multiple S3 buckets -- This may be something that
>>> should be punted on until later. For a first implementation, this could be
>>> solved just by having multiple sources, each with a single S3 bucket
>>> 3. Associating an S3 bucket with a user/token/key -- *Otis - can you
>>> clarify what you mean by this?*
>>> 4. Dynamically reconfigure the source -- This is blocked by FLUME-1491,
>>> so I think this is out-of-scope for discussions at the moment
>>>
>>> Some questions I want to try to answer:
>>>
>>> 1. How do we identify and track objects that need to be processed versus
>>> objects that have been processed already?
>>> 1a. What about if we want to have multiple sources working against the
>>> same bucket to speed processing?
>>> 2. Is it fair to assume that we're dealing with character files, rather
>>> than binary objects?
>>>
>>>  For the first question, if we ignore the multiple source extension of
>>> the question, I think the simplest answer is to do something on the local
>>> filesystem, like have a tracking directory that contains a list of
>>> to-be-processed objects and a list of already-processed objects. However,
>>> if the source goes down, what should the restart semantics be? It seems
>>> that the ideal situation is to store this state in a system like ZooKeeper,
>>> which would ensure that a number of sources could operate off of the same
>>> bucket, but this probably requires FLUME-1491 first.
>>>
>>> For the second question, my feeling was just that we should work with
>>> similar assumptions to how the SpoolingDirectorySource works, where each
>>> line is a separate event. Does that seem reasonable?
>>>
>>> Thanks,
>>> Natty
>>>
>>>
>>> On Fri, Aug 1, 2014 at 11:31 AM, Paweł <pr...@gmail.com> wrote:
>>>
>>>> Hi,
>>>> Thanks for explanation Jonathan. I think I will also start working on
>>>> it. When you have any patch (even draft) I'd be glad if you can attach it
>>>> in JIRA. I'll do the same.
>>>> What do you think?
>>>>
>>>> --
>>>> Paweł Róg
>>>>
>>>> 2014-08-01 20:19 GMT+02:00 Hari Shreedharan <hs...@cloudera.com>
>>>> :
>>>>
>>>> +1 on an S3 Source. I would gladly review.
>>>>>
>>>>> Jonathan Natkins wrote:
>>>>>
>>>>>
>>>>> Hey Pawel,
>>>>>
>>>>> My intention is to start working on it, but I don't know exactly how
>>>>> long it will take, and I'm not a committer, so time estimates would
>>>>> have to be taken with a grain of salt regardless. If this is something
>>>>> that you need urgently, it may not be ideal to wait for me to start
>>>>> building something for yourself.
>>>>>
>>>>> That said, as mentioned in the other thread, dynamic configuration can
>>>>> be done by refreshing the configuration files across the set of Flume
>>>>> agents. It's certainly not as great as having a single place to change
>>>>> it (e.g. ZooKeeper), but it's a way to get the job done.
>>>>>
>>>>> Thanks,
>>>>> Natty
>>>>>
>>>>>
>>>>> On Fri, Aug 1, 2014 at 1:33 AM, Paweł <prog88@gmail.com
>>>>> <ma...@gmail.com>> wrote:
>>>>>
>>>>>     Hi,
>>>>>     Jonathan how should we interpret your last e-mail? You opened an
>>>>>     JIRA issue and want to start implementing this and do you have any
>>>>>     estimate how long it will take?
>>>>>
>>>>>     I think the biggest challenge here is to have dynamic
>>>>>     configuration of Flume. It doesn't seem to be part of FLUME-2437
>>>>>     issue. Am I right?
>>>>>
>>>>>     > Would you need to be able to pull files from multiple S3
>>>>>     directories with the same source?
>>>>>
>>>>>     I think we don't need to track multiple S3 buckets with a single
>>>>>     source. I just imagine an approach where each S3 source can be
>>>>>     added or deleted on demand and attached to any Channel. I'm only
>>>>>     afraid about this dynamic configuration. I'll open a new thread
>>>>>     about this. It seems we have two totally separate things:
>>>>>     * build S3 source
>>>>>     * make flume configurable dynamically
>>>>>
>>>>>     --
>>>>>     Paweł
>>>>>
>>>>>
>>>>>     2014-08-01 9:51 GMT+02:00 Otis Gospodnetic
>>>>>     <otis.gospodnetic@gmail.com <ma...@gmail.com>>:
>>>>>
>>>>>
>>>>>         Hi,
>>>>>
>>>>>         On Fri, Aug 1, 2014 at 4:52 AM, Jonathan Natkins
>>>>>         <natty@streamsets.com <ma...@streamsets.com>> wrote:
>>>>>
>>>>>             Hey all,
>>>>>
>>>>>             I created a JIRA for this:
>>>>>             https://issues.apache.org/jira/browse/FLUME-2437
>>>>>
>>>>>
>>>>>         Thanks!  Should Fix Version be set to the next Flume release
>>>>>         version?
>>>>>
>>>>>             I thought I'd start working on one myself, which can
>>>>>             hopefully be contributed back. I'm curious: do you have
>>>>>             particular requirements? Based on the emails in this
>>>>>             thread, it sounds like the original goal was to have
>>>>>             something that's like a SpoolDirectorySource that just
>>>>>             picks up new files from S3. Is that accurate?
>>>>>
>>>>>
>>>>>         Yes, I think so.  We need to be able to:
>>>>>         * fetch data (logs for pulling them in Logsene
>>>>>         <http://sematext.com/logsene/>) from S3 periodically (e.g.
>>>>>
>>>>>         every 1 min, every 5 min, etc.)
>>>>>         * fetch data from multiple S3 buckets
>>>>>         * associate an S3 bucket with a user/token/key
>>>>>         * dynamically (i.e. without editing/writing config files
>>>>>         stored on disk) add new S3 buckets from which data should be
>>>>> fetch
>>>>>         * dynamically (i.e. without editing/writing config files
>>>>>         stored on disk) stop fetching data from some S3 buckets
>>>>>
>>>>>
>>>>>             Would you need to be able to pull files from multiple S3
>>>>>             directories with the same source?
>>>>>
>>>>>
>>>>>         I think the above addresses this question.
>>>>>
>>>>>             Thanks,
>>>>>             Natty
>>>>>
>>>>>
>>>>>         Thanks!
>>>>>
>>>>>         Otis
>>>>>         --
>>>>>         Performance Monitoring * Log Analytics * Search Analytics
>>>>>         Solr & Elasticsearch Support * http://sematext.com/
>>>>>
>>>>>
>>>>>
>>>>>             On Thu, Jul 31, 2014 at 4:58 PM, Otis Gospodnetic
>>>>>             <otis.gospodnetic@gmail.com
>>>>>             <ma...@gmail.com>> wrote:
>>>>>
>>>>>                 +1 for seeing S3Source, starting with a JIRA issue.
>>>>>
>>>>>                 But being able to dynamically add/remove S3 buckets
>>>>>                 from which to pull data seems important.
>>>>>
>>>>>                 Any suggestions for how to approach that?
>>>>>
>>>>>                 Otis
>>>>>                 --
>>>>>                 Performance Monitoring * Log Analytics * Search
>>>>> Analytics
>>>>>                 Solr & Elasticsearch Support * http://sematext.com/
>>>>>
>>>>>
>>>>>                 On Thu, Jul 31, 2014 at 9:14 PM, Hari Shreedharan
>>>>>                 <hshreedharan@cloudera.com
>>>>>                 <ma...@cloudera.com>> wrote:
>>>>>
>>>>>                     Please go ahead and file a jira. If you are
>>>>>                     willing to submit a patch, you can post it on the
>>>>>                     jira.
>>>>>
>>>>>                     Viral Bajaria wrote:
>>>>>
>>>>>
>>>>>
>>>>>                     I have a similar use case that cropped up
>>>>>                     yesterday. I saw the archive
>>>>>                     and found that there was a recommendation to
>>>>>                     build it as Sharninder
>>>>>                     suggested.
>>>>>
>>>>>                     For now, I went down the route of writing a
>>>>>                     python script which
>>>>>                     downloads from S3 and puts the files in a
>>>>>                     directory which is
>>>>>                     configured to be picked up via a spooldir.
>>>>>
>>>>>                     I would prefer to get a direct S3 source, and
>>>>>                     maybe we could
>>>>>                     collaborate on it and open-source it. Let me know
>>>>>                     if you prefer that
>>>>>                     and we can work directly on it by creating a JIRA.
>>>>>
>>>>>                     Thanks,
>>>>>                     Viral
>>>>>
>>>>>
>>>>>
>>>>>                     On Thu, Jul 31, 2014 at 10:26 AM, Hari Shreedharan
>>>>>                     <hshreedharan@cloudera.com
>>>>>                     <ma...@cloudera.com>
>>>>>                     <mailto:hshreedharan@cloudera.com
>>>>>
>>>>>                     <ma...@cloudera.com>>> wrote:
>>>>>
>>>>>                         In both cases, Sharninder is right :)
>>>>>
>>>>>                         Sharninder wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>                         As far as I know, there is no (open source)
>>>>>                     implementation of an S3
>>>>>                         source, so yes, you'll have to implement
>>>>>                     your own. You'll have to
>>>>>                         implement a Pollable source and the dev
>>>>>                     documentation has an outline
>>>>>                         that you can use. You can also look at the
>>>>>                     existing Execsource and
>>>>>                         work your way up.
>>>>>
>>>>>                         As far as I know, there is no way to
>>>>>                     configure flume without
>>>>>                         using the
>>>>>                         configuration file.
>>>>>
>>>>>
>>>>>
>>>>>                         On Thu, Jul 31, 2014 at 7:57 PM, Paweł
>>>>>                     <prog88@gmail.com <ma...@gmail.com>
>>>>>                     <mailto:prog88@gmail.com <mailto:prog88@gmail.com
>>>>> >>
>>>>>                     <mailto:prog88@gmail.com
>>>>>                     <ma...@gmail.com>
>>>>>                     <mailto:prog88@gmail.com
>>>>>                     <ma...@gmail.com>>>> wrote:
>>>>>
>>>>>                             Hi,
>>>>>                             I'm wondering if Flume is able to read
>>>>>                     directly from S3.
>>>>>
>>>>>                             I'll describe my case. I have log files
>>>>>                     stored in AWS S3. I have
>>>>>                             to fetch periodically new S3 objects and
>>>>>                     read log lines from it.
>>>>>                             Than use log lines (events) are
>>>>>                     processed in standard flume's way
>>>>>                             (as with other sources).
>>>>>
>>>>>                             *1) Is there any way to fetch S3 objects
>>>>>                     or I have to write
>>>>>                         my own
>>>>>                             Source?*
>>>>>
>>>>>
>>>>>                             There is also second case. I want to
>>>>>                     have flume configuration
>>>>>                             dynamic. Flume sources can change in
>>>>>                     time. New AWS key and S3
>>>>>                             bucket can be added or deleted.
>>>>>
>>>>>                             *2) Is there any other way to configure
>>>>>                     Flume than by static
>>>>>                             configuration file?*
>>>>>
>>>>>                             --
>>>>>                             Paweł Róg
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
>> --
>> thanks
>> ashish
>>
>> Blog: http://www.ashishpaliwal.com/blog
>> My Photo Galleries: http://www.pbase.com/ashishpaliwal
>>
>
>


-- 
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal

Re: AWS S3 flume source

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi,

On Wed, Aug 6, 2014 at 5:04 AM, Ashish <pa...@gmail.com> wrote:

> Sharing some random thoughts
>
> 1. Download the file using S3 SDK and let the SpoolDirectory
> implementation take care of rest. Like a Decorator in front of
> SpoolDirectory
>

My worry is that using SpoolDirectory requires temporary writes to the FS,
and if you are using Flume to process a lot of data, writing large amounts
of data to disk will slow things down quite a bit.

But maybe there is no way of avoiding disk anyway because of Flume's
checkpointing and other parts that write to disk already?

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


2. Use S3 SDK to create InputStream of S3 objects directly in code and
> create events out of it.
>
> Would be great to reuse an existing implementation which is based on
> InputStream and feed it with S3 object input stream, concern of metadata
> storage still remains. Most often S3 objects are stored in compressed form,
> so this source would need to take care of compression gz/avro/others.
>
> Best is to start with something that works and then start adding more
> features to it.
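
A minimal sketch of that option 2, assuming the AWS SDK for Java and
Flume's event API (channel wiring stubbed out, and only gzip handled here;
avro and friends would need their own readers):

    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPInputStream;
    import com.amazonaws.services.s3.AmazonS3;
    import org.apache.flume.Event;
    import org.apache.flume.event.EventBuilder;

    public class S3ObjectStreamer {
      // streams one object straight into events, one per line, no temp file
      public static void stream(AmazonS3 s3, String bucket, String key)
          throws Exception {
        InputStream raw = s3.getObject(bucket, key).getObjectContent();
        if (key.endsWith(".gz")) {
          raw = new GZIPInputStream(raw);   // decompress on the fly
        }
        BufferedReader reader = new BufferedReader(
            new InputStreamReader(raw, StandardCharsets.UTF_8));
        try {
          String line;
          while ((line = reader.readLine()) != null) {
            Event e = EventBuilder.withBody(line.getBytes(StandardCharsets.UTF_8));
            // in a real source: getChannelProcessor().processEvent(e);
          }
        } finally {
          reader.close();
        }
      }
    }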
>
>
> On Wed, Aug 6, 2014 at 2:27 AM, Jonathan Natkins <na...@streamsets.com>
> wrote:
>
>> Hi all,
>>
>> I started trying to write some code on this, and realized there are a
>> number of issues that need to be discussed in order to really design this
>> feature effectively. The requirements that have been discussed thus far are:
>>
>> 1. Fetching data from S3 periodically
>> 2. Fetching data from multiple S3 buckets -- This may be something that
>> should be punted on until later. For a first implementation, this could be
>> solved just by having multiple sources, each with a single S3 bucket
>> 3. Associating an S3 bucket with a user/token/key -- *Otis - can you
>> clarify what you mean by this?*
>> 4. Dynamically reconfigure the source -- This is blocked by FLUME-1491,
>> so I think this is out-of-scope for discussions at the moment
>>
>> Some questions I want to try to answer:
>>
>> 1. How do we identify and track objects that need to be processed versus
>> objects that have been processed already?
>> 1a. What about if we want to have multiple sources working against the
>> same bucket to speed processing?
>> 2. Is it fair to assume that we're dealing with character files, rather
>> than binary objects?
>>
>>  For the first question, if we ignore the multiple source extension of
>> the question, I think the simplest answer is to do something on the local
>> filesystem, like have a tracking directory that contains a list of
>> to-be-processed objects and a list of already-processed objects. However,
>> if the source goes down, what should the restart semantics be? It seems
>> that the ideal situation is to store this state in a system like ZooKeeper,
>> which would ensure that a number of sources could operate off of the same
>> bucket, but this probably requires FLUME-1491 first.
>>
>> For the second question, my feeling was just that we should work with
>> similar assumptions to how the SpoolingDirectorySource works, where each
>> line is a separate event. Does that seem reasonable?
>>
>> Thanks,
>> Natty
>>
>>
>> On Fri, Aug 1, 2014 at 11:31 AM, Paweł <pr...@gmail.com> wrote:
>>
>>> Hi,
>>> Thanks for explanation Jonathan. I think I will also start working on
>>> it. When you have any patch (even draft) I'd be glad if you can attach it
>>> in JIRA. I'll do the same.
>>> What do you think?
>>>
>>> --
>>> Paweł Róg
>>>
>>> 2014-08-01 20:19 GMT+02:00 Hari Shreedharan <hs...@cloudera.com>:
>>>
>>> +1 on an S3 Source. I would gladly review.
>>>>
>>>> Jonathan Natkins wrote:
>>>>
>>>>
>>>> Hey Pawel,
>>>>
>>>> My intention is to start working on it, but I don't know exactly how
>>>> long it will take, and I'm not a committer, so time estimates would
>>>> have to be taken with a grain of salt regardless. If this is something
>>>> that you need urgently, it may not be ideal to wait for me to start
>>>> building something for yourself.
>>>>
>>>> That said, as mentioned in the other thread, dynamic configuration can
>>>> be done by refreshing the configuration files across the set of Flume
>>>> agents. It's certainly not as great as having a single place to change
>>>> it (e.g. ZooKeeper), but it's a way to get the job done.
>>>>
>>>> Thanks,
>>>> Natty
>>>>
>>>>
>>>> On Fri, Aug 1, 2014 at 1:33 AM, Paweł <prog88@gmail.com
>>>> <ma...@gmail.com>> wrote:
>>>>
>>>>     Hi,
>>>>     Jonathan how should we interpret your last e-mail? You opened an
>>>>     JIRA issue and want to start implementing this and do you have any
>>>>     estimate how long it will take?
>>>>
>>>>     I think the biggest challenge here is to have dynamic
>>>>     configuration of Flume. It doesn't seem to be part of FLUME-2437
>>>>     issue. Am I right?
>>>>
>>>>     > Would you need to be able to pull files from multiple S3
>>>>     directories with the same source?
>>>>
>>>>     I think we don't need to track multiple S3 buckets with a single
>>>>     source. I just imagine an approach where each S3 source can be
>>>>     added or deleted on demand and attached to any Channel. I'm only
>>>>     afraid about this dynamic configuration. I'll open a new thread
>>>>     about this. It seems we have two totally separate things:
>>>>     * build S3 source
>>>>     * make flume configurable dynamically
>>>>
>>>>     --
>>>>     Paweł
>>>>
>>>>
>>>>     2014-08-01 9:51 GMT+02:00 Otis Gospodnetic
>>>>     <otis.gospodnetic@gmail.com <ma...@gmail.com>>:
>>>>
>>>>
>>>>         Hi,
>>>>
>>>>         On Fri, Aug 1, 2014 at 4:52 AM, Jonathan Natkins
>>>>         <natty@streamsets.com <ma...@streamsets.com>> wrote:
>>>>
>>>>             Hey all,
>>>>
>>>>             I created a JIRA for this:
>>>>             https://issues.apache.org/jira/browse/FLUME-2437
>>>>
>>>>
>>>>         Thanks!  Should Fix Version be set to the next Flume release
>>>>         version?
>>>>
>>>>             I thought I'd start working on one myself, which can
>>>>             hopefully be contributed back. I'm curious: do you have
>>>>             particular requirements? Based on the emails in this
>>>>             thread, it sounds like the original goal was to have
>>>>             something that's like a SpoolDirectorySource that just
>>>>             picks up new files from S3. Is that accurate?
>>>>
>>>>
>>>>         Yes, I think so.  We need to be able to:
>>>>         * fetch data (logs for pulling them in Logsene
>>>>         <http://sematext.com/logsene/>) from S3 periodically (e.g.
>>>>
>>>>         every 1 min, every 5 min, etc.)
>>>>         * fetch data from multiple S3 buckets
>>>>         * associate an S3 bucket with a user/token/key
>>>>         * dynamically (i.e. without editing/writing config files
>>>>         stored on disk) add new S3 buckets from which data should be
>>>> fetch
>>>>         * dynamically (i.e. without editing/writing config files
>>>>         stored on disk) stop fetching data from some S3 buckets
>>>>
>>>>
>>>>             Would you need to be able to pull files from multiple S3
>>>>             directories with the same source?
>>>>
>>>>
>>>>         I think the above addresses this question.
>>>>
>>>>             Thanks,
>>>>             Natty
>>>>
>>>>
>>>>         Thanks!
>>>>
>>>>         Otis
>>>>         --
>>>>         Performance Monitoring * Log Analytics * Search Analytics
>>>>         Solr & Elasticsearch Support * http://sematext.com/
>>>>
>>>>
>>>>
>>>>             On Thu, Jul 31, 2014 at 4:58 PM, Otis Gospodnetic
>>>>             <otis.gospodnetic@gmail.com
>>>>             <ma...@gmail.com>> wrote:
>>>>
>>>>                 +1 for seeing S3Source, starting with a JIRA issue.
>>>>
>>>>                 But being able to dynamically add/remove S3 buckets
>>>>                 from which to pull data seems important.
>>>>
>>>>                 Any suggestions for how to approach that?
>>>>
>>>>                 Otis
>>>>                 --
>>>>                 Performance Monitoring * Log Analytics * Search
>>>> Analytics
>>>>                 Solr & Elasticsearch Support * http://sematext.com/
>>>>
>>>>
>>>>                 On Thu, Jul 31, 2014 at 9:14 PM, Hari Shreedharan
>>>>                 <hshreedharan@cloudera.com
>>>>                 <ma...@cloudera.com>> wrote:
>>>>
>>>>                     Please go ahead and file a jira. If you are
>>>>                     willing to submit a patch, you can post it on the
>>>>                     jira.
>>>>
>>>>                     Viral Bajaria wrote:
>>>>
>>>>
>>>>
>>>>                     I have a similar use case that cropped up
>>>>                     yesterday. I saw the archive
>>>>                     and found that there was a recommendation to
>>>>                     build it as Sharninder
>>>>                     suggested.
>>>>
>>>>                     For now, I went down the route of writing a
>>>>                     python script which
>>>>                     downloads from S3 and puts the files in a
>>>>                     directory which is
>>>>                     configured to be picked up via a spooldir.
>>>>
>>>>                     I would prefer to get a direct S3 source, and
>>>>                     maybe we could
>>>>                     collaborate on it and open-source it. Let me know
>>>>                     if you prefer that
>>>>                     and we can work directly on it by creating a JIRA.
>>>>
>>>>                     Thanks,
>>>>                     Viral
>>>>
>>>>
>>>>
>>>>                     On Thu, Jul 31, 2014 at 10:26 AM, Hari Shreedharan
>>>>                     <hshreedharan@cloudera.com
>>>>                     <ma...@cloudera.com>
>>>>                     <mailto:hshreedharan@cloudera.com
>>>>
>>>>                     <ma...@cloudera.com>>> wrote:
>>>>
>>>>                         In both cases, Sharninder is right :)
>>>>
>>>>                         Sharninder wrote:
>>>>
>>>>
>>>>
>>>>
>>>>                         As far as I know, there is no (open source)
>>>>                     implementation of an S3
>>>>                         source, so yes, you'll have to implement
>>>>                     your own. You'll have to
>>>>                         implement a Pollable source and the dev
>>>>                     documentation has an outline
>>>>                         that you can use. You can also look at the
>>>>                     existing Execsource and
>>>>                         work your way up.
>>>>
>>>>                         As far as I know, there is no way to
>>>>                     configure flume without
>>>>                         using the
>>>>                         configuration file.
>>>>
>>>>
>>>>
>>>>                         On Thu, Jul 31, 2014 at 7:57 PM, Paweł
>>>>                     <prog88@gmail.com <ma...@gmail.com>
>>>>                     <mailto:prog88@gmail.com <ma...@gmail.com>>
>>>>                     <mailto:prog88@gmail.com
>>>>                     <ma...@gmail.com>
>>>>                     <mailto:prog88@gmail.com
>>>>                     <ma...@gmail.com>>>> wrote:
>>>>
>>>>                             Hi,
>>>>                             I'm wondering if Flume is able to read
>>>>                     directly from S3.
>>>>
>>>>                             I'll describe my case. I have log files
>>>>                     stored in AWS S3. I have
>>>>                             to fetch periodically new S3 objects and
>>>>                     read log lines from it.
>>>>                             Than use log lines (events) are
>>>>                     processed in standard flume's way
>>>>                             (as with other sources).
>>>>
>>>>                             *1) Is there any way to fetch S3 objects
>>>>                     or I have to write
>>>>                         my own
>>>>                             Source?*
>>>>
>>>>
>>>>                             There is also second case. I want to
>>>>                     have flume configuration
>>>>                             dynamic. Flume sources can change in
>>>>                     time. New AWS key and S3
>>>>                             bucket can be added or deleted.
>>>>
>>>>                             *2) Is there any other way to configure
>>>>                     Flume than by static
>>>>                             configuration file?*
>>>>
>>>>                             --
>>>>                             Paweł Róg
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
>
> --
> thanks
> ashish
>
> Blog: http://www.ashishpaliwal.com/blog
> My Photo Galleries: http://www.pbase.com/ashishpaliwal
>

Re: AWS S3 flume source

Posted by Jonathan Natkins <na...@streamsets.com>.
Adding the dev list to the discussion


On Wed, Aug 6, 2014 at 9:37 AM, Jonathan Natkins <na...@streamsets.com>
wrote:

> Ashish, I've put some comments inline.
>
>
> On Tuesday, August 5, 2014, Ashish <pa...@gmail.com> wrote:
>
>> Sharing some random thoughts
>>
>> 1. Download the file using S3 SDK and let the SpoolDirectory
>> implementation take care of rest. Like a Decorator in front of
>> SpoolDirectory
>>
>> This works for the simple case, but I don't think this is an ideal
> solution. My primary concern is that S3's max file size is 5TB, so
> downloading the object to local disk may not be possible.
>
>
>> 2. Use S3 SDK to create InputStream of S3 objects directly in code and
>> create events out of it.
>>
>> Would be great to reuse an existing implementation which is based on
>> InputStream and feed it with S3 object input stream, concern of metadata
>> storage still remains. Most often S3 objects are stored in compressed form,
>> so this source would need to take care of compression gz/avro/others.
>>
>> Best is to start with something that works and then start adding more
>> features to it.
>>
>>
>> On Wed, Aug 6, 2014 at 2:27 AM, Jonathan Natkins <na...@streamsets.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I started trying to write some code on this, and realized there are a
>>> number of issues that need to be discussed in order to really design this
>>> feature effectively. The requirements that have been discussed thus far are:
>>>
>>> 1. Fetching data from S3 periodically
>>> 2. Fetching data from multiple S3 buckets -- This may be something that
>>> should be punted on until later. For a first implementation, this could be
>>> solved just by having multiple sources, each with a single S3 bucket
>>> 3. Associating an S3 bucket with a user/token/key -- *Otis - can you
>>> clarify what you mean by this?*
>>> 4. Dynamically reconfigure the source -- This is blocked by FLUME-1491,
>>> so I think this is out-of-scope for discussions at the moment
>>>
>>> Some questions I want to try to answer:
>>>
>>> 1. How do we identify and track objects that need to be processed versus
>>> objects that have been processed already?
>>> 1a. What about if we want to have multiple sources working against the
>>> same bucket to speed processing?
>>> 2. Is it fair to assume that we're dealing with character files, rather
>>> than binary objects?
>>>
>>>  For the first question, if we ignore the multiple source extension of
>>> the question, I think the simplest answer is to do something on the local
>>> filesystem, like have a tracking directory that contains a list of
>>> to-be-processed objects and a list of already-processed objects. However,
>>> if the source goes down, what should the restart semantics be? It seems
>>> that the ideal situation is to store this state in a system like ZooKeeper,
>>> which would ensure that a number of sources could operate off of the same
>>> bucket, but this probably requires FLUME-1491 first.
>>>
>>> For the second question, my feeling was just that we should work with
>>> similar assumptions to how the SpoolingDirectorySource works, where each
>>> line is a separate event. Does that seem reasonable?
>>>
>>> Thanks,
>>> Natty
>>>
>>>
>>> On Fri, Aug 1, 2014 at 11:31 AM, Paweł <pr...@gmail.com> wrote:
>>>
>>>> Hi,
>>>> Thanks for explanation Jonathan. I think I will also start working on
>>>> it. When you have any patch (even draft) I'd be glad if you can attach it
>>>> in JIRA. I'll do the same.
>>>> What do you think?
>>>>
>>>> --
>>>> Paweł Róg
>>>>
>>>> 2014-08-01 20:19 GMT+02:00 Hari Shreedharan <hs...@cloudera.com>
>>>> :
>>>>
>>>> +1 on an S3 Source. I would gladly review.
>>>>>
>>>>> Jonathan Natkins wrote:
>>>>>
>>>>>
>>>>> Hey Pawel,
>>>>>
>>>>> My intention is to start working on it, but I don't know exactly how
>>>>> long it will take, and I'm not a committer, so time estimates would
>>>>> have to be taken with a grain of salt regardless. If this is something
>>>>> that you need urgently, it may not be ideal to wait for me to start
>>>>> building something for yourself.
>>>>>
>>>>> That said, as mentioned in the other thread, dynamic configuration can
>>>>> be done by refreshing the configuration files across the set of Flume
>>>>> agents. It's certainly not as great as having a single place to change
>>>>> it (e.g. ZooKeeper), but it's a way to get the job done.
>>>>>
>>>>> Thanks,
>>>>> Natty
>>>>>
>>>>>
>>>>> On Fri, Aug 1, 2014 at 1:33 AM, Paweł <prog88@gmail.com
>>>>> <ma...@gmail.com>> wrote:
>>>>>
>>>>>     Hi,
>>>>>     Jonathan how should we interpret your last e-mail? You opened an
>>>>>     JIRA issue and want to start implementing this and do you have any
>>>>>     estimate how long it will take?
>>>>>
>>>>>     I think the biggest challenge here is to have dynamic
>>>>>     configuration of Flume. It doesn't seem to be part of FLUME-2437
>>>>>     issue. Am I right?
>>>>>
>>>>>     > Would you need to be able to pull files from multiple S3
>>>>>     directories with the same source?
>>>>>
>>>>>     I think we don't need to track multiple S3 buckets with a single
>>>>>     source. I just imagine an approach where each S3 source can be
>>>>>     added or deleted on demand and attached to any Channel. I'm only
>>>>>     afraid about this dynamic configuration. I'll open a new thread
>>>>>     about this. It seems we have two totally separate things:
>>>>>     * build S3 source
>>>>>     * make flume configurable dynamically
>>>>>
>>>>>     --
>>>>>     Paweł
>>>>>
>>>>>
>>>>>     2014-08-01 9:51 GMT+02:00 Otis Gospodnetic
>>>>>     <otis.gospodnetic@gmail.com <ma...@gmail.com>>:
>>>>>
>>>>>
>>>>>         Hi,
>>>>>
>>>>>         On Fri, Aug 1, 2014 at 4:52 AM, Jonathan Natkins
>>>>>         <natty@streamsets.com <ma...@streamsets.com>> wrote:
>>>>>
>>>>>             Hey all,
>>>>>
>>>>>             I created a JIRA for this:
>>>>>             https://issues.apache.org/jira/browse/FLUME-2437
>>>>>
>>>>>
>>>>>         Thanks!  Should Fix Version be set to the next Flume release
>>>>>         version?
>>>>>
>>>>>             I thought I'd start working on one myself, which can
>>>>>             hopefully be contributed back. I'm curious: do you have
>>>>>             particular requirements? Based on the emails in this
>>>>>             thread, it sounds like the original goal was to have
>>>>>             something that's like a SpoolDirectorySource that just
>>>>>             picks up new files from S3. Is that accurate?
>>>>>
>>>>>
>>>>>         Yes, I think so.  We need to be able to:
>>>>>         * fetch data (logs for pulling them in Logsene
>>>>>         <http://sematext.com/logsene/>) from S3 periodically (e.g.
>>>>>
>>>>>         every 1 min, every 5 min, etc.)
>>>>>         * fetch data from multiple S3 buckets
>>>>>         * associate an S3 bucket with a user/token/key
>>>>>         * dynamically (i.e. without editing/writing config files
>>>>>         stored on disk) add new S3 buckets from which data should be
>>>>> fetch
>>>>>         * dynamically (i.e. without editing/writing config files
>>>>>         stored on disk) stop fetching data from some S3 buckets
>>>>>
>>>>>
>>>>>             Would you need to be able to pull files from multiple S3
>>>>>             directories with the same source?
>>>>>
>>>>>
>>>>>         I think the above addresses this question.
>>>>>
>>>>>             Thanks,
>>>>>             Natty
>>>>>
>>>>>
>>>>>         Thanks!
>>>>>
>>>>>         Otis
>>>>>         --
>>>>>         Performance Monitoring * Log Analytics * Search Analytics
>>>>>         Solr & Elasticsearch Support * http://sematext.com/
>>>>>
>>>>>
>>>>>
>>>>>             On Thu, Jul 31, 2014 at 4:58 PM, Otis Gospodnetic
>>>>>             <otis.gospodnetic@gmail.com
>>>>>             <ma...@gmail.com>> wrote:
>>>>>
>>>>>                 +1 for seeing S3Source, starting with a JIRA issue.
>>>>>
>>>>>                 But being able to dynamically add/remove S3 buckets
>>>>>                 from which to pull data seems important.
>>>>>
>>>>>                 Any suggestions for how to approach that?
>>>>>
>>>>>                 Otis
>>>>>                 --
>>>>>                 Performance Monitoring * Log Analytics * Search
>>>>> Analytics
>>>>>                 Solr & Elasticsearch Support * http://sematext.com/
>>>>>
>>>>>
>>>>>                 On Thu, Jul 31, 2014 at 9:14 PM, Hari Shreedharan
>>>>>                 <hshreedharan@cloudera.com
>>>>>                 <ma...@cloudera.com>> wrote:
>>>>>
>>>>>                     Please go ahead and file a jira. If you are
>>>>>                     willing to submit a patch, you can post it on the
>>>>>                     jira.
>>>>>
>>>>>                     Viral Bajaria wrote:
>>>>>
>>>>>
>>>>>
>>>>>                     I have a similar use case that cropped up
>>>>>                     yesterday. I saw the archive
>>>>>                     and found that there was a recommendation to
>>>>>                     build it as Sharninder
>>>>>                     suggested.
>>>>>
>>>>>                     For now, I went down the route of writing a
>>>>>                     python script which
>>>>>                     downloads from S3 and puts the files in a
>>>>>                     directory which is
>>>>>                     configured to be picked up via a spooldir.
>>>>>
>>>>>                     I would prefer to get a direct S3 source, and
>>>>>                     maybe we could
>>>>>                     collaborate on it and open-source it. Let me know
>>>>>                     if you prefer that
>>>>>                     and we can work directly on it by creating a JIRA.
>>>>>
>>>>>                     Thanks,
>>>>>                     Viral
>>>>>
>>>>>
>>>>>
>>>>>                     On Thu, Jul 31, 2014 at 10:26 AM, Hari Shreedharan
>>>>>                     <hshreedharan@cloudera.com
>>>>>                     <ma...@cloudera.com>
>>>>>                     <mailto:hshreedharan@cloudera.com
>>>>>
>>>>>                     <ma...@cloudera.com>>> wrote:
>>>>>
>>>>>                         In both cases, Sharninder is right :)
>>>>>
>>>>>                         Sharninder wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>                         As far as I know, there is no (open source)
>>>>>                     implementation of an S3
>>>>>                         source, so yes, you'll have to implement
>>>>>                     your own. You'll have to
>>>>>                         implement a Pollable source and the dev
>>>>>                     documentation has an outline
>>>>>                         that you can use. You can also look at the
>>>>>                     existing Execsource and
>>>>>                         work your way up.
>>>>>
>>>>>                         As far as I know, there is no way to
>>>>>                     configure flume without
>>>>>                         using the
>>>>>                         configuration file.
>>>>>
>>>>>
>>>>>
>>>>>                         On Thu, Jul 31, 2014 at 7:57 PM, Paweł
>>>>>                     <prog88@gmail.com <ma...@gmail.com>
>>>>>                     <mailto:prog88@gmail.com <mailto:prog88@gmail.com
>>>>> >>
>>>>>                     <mailto:prog88@gmail.com
>>>>>                     <ma...@gmail.com>
>>>>>                     <mailto:prog88@gmail.com
>>>>>                     <ma...@gmail.com>>>> wrote:
>>>>>
>>>>>                             Hi,
>>>>>                             I'm wondering if Flume is able to read
>>>>>                     directly from S3.
>>>>>
>>>>>                             I'll describe my case. I have log files
>>>>>                     stored in AWS S3. I have
>>>>>                             to fetch periodically new S3 objects and
>>>>>                     read log lines from it.
>>>>>                             Than use log lines (events) are
>>>>>                     processed in standard flume's way
>>>>>                             (as with other sources).
>>>>>
>>>>>                             *1) Is there any way to fetch S3 objects
>>>>>                     or I have to write
>>>>>                         my own
>>>>>                             Source?*
>>>>>
>>>>>
>>>>>                             There is also second case. I want to
>>>>>                     have flume configuration
>>>>>                             dynamic. Flume sources can change in
>>>>>                     time. New AWS key and S3
>>>>>                             bucket can be added or deleted.
>>>>>
>>>>>                             *2) Is there any other way to configure
>>>>>                     Flume than by static
>>>>>                             configuration file?*
>>>>>
>>>>>                             --
>>>>>                             Paweł Róg
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
>> --
>> thanks
>> ashish
>>
>> Blog: http://www.ashishpaliwal.com/blog
>> My Photo Galleries: http://www.pbase.com/ashishpaliwal
>>
>

Re: AWS S3 flume source

Posted by Jonathan Natkins <na...@streamsets.com>.
Ashish, I've put some comments inline.

On Tuesday, August 5, 2014, Ashish <pa...@gmail.com> wrote:

> Sharing some random thoughts
>
> 1. Download the file using the S3 SDK and let the SpoolDirectory
> implementation take care of the rest, like a Decorator in front of
> SpoolDirectory
>
This works for the simple case, but I don't think it is an ideal
solution. My primary concern is that S3's maximum object size is 5TB, so
downloading an object to local disk may not always be possible.
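
One possible mitigation, for what it's worth: S3 supports ranged GETs, so a
source could stream a very large object in bounded chunks instead of landing
it on disk first. A minimal sketch against the 2014-era AWS SDK for Java;
the bucket, key, and chunk size are illustrative, not part of any proposal:

    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.GetObjectRequest;
    import com.amazonaws.services.s3.model.S3Object;
    import java.io.InputStream;

    public class RangedRead {
        // Stream a large object in fixed-size chunks instead of downloading it whole.
        static void readInChunks(AmazonS3Client s3, String bucket, String key,
                                 long size) throws Exception {
            long chunk = 64L * 1024 * 1024;                   // 64 MB per request
            for (long start = 0; start < size; start += chunk) {
                long end = Math.min(start + chunk, size) - 1; // byte ranges are inclusive
                S3Object part = s3.getObject(
                        new GetObjectRequest(bucket, key).withRange(start, end));
                InputStream in = part.getObjectContent();
                // ... read and process this chunk's bytes here ...
                in.close();
            }
        }
    }

A real source would still need to stitch lines that span chunk boundaries.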


> 2. Use the S3 SDK to open an InputStream over S3 objects directly in code
> and create events out of it.
>
> It would be great to reuse an existing implementation that is based on an
> InputStream and feed it the S3 object's input stream; the concern about
> metadata storage still remains. S3 objects are most often stored in
> compressed form, so this source would need to take care of compression
> (gz/avro/others).
>
> It's best to start with something that works and then add more features
> to it.
>
>
> On Wed, Aug 6, 2014 at 2:27 AM, Jonathan Natkins <natty@streamsets.com> wrote:
>
>> Hi all,
>>
>> I started trying to write some code on this, and realized there are a
>> number of issues that need to be discussed in order to really design this
>> feature effectively. The requirements that have been discussed thus far are:
>>
>> 1. Fetching data from S3 periodically
>> 2. Fetching data from multiple S3 buckets -- This may be something that
>> should be punted on until later. For a first implementation, this could be
>> solved just by having multiple sources, each with a single S3 bucket
>> 3. Associating an S3 bucket with a user/token/key -- *Otis - can you
>> clarify what you mean by this?*
>> 4. Dynamically reconfigure the source -- This is blocked by FLUME-1491,
>> so I think this is out-of-scope for discussions at the moment
>>
>> Some questions I want to try to answer:
>>
>> 1. How do we identify and track objects that need to be processed versus
>> objects that have been processed already?
>> 1a. What if we want to have multiple sources working against the
>> same bucket to speed up processing?
>> 2. Is it fair to assume that we're dealing with character files, rather
>> than binary objects?
>>
>>  For the first question, if we ignore the multiple source extension of
>> the question, I think the simplest answer is to do something on the local
>> filesystem, like have a tracking directory that contains a list of
>> to-be-processed objects and a list of already-processed objects. However,
>> if the source goes down, what should the restart semantics be? It seems
>> that the ideal situation is to store this state in a system like ZooKeeper,
>> which would ensure that a number of sources could operate off of the same
>> bucket, but this probably requires FLUME-1491 first.
>>
>> For the second question, my feeling was just that we should work with
>> similar assumptions to how the SpoolingDirectorySource works, where each
>> line is a separate event. Does that seem reasonable?
>>
>> Thanks,
>> Natty
>>
>>
>> On Fri, Aug 1, 2014 at 11:31 AM, Paweł <prog88@gmail.com> wrote:
>>
>>> Hi,
>>> Thanks for the explanation, Jonathan. I think I will also start working on
>>> it. When you have any patch (even draft) I'd be glad if you can attach it
>>> in JIRA. I'll do the same.
>>> What do you think?
>>>
>>> --
>>> Paweł Róg
>>>
>>> 2014-08-01 20:19 GMT+02:00 Hari Shreedharan <hshreedharan@cloudera.com>:
>>>
>>> +1 on an S3 Source. I would gladly review.
>>>>
>>>> Jonathan Natkins wrote:
>>>>
>>>>
>>>> Hey Pawel,
>>>>
>>>> My intention is to start working on it, but I don't know exactly how
>>>> long it will take, and I'm not a committer, so time estimates would
>>>> have to be taken with a grain of salt regardless. If this is something
>>>> that you need urgently, it may not be ideal to wait for me to start
>>>> building something for yourself.
>>>>
>>>> That said, as mentioned in the other thread, dynamic configuration can
>>>> be done by refreshing the configuration files across the set of Flume
>>>> agents. It's certainly not as great as having a single place to change
>>>> it (e.g. ZooKeeper), but it's a way to get the job done.
>>>>
>>>> Thanks,
>>>> Natty
>>>>
>>>>
>>>> On Fri, Aug 1, 2014 at 1:33 AM, Paweł <prog88@gmail.com> wrote:
>>>>
>>>>     Hi,
>>>>     Jonathan, how should we interpret your last e-mail? You opened a
>>>>     JIRA issue and want to start implementing this; do you have any
>>>>     estimate of how long it will take?
>>>>
>>>>     I think the biggest challenge here is to have dynamic
>>>>     configuration of Flume. It doesn't seem to be part of FLUME-2437
>>>>     issue. Am I right?
>>>>
>>>>     > Would you need to be able to pull files from multiple S3
>>>>     directories with the same source?
>>>>
>>>>     I think we don't need to track multiple S3 buckets with a single
>>>>     source. I just imagine an approach where each S3 source can be
>>>>     added or deleted on demand and attached to any Channel. I'm only
>>>>     afraid about this dynamic configuration. I'll open a new thread
>>>>     about this. It seems we have two totally separate things:
>>>>     * build S3 source
>>>>     * make flume configurable dynamically
>>>>
>>>>     --
>>>>     Paweł
>>>>
>>>>
>>>>     2014-08-01 9:51 GMT+02:00 Otis Gospodnetic
>>>>     <otis.gospodnetic@gmail.com>:
>>>>
>>>>
>>>>         Hi,
>>>>
>>>>         On Fri, Aug 1, 2014 at 4:52 AM, Jonathan Natkins
>>>>         <natty@streamsets.com> wrote:
>>>>
>>>>             Hey all,
>>>>
>>>>             I created a JIRA for this:
>>>>             https://issues.apache.org/jira/browse/FLUME-2437
>>>>
>>>>
>>>>         Thanks!  Should Fix Version be set to the next Flume release
>>>>         version?
>>>>
>>>>             I thought I'd start working on one myself, which can
>>>>             hopefully be contributed back. I'm curious: do you have
>>>>             particular requirements? Based on the emails in this
>>>>             thread, it sounds like the original goal was to have
>>>>             something that's like a SpoolDirectorySource that just
>>>>             picks up new files from S3. Is that accurate?
>>>>
>>>>
>>>>         Yes, I think so.  We need to be able to:
>>>>         * fetch data (logs for pulling them in Logsene
>>>>         <http://sematext.com/logsene/>) from S3 periodically (e.g.
>>>>
>>>>         every 1 min, every 5 min, etc.)
>>>>         * fetch data from multiple S3 buckets
>>>>         * associate an S3 bucket with a user/token/key
>>>>         * dynamically (i.e. without editing/writing config files
>>>>         stored on disk) add new S3 buckets from which data should be
>>>> fetched
>>>>         * dynamically (i.e. without editing/writing config files
>>>>         stored on disk) stop fetching data from some S3 buckets
>>>>
>>>>
>>>>             Would you need to be able to pull files from multiple S3
>>>>             directories with the same source?
>>>>
>>>>
>>>>         I think the above addresses this question.
>>>>
>>>>             Thanks,
>>>>             Natty
>>>>
>>>>
>>>>         Thanks!
>>>>
>>>>         Otis
>>>>         --
>>>>         Performance Monitoring * Log Analytics * Search Analytics
>>>>         Solr & Elasticsearch Support * http://sematext.com/
>>>>
>>>>
>>>>
>>>>             On Thu, Jul 31, 2014 at 4:58 PM, Otis Gospodnetic
>>>>             <otis.gospodnetic@gmail.com> wrote:
>>>>
>>>>                 +1 for seeing S3Source, starting with a JIRA issue.
>>>>
>>>>                 But being able to dynamically add/remove S3 buckets
>>>>                 from which to pull data seems important.
>>>>
>>>>                 Any suggestions for how to approach that?
>>>>
>>>>                 Otis
>>>>                 --
>>>>                 Performance Monitoring * Log Analytics * Search
>>>> Analytics
>>>>                 Solr & Elasticsearch Support * http://sematext.com/
>>>>
>>>>
>>>>                 On Thu, Jul 31, 2014 at 9:14 PM, Hari Shreedharan
>>>>                 <hshreedharan@cloudera.com> wrote:
>>>>
>>>>                     Please go ahead and file a jira. If you are
>>>>                     willing to submit a patch, you can post it on the
>>>>                     jira.
>>>>
>>>>                     Viral Bajaria wrote:
>>>>
>>>>
>>>>
>>>>                     I have a similar use case that cropped up
>>>>                     yesterday. I saw the archive
>>>>                     and found that there was a recommendation to
>>>>                     build it as Sharninder
>>>>                     suggested.
>>>>
>>>>                     For now, I went down the route of writing a
>>>>                     python script which
>>>>                     downloads from S3 and puts the files in a
>>>>                     directory which is
>>>>                     configured to be picked up via a spooldir.
>>>>
>>>>                     I would prefer to get a direct S3 source, and
>>>>                     maybe we could
>>>>                     collaborate on it and open-source it. Let me know
>>>>                     if you prefer that
>>>>                     and we can work directly on it by creating a JIRA.
>>>>
>>>>                     Thanks,
>>>>                     Viral
>>>>
>>>>
>>>>
>>>>                     On Thu, Jul 31, 2014 at 10:26 AM, Hari Shreedharan
>>>>                     <hshreedharan@cloudera.com> wrote:
>>>>
>>>>                         In both cases, Sharninder is right :)
>>>>
>>>>                         Sharninder wrote:
>>>>
>>>>
>>>>
>>>>
>>>>                         As far as I know, there is no (open source)
>>>>                     implementation of an S3
>>>>                         source, so yes, you'll have to implement
>>>>                     your own. You'll have to
>>>>                         implement a Pollable source and the dev
>>>>                     documentation has an outline
>>>>                         that you can use. You can also look at the
>>>>                     existing Execsource and
>>>>                         work your way up.
>>>>
>>>>                         As far as I know, there is no way to
>>>>                     configure flume without
>>>>                         using the
>>>>                         configuration file.
>>>>
>>>>
>>>>
>>>>                         On Thu, Jul 31, 2014 at 7:57 PM, Paweł
>>>>                     <prog88@gmail.com> wrote:
>>>>
>>>>                             Hi,
>>>>                             I'm wondering if Flume is able to read
>>>>                     directly from S3.
>>>>
>>>>                             I'll describe my case. I have log files
>>>>                     stored in AWS S3. I have
>>>>                             to fetch periodically new S3 objects and
>>>>                     read log lines from it.
>>>>                             Then the log lines (events) are
>>>>                     processed in the standard Flume way
>>>>                             (as with other sources).
>>>>
>>>>                             *1) Is there any way to fetch S3 objects
>>>>                     or I have to write
>>>>                         my own
>>>>                             Source?*
>>>>
>>>>
>>>>                             There is also second case. I want to
>>>>                     have flume configuration
>>>>                             dynamic. Flume sources can change in
>>>>                     time. New AWS key and S3
>>>>                             bucket can be added or deleted.
>>>>
>>>>                             *2) Is there any other way to configure
>>>>                     Flume than by static
>>>>                             configuration file?*
>>>>
>>>>                             --
>>>>                             Paweł Róg
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
>
> --
> thanks
> ashish
>
> Blog: http://www.ashishpaliwal.com/blog
> My Photo Galleries: http://www.pbase.com/ashishpaliwal
>

Re: AWS S3 flume source

Posted by Ashish <pa...@gmail.com>.
Sharing some random thoughts

1. Download the file using the S3 SDK and let the SpoolDirectory
implementation take care of the rest, like a Decorator in front of the
SpoolDirectory source.
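
A minimal sketch of option 1, assuming the AWS SDK for Java; the bucket
name, credentials, and spool path are placeholders, and the temp-file
rename is there so the spooling source never sees a partially written file:

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.S3ObjectSummary;
    import java.io.File;
    import java.io.InputStream;
    import java.nio.file.Files;

    public class S3ToSpoolDir {
        public static void main(String[] args) throws Exception {
            AmazonS3Client s3 = new AmazonS3Client(
                    new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY")); // placeholders
            File spoolDir = new File("/var/flume/spool"); // watched by SpoolDirectorySource
            // NB: real code must paginate listObjects and skip already-fetched keys
            for (S3ObjectSummary s : s3.listObjects("my-log-bucket").getObjectSummaries()) {
                File tmp = new File(spoolDir, s.getKey().replace('/', '_') + ".tmp");
                InputStream in = s3.getObject("my-log-bucket", s.getKey()).getObjectContent();
                Files.copy(in, tmp.toPath()); // stream the object down to local disk
                in.close();
                // rename so the spooling source only ever sees complete files
                tmp.renameTo(new File(spoolDir, s.getKey().replace('/', '_')));
            }
        }
    }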

2. Use the S3 SDK to open an InputStream over S3 objects directly in code
and create events out of it.

It would be great to reuse an existing implementation that is based on an
InputStream and feed it the S3 object's input stream; the concern about
metadata storage still remains. S3 objects are most often stored in
compressed form, so this source would need to take care of compression
(gz/avro/others).
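
A sketch of option 2 in the same hedged spirit, covering the gzip case;
one event per line mirrors the SpoolDirectorySource convention (the class
and method names here are illustrative, not a proposed API):

    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.zip.GZIPInputStream;
    import org.apache.flume.Event;
    import org.apache.flume.event.EventBuilder;

    public class S3ObjectReader {
        /** Turn one S3 object stream into Flume events, one event per line. */
        public static List<Event> readEvents(InputStream raw, boolean gzipped)
                throws Exception {
            InputStream in = gzipped ? new GZIPInputStream(raw) : raw; // compressed objects
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(in, StandardCharsets.UTF_8));
            List<Event> events = new ArrayList<Event>();
            String line;
            while ((line = reader.readLine()) != null) {
                events.add(EventBuilder.withBody(line, StandardCharsets.UTF_8));
            }
            reader.close();
            return events;
        }
    }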

It's best to start with something that works and then add more features
to it.


On Wed, Aug 6, 2014 at 2:27 AM, Jonathan Natkins <na...@streamsets.com>
wrote:

> Hi all,
>
> I started trying to write some code on this, and realized there are a
> number of issues that need to be discussed in order to really design this
> feature effectively. The requirements that have been discussed thus far are:
>
> 1. Fetching data from S3 periodically
> 2. Fetching data from multiple S3 buckets -- This may be something that
> should be punted on until later. For a first implementation, this could be
> solved just by having multiple sources, each with a single S3 bucket
> 3. Associating an S3 bucket with a user/token/key -- *Otis - can you
> clarify what you mean by this?*
> 4. Dynamically reconfigure the source -- This is blocked by FLUME-1491, so
> I think this is out-of-scope for discussions at the moment
>
> Some questions I want to try to answer:
>
> 1. How do we identify and track objects that need to be processed versus
> objects that have been processed already?
> 1a. What if we want to have multiple sources working against the
> same bucket to speed up processing?
> 2. Is it fair to assume that we're dealing with character files, rather
> than binary objects?
>
>  For the first question, if we ignore the multiple source extension of
> the question, I think the simplest answer is to do something on the local
> filesystem, like have a tracking directory that contains a list of
> to-be-processed objects and a list of already-processed objects. However,
> if the source goes down, what should the restart semantics be? It seems
> that the ideal situation is to store this state in a system like ZooKeeper,
> which would ensure that a number of sources could operate off of the same
> bucket, but this probably requires FLUME-1491 first.
>
> For the second question, my feeling was just that we should work with
> similar assumptions to how the SpoolingDirectorySource works, where each
> line is a separate event. Does that seem reasonable?
>
> Thanks,
> Natty
>
>
> On Fri, Aug 1, 2014 at 11:31 AM, Paweł <pr...@gmail.com> wrote:
>
>> Hi,
>> Thanks for the explanation, Jonathan. I think I will also start working on it.
>> When you have any patch (even draft) I'd be glad if you can attach it in
>> JIRA. I'll do the same.
>> What do you think?
>>
>> --
>> Paweł Róg
>>
>> 2014-08-01 20:19 GMT+02:00 Hari Shreedharan <hs...@cloudera.com>:
>>
>> +1 on an S3 Source. I would gladly review.
>>>
>>> Jonathan Natkins wrote:
>>>
>>>
>>> Hey Pawel,
>>>
>>> My intention is to start working on it, but I don't know exactly how
>>> long it will take, and I'm not a committer, so time estimates would
>>> have to be taken with a grain of salt regardless. If this is something
>>> that you need urgently, it may not be ideal to wait for me to start
>>> building something for yourself.
>>>
>>> That said, as mentioned in the other thread, dynamic configuration can
>>> be done by refreshing the configuration files across the set of Flume
>>> agents. It's certainly not as great as having a single place to change
>>> it (e.g. ZooKeeper), but it's a way to get the job done.
>>>
>>> Thanks,
>>> Natty
>>>
>>>
>>> On Fri, Aug 1, 2014 at 1:33 AM, Paweł <prog88@gmail.com
>>> <ma...@gmail.com>> wrote:
>>>
>>>     Hi,
>>>     Jonathan, how should we interpret your last e-mail? You opened a
>>>     JIRA issue and want to start implementing this; do you have any
>>>     estimate of how long it will take?
>>>
>>>     I think the biggest challenge here is to have dynamic
>>>     configuration of Flume. It doesn't seem to be part of FLUME-2437
>>>     issue. Am I right?
>>>
>>>     > Would you need to be able to pull files from multiple S3
>>>     directories with the same source?
>>>
>>>     I think we don't need to track multiple S3 buckets with a single
>>>     source. I just imagine an approach where each S3 source can be
>>>     added or deleted on demand and attached to any Channel. I'm only
>>>     afraid about this dynamic configuration. I'll open a new thread
>>>     about this. It seems we have two totally separate things:
>>>     * build S3 source
>>>     * make flume configurable dynamically
>>>
>>>     --
>>>     Paweł
>>>
>>>
>>>     2014-08-01 9:51 GMT+02:00 Otis Gospodnetic
>>>     <otis.gospodnetic@gmail.com <ma...@gmail.com>>:
>>>
>>>
>>>         Hi,
>>>
>>>         On Fri, Aug 1, 2014 at 4:52 AM, Jonathan Natkins
>>>         <natty@streamsets.com <ma...@streamsets.com>> wrote:
>>>
>>>             Hey all,
>>>
>>>             I created a JIRA for this:
>>>             https://issues.apache.org/jira/browse/FLUME-2437
>>>
>>>
>>>         Thanks!  Should Fix Version be set to the next Flume release
>>>         version?
>>>
>>>             I thought I'd start working on one myself, which can
>>>             hopefully be contributed back. I'm curious: do you have
>>>             particular requirements? Based on the emails in this
>>>             thread, it sounds like the original goal was to have
>>>             something that's like a SpoolDirectorySource that just
>>>             picks up new files from S3. Is that accurate?
>>>
>>>
>>>         Yes, I think so.  We need to be able to:
>>>         * fetch data (logs for pulling them in Logsene
>>>         <http://sematext.com/logsene/>) from S3 periodically (e.g.
>>>
>>>         every 1 min, every 5 min, etc.)
>>>         * fetch data from multiple S3 buckets
>>>         * associate an S3 bucket with a user/token/key
>>>         * dynamically (i.e. without editing/writing config files
>>>         stored on disk) add new S3 buckets from which data should be
>>> fetched
>>>         * dynamically (i.e. without editing/writing config files
>>>         stored on disk) stop fetching data from some S3 buckets
>>>
>>>
>>>             Would you need to be able to pull files from multiple S3
>>>             directories with the same source?
>>>
>>>
>>>         I think the above addresses this question.
>>>
>>>             Thanks,
>>>             Natty
>>>
>>>
>>>         Thanks!
>>>
>>>         Otis
>>>         --
>>>         Performance Monitoring * Log Analytics * Search Analytics
>>>         Solr & Elasticsearch Support * http://sematext.com/
>>>
>>>
>>>
>>>             On Thu, Jul 31, 2014 at 4:58 PM, Otis Gospodnetic
>>>             <otis.gospodnetic@gmail.com
>>>             <ma...@gmail.com>> wrote:
>>>
>>>                 +1 for seeing S3Source, starting with a JIRA issue.
>>>
>>>                 But being able to dynamically add/remove S3 buckets
>>>                 from which to pull data seems important.
>>>
>>>                 Any suggestions for how to approach that?
>>>
>>>                 Otis
>>>                 --
>>>                 Performance Monitoring * Log Analytics * Search Analytics
>>>                 Solr & Elasticsearch Support * http://sematext.com/
>>>
>>>
>>>                 On Thu, Jul 31, 2014 at 9:14 PM, Hari Shreedharan
>>>                 <hshreedharan@cloudera.com
>>>                 <ma...@cloudera.com>> wrote:
>>>
>>>                     Please go ahead and file a jira. If you are
>>>                     willing to submit a patch, you can post it on the
>>>                     jira.
>>>
>>>                     Viral Bajaria wrote:
>>>
>>>
>>>
>>>                     I have a similar use case that cropped up
>>>                     yesterday. I saw the archive
>>>                     and found that there was a recommendation to
>>>                     build it as Sharninder
>>>                     suggested.
>>>
>>>                     For now, I went down the route of writing a
>>>                     python script which
>>>                     downloads from S3 and puts the files in a
>>>                     directory which is
>>>                     configured to be picked up via a spooldir.
>>>
>>>                     I would prefer to get a direct S3 source, and
>>>                     maybe we could
>>>                     collaborate on it and open-source it. Let me know
>>>                     if you prefer that
>>>                     and we can work directly on it by creating a JIRA.
>>>
>>>                     Thanks,
>>>                     Viral
>>>
>>>
>>>
>>>                     On Thu, Jul 31, 2014 at 10:26 AM, Hari Shreedharan
>>>                     <hshreedharan@cloudera.com
>>>                     <ma...@cloudera.com>
>>>                     <mailto:hshreedharan@cloudera.com
>>>
>>>                     <ma...@cloudera.com>>> wrote:
>>>
>>>                         In both cases, Sharninder is right :)
>>>
>>>                         Sharninder wrote:
>>>
>>>
>>>
>>>
>>>                         As far as I know, there is no (open source)
>>>                     implementation of an S3
>>>                         source, so yes, you'll have to implement
>>>                     your own. You'll have to
>>>                         implement a Pollable source and the dev
>>>                     documentation has an outline
>>>                         that you can use. You can also look at the
>>>                     existing Execsource and
>>>                         work your way up.
>>>
>>>                         As far as I know, there is no way to
>>>                     configure flume without
>>>                         using the
>>>                         configuration file.
>>>
>>>
>>>
>>>                         On Thu, Jul 31, 2014 at 7:57 PM, Paweł
>>>                     <prog88@gmail.com <ma...@gmail.com>
>>>                     <mailto:prog88@gmail.com <ma...@gmail.com>>
>>>                     <mailto:prog88@gmail.com
>>>                     <ma...@gmail.com>
>>>                     <mailto:prog88@gmail.com
>>>                     <ma...@gmail.com>>>> wrote:
>>>
>>>                             Hi,
>>>                             I'm wondering if Flume is able to read
>>>                     directly from S3.
>>>
>>>                             I'll describe my case. I have log files
>>>                     stored in AWS S3. I have
>>>                             to fetch periodically new S3 objects and
>>>                     read log lines from it.
>>>                             Then the log lines (events) are
>>>                     processed in the standard Flume way
>>>                             (as with other sources).
>>>
>>>                             *1) Is there any way to fetch S3 objects
>>>                     or I have to write
>>>                         my own
>>>                             Source?*
>>>
>>>
>>>                             There is also second case. I want to
>>>                     have flume configuration
>>>                             dynamic. Flume sources can change in
>>>                     time. New AWS key and S3
>>>                             bucket can be added or deleted.
>>>
>>>                             *2) Is there any other way to configure
>>>                     Flume than by static
>>>                             configuration file?*
>>>
>>>                             --
>>>                             Paweł Róg
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>


-- 
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal

Re: AWS S3 flume source

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi,

On Tue, Aug 5, 2014 at 10:57 PM, Jonathan Natkins <na...@streamsets.com>
wrote:

> Hi all,
>
> I started trying to write some code on this, and realized there are a
> number of issues that need to be discussed in order to really design this
> feature effectively. The requirements that have been discussed thus far are:
>
> 1. Fetching data from S3 periodically
> 2. Fetching data from multiple S3 buckets -- This may be something that
> should be punted on until later. For a first implementation, this could be
> solved just by having multiple sources, each with a single S3 bucket
>
3. Associating an S3 bucket with a user/token/key -- *Otis - can you
> clarify what you mean by this?*
>

Think about a multi-tenant application where each tenant wants the
application to fetch their data from S3.  Each tenant has some sort of ID.
So what I meant by the above is that this ID probably needs to be "carried
around" in the S3 Source => Flume => Some Sink pipeline.  For example,
imagine data from S3 needs to end up in an Elasticsearch cluster shared by
multiple tenants, where each tenant has a separate index named with the
tenant's ID.  In order to write each tenant's data into their index, data
pulled from S3 needs to carry the tenant's ID with it, so the Sink can
write it to the correct index.

I hope I managed to explain it clearly. :)
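
For illustration only, one way to carry that ID is a Flume event header
set by the source and read by the sink; the header name and index naming
scheme below are assumptions, not an agreed design:

    import java.nio.charset.StandardCharsets;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.flume.Event;
    import org.apache.flume.event.EventBuilder;

    public class TenantTagging {
        // Source side: tag each event with the tenant that owns the S3 bucket.
        static Event tagged(String line, String tenantId) {
            Map<String, String> headers = new HashMap<String, String>();
            headers.put("tenantId", tenantId); // assumed header name
            return EventBuilder.withBody(line.getBytes(StandardCharsets.UTF_8), headers);
        }

        // Sink side: route each event to the tenant's own Elasticsearch index.
        static String indexFor(Event event) {
            return "logs-" + event.getHeaders().get("tenantId"); // assumed naming scheme
        }
    }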

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/



> 4. Dynamically reconfigure the source -- This is blocked by FLUME-1491, so
> I think this is out-of-scope for discussions at the moment
>
> Some questions I want to try to answer:
>
> 1. How do we identify and track objects that need to be processed versus
> objects that have been processed already?
> 1a. What if we want to have multiple sources working against the
> same bucket to speed up processing?
> 2. Is it fair to assume that we're dealing with character files, rather
> than binary objects?
>
>  For the first question, if we ignore the multiple source extension of
> the question, I think the simplest answer is to do something on the local
> filesystem, like have a tracking directory that contains a list of
> to-be-processed objects and a list of already-processed objects. However,
> if the source goes down, what should the restart semantics be? It seems
> that the ideal situation is to store this state in a system like ZooKeeper,
> which would ensure that a number of sources could operate off of the same
> bucket, but this probably requires FLUME-1491 first.
>
> For the second question, my feeling was just that we should work with
> similar assumptions to how the SpoolingDirectorySource works, where each
> line is a separate event. Does that seem reasonable?
>
> Thanks,
> Natty
>
>
> On Fri, Aug 1, 2014 at 11:31 AM, Paweł <pr...@gmail.com> wrote:
>
>> Hi,
>> Thanks for the explanation, Jonathan. I think I will also start working on it.
>> When you have any patch (even draft) I'd be glad if you can attach it in
>> JIRA. I'll do the same.
>> What do you think?
>>
>> --
>> Paweł Róg
>>
>> 2014-08-01 20:19 GMT+02:00 Hari Shreedharan <hs...@cloudera.com>:
>>
>> +1 on an S3 Source. I would gladly review.
>>>
>>> Jonathan Natkins wrote:
>>>
>>>
>>> Hey Pawel,
>>>
>>> My intention is to start working on it, but I don't know exactly how
>>> long it will take, and I'm not a committer, so time estimates would
>>> have to be taken with a grain of salt regardless. If this is something
>>> that you need urgently, it may not be ideal to wait for me to start
>>> building something for yourself.
>>>
>>> That said, as mentioned in the other thread, dynamic configuration can
>>> be done by refreshing the configuration files across the set of Flume
>>> agents. It's certainly not as great as having a single place to change
>>> it (e.g. ZooKeeper), but it's a way to get the job done.
>>>
>>> Thanks,
>>> Natty
>>>
>>>
>>> On Fri, Aug 1, 2014 at 1:33 AM, Paweł <prog88@gmail.com
>>> <ma...@gmail.com>> wrote:
>>>
>>>     Hi,
>>>     Jonathan, how should we interpret your last e-mail? You opened a
>>>     JIRA issue and want to start implementing this; do you have any
>>>     estimate of how long it will take?
>>>
>>>     I think the biggest challenge here is to have dynamic
>>>     configuration of Flume. It doesn't seem to be part of FLUME-2437
>>>     issue. Am I right?
>>>
>>>     > Would you need to be able to pull files from multiple S3
>>>     directories with the same source?
>>>
>>>     I think we don't need to track multiple S3 buckets with a single
>>>     source. I just imagine an approach where each S3 source can be
>>>     added or deleted on demand and attached to any Channel. I'm only
>>>     afraid about this dynamic configuration. I'll open a new thread
>>>     about this. It seems we have two totally separate things:
>>>     * build S3 source
>>>     * make flume configurable dynamically
>>>
>>>     --
>>>     Paweł
>>>
>>>
>>>     2014-08-01 9:51 GMT+02:00 Otis Gospodnetic
>>>     <otis.gospodnetic@gmail.com <ma...@gmail.com>>:
>>>
>>>
>>>         Hi,
>>>
>>>         On Fri, Aug 1, 2014 at 4:52 AM, Jonathan Natkins
>>>         <natty@streamsets.com <ma...@streamsets.com>> wrote:
>>>
>>>             Hey all,
>>>
>>>             I created a JIRA for this:
>>>             https://issues.apache.org/jira/browse/FLUME-2437
>>>
>>>
>>>         Thanks!  Should Fix Version be set to the next Flume release
>>>         version?
>>>
>>>             I thought I'd start working on one myself, which can
>>>             hopefully be contributed back. I'm curious: do you have
>>>             particular requirements? Based on the emails in this
>>>             thread, it sounds like the original goal was to have
>>>             something that's like a SpoolDirectorySource that just
>>>             picks up new files from S3. Is that accurate?
>>>
>>>
>>>         Yes, I think so.  We need to be able to:
>>>         * fetch data (logs for pulling them in Logsene
>>>         <http://sematext.com/logsene/>) from S3 periodically (e.g.
>>>
>>>         every 1 min, every 5 min, etc.)
>>>         * fetch data from multiple S3 buckets
>>>         * associate an S3 bucket with a user/token/key
>>>         * dynamically (i.e. without editing/writing config files
>>>         stored on disk) add new S3 buckets from which data should be
>>> fetched
>>>         * dynamically (i.e. without editing/writing config files
>>>         stored on disk) stop fetching data from some S3 buckets
>>>
>>>
>>>             Would you need to be able to pull files from multiple S3
>>>             directories with the same source?
>>>
>>>
>>>         I think the above addresses this question.
>>>
>>>             Thanks,
>>>             Natty
>>>
>>>
>>>         Thanks!
>>>
>>>         Otis
>>>         --
>>>         Performance Monitoring * Log Analytics * Search Analytics
>>>         Solr & Elasticsearch Support * http://sematext.com/
>>>
>>>
>>>
>>>             On Thu, Jul 31, 2014 at 4:58 PM, Otis Gospodnetic
>>>             <otis.gospodnetic@gmail.com
>>>             <ma...@gmail.com>> wrote:
>>>
>>>                 +1 for seeing S3Source, starting with a JIRA issue.
>>>
>>>                 But being able to dynamically add/remove S3 buckets
>>>                 from which to pull data seems important.
>>>
>>>                 Any suggestions for how to approach that?
>>>
>>>                 Otis
>>>                 --
>>>                 Performance Monitoring * Log Analytics * Search Analytics
>>>                 Solr & Elasticsearch Support * http://sematext.com/
>>>
>>>
>>>                 On Thu, Jul 31, 2014 at 9:14 PM, Hari Shreedharan
>>>                 <hshreedharan@cloudera.com
>>>                 <ma...@cloudera.com>> wrote:
>>>
>>>                     Please go ahead and file a jira. If you are
>>>                     willing to submit a patch, you can post it on the
>>>                     jira.
>>>
>>>                     Viral Bajaria wrote:
>>>
>>>
>>>
>>>                     I have a similar use case that cropped up
>>>                     yesterday. I saw the archive
>>>                     and found that there was a recommendation to
>>>                     build it as Sharninder
>>>                     suggested.
>>>
>>>                     For now, I went down the route of writing a
>>>                     python script which
>>>                     downloads from S3 and puts the files in a
>>>                     directory which is
>>>                     configured to be picked up via a spooldir.
>>>
>>>                     I would prefer to get a direct S3 source, and
>>>                     maybe we could
>>>                     collaborate on it and open-source it. Let me know
>>>                     if you prefer that
>>>                     and we can work directly on it by creating a JIRA.
>>>
>>>                     Thanks,
>>>                     Viral
>>>
>>>
>>>
>>>                     On Thu, Jul 31, 2014 at 10:26 AM, Hari Shreedharan
>>>                     <hshreedharan@cloudera.com
>>>                     <ma...@cloudera.com>
>>>                     <mailto:hshreedharan@cloudera.com
>>>
>>>                     <ma...@cloudera.com>>> wrote:
>>>
>>>                         In both cases, Sharninder is right :)
>>>
>>>                         Sharninder wrote:
>>>
>>>
>>>
>>>
>>>                         As far as I know, there is no (open source)
>>>                     implementation of an S3
>>>                         source, so yes, you'll have to implement
>>>                     your own. You'll have to
>>>                         implement a Pollable source and the dev
>>>                     documentation has an outline
>>>                         that you can use. You can also look at the
>>>                     existing Execsource and
>>>                         work your way up.
>>>
>>>                         As far as I know, there is no way to
>>>                     configure flume without
>>>                         using the
>>>                         configuration file.
>>>
>>>
>>>
>>>                         On Thu, Jul 31, 2014 at 7:57 PM, Paweł
>>>                     <prog88@gmail.com <ma...@gmail.com>
>>>                     <mailto:prog88@gmail.com <ma...@gmail.com>>
>>>                     <mailto:prog88@gmail.com
>>>                     <ma...@gmail.com>
>>>                     <mailto:prog88@gmail.com
>>>                     <ma...@gmail.com>>>> wrote:
>>>
>>>                             Hi,
>>>                             I'm wondering if Flume is able to read
>>>                     directly from S3.
>>>
>>>                             I'll describe my case. I have log files
>>>                     stored in AWS S3. I have
>>>                             to fetch periodically new S3 objects and
>>>                     read log lines from it.
>>>                             Then the log lines (events) are
>>>                     processed in the standard Flume way
>>>                             (as with other sources).
>>>
>>>                             *1) Is there any way to fetch S3 objects
>>>                     or I have to write
>>>                         my own
>>>                             Source?*
>>>
>>>
>>>                             There is also second case. I want to
>>>                     have flume configuration
>>>                             dynamic. Flume sources can change in
>>>                     time. New AWS key and S3
>>>                             bucket can be added or deleted.
>>>
>>>                             *2) Is there any other way to configure
>>>                     Flume than by static
>>>                             configuration file?*
>>>
>>>                             --
>>>                             Paweł Róg
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>

Re: AWS S3 flume source

Posted by Jonathan Natkins <na...@streamsets.com>.
Hi all,

I started trying to write some code on this, and realized there are a
number of issues that need to be discussed in order to really design this
feature effectively. The requirements that have been discussed thus far are:

1. Fetching data from S3 periodically
2. Fetching data from multiple S3 buckets -- This may be something that
should be punted on until later. For a first implementation, this could be
solved just by having multiple sources, each with a single S3 bucket
3. Associating an S3 bucket with a user/token/key -- *Otis - can you
clarify what you mean by this?*
4. Dynamically reconfigure the source -- This is blocked by FLUME-1491, so
I think this is out-of-scope for discussions at the moment

Some questions I want to try to answer:

1. How do we identify and track objects that need to be processed versus
objects that have been processed already?
1a. What if we want to have multiple sources working against the same
bucket to speed up processing?
2. Is it fair to assume that we're dealing with character files, rather
than binary objects?

For the first question, if we ignore the multiple-source extension of the
question, I think the simplest answer is to do something on the local
filesystem, like have a tracking directory that contains a list of
to-be-processed objects and a list of already-processed objects. However,
if the source goes down, what should the restart semantics be? It seems
that the ideal situation is to store this state in a system like ZooKeeper,
which would ensure that a number of sources could operate off of the same
bucket, but this probably requires FLUME-1491 first.
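
To make the local-filesystem idea concrete, here is one possible shape for
the tracker (the file name and layout are assumptions, and the restart
semantics are exactly the open question above):

    import java.io.FileWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.HashSet;
    import java.util.Set;

    public class ProcessedKeyTracker {
        private final Path doneFile; // e.g. <trackerDir>/processed.list
        private final Set<String> done = new HashSet<String>();

        public ProcessedKeyTracker(String trackerDir) throws IOException {
            this.doneFile = Paths.get(trackerDir, "processed.list");
            if (Files.exists(doneFile)) { // reload state after a restart
                done.addAll(Files.readAllLines(doneFile, StandardCharsets.UTF_8));
            }
        }

        public boolean alreadyProcessed(String s3Key) {
            return done.contains(s3Key);
        }

        /** Record a key only after its events were committed to the channel. */
        public synchronized void markProcessed(String s3Key) throws IOException {
            FileWriter w = new FileWriter(doneFile.toFile(), true); // append-only log
            w.write(s3Key + "\n");
            w.close();
            done.add(s3Key);
        }
    }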

For the second question, my feeling was just that we should work with
similar assumptions to how the SpoolingDirectorySource works, where each
line is a separate event. Does that seem reasonable?
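
And for concreteness, the source class itself would presumably be a bare
skeleton along these lines, against the Flume 1.5-era PollableSource
interface; pollBucketOnce() is a hypothetical stand-in for the S3 listing
and reading discussed above:

    import java.util.Collections;
    import java.util.List;
    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.EventDeliveryException;
    import org.apache.flume.PollableSource;
    import org.apache.flume.conf.Configurable;
    import org.apache.flume.source.AbstractSource;

    public class S3Source extends AbstractSource
            implements PollableSource, Configurable {
        private String bucket;

        @Override
        public void configure(Context context) {
            bucket = context.getString("bucket"); // assumed config key
        }

        @Override
        public Status process() throws EventDeliveryException {
            List<Event> batch = pollBucketOnce(bucket);
            if (batch.isEmpty()) {
                return Status.BACKOFF; // nothing new; the runner will sleep
            }
            getChannelProcessor().processEventBatch(batch);
            return Status.READY;
        }

        private List<Event> pollBucketOnce(String bucket) {
            // TODO: list the bucket, skip processed keys, build events
            return Collections.<Event>emptyList();
        }
    }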

Thanks,
Natty


On Fri, Aug 1, 2014 at 11:31 AM, Paweł <pr...@gmail.com> wrote:

> Hi,
> Thanks for the explanation, Jonathan. I think I will also start working on it.
> When you have any patch (even draft) I'd be glad if you can attach it in
> JIRA. I'll do the same.
> What do you think?
>
> --
> Paweł Róg
>
> 2014-08-01 20:19 GMT+02:00 Hari Shreedharan <hs...@cloudera.com>:
>
> +1 on an S3 Source. I would gladly review.
>>
>> Jonathan Natkins wrote:
>>
>>
>> Hey Pawel,
>>
>> My intention is to start working on it, but I don't know exactly how
>> long it will take, and I'm not a committer, so time estimates would
>> have to be taken with a grain of salt regardless. If this is something
>> that you need urgently, it may not be ideal to wait for me to start
>> building something for yourself.
>>
>> That said, as mentioned in the other thread, dynamic configuration can
>> be done by refreshing the configuration files across the set of Flume
>> agents. It's certainly not as great as having a single place to change
>> it (e.g. ZooKeeper), but it's a way to get the job done.
>>
>> Thanks,
>> Natty
>>
>>
>> On Fri, Aug 1, 2014 at 1:33 AM, Paweł <prog88@gmail.com
>> <ma...@gmail.com>> wrote:
>>
>>     Hi,
>>     Jonathan, how should we interpret your last e-mail? You opened a
>>     JIRA issue and want to start implementing this; do you have any
>>     estimate of how long it will take?
>>
>>     I think the biggest challenge here is to have dynamic
>>     configuration of Flume. It doesn't seem to be part of FLUME-2437
>>     issue. Am I right?
>>
>>     > Would you need to be able to pull files from multiple S3
>>     directories with the same source?
>>
>>     I think we don't need to track multiple S3 buckets with a single
>>     source. I just imagine an approach where each S3 source can be
>>     added or deleted on demand and attached to any Channel. I'm only
>>     afraid about this dynamic configuration. I'll open a new thread
>>     about this. It seems we have two totally separate things:
>>     * build S3 source
>>     * make flume configurable dynamically
>>
>>     --
>>     Paweł
>>
>>
>>     2014-08-01 9:51 GMT+02:00 Otis Gospodnetic
>>     <otis.gospodnetic@gmail.com <ma...@gmail.com>>:
>>
>>
>>         Hi,
>>
>>         On Fri, Aug 1, 2014 at 4:52 AM, Jonathan Natkins
>>         <natty@streamsets.com <ma...@streamsets.com>> wrote:
>>
>>             Hey all,
>>
>>             I created a JIRA for this:
>>             https://issues.apache.org/jira/browse/FLUME-2437
>>
>>
>>         Thanks!  Should Fix Version be set to the next Flume release
>>         version?
>>
>>             I thought I'd start working on one myself, which can
>>             hopefully be contributed back. I'm curious: do you have
>>             particular requirements? Based on the emails in this
>>             thread, it sounds like the original goal was to have
>>             something that's like a SpoolDirectorySource that just
>>             picks up new files from S3. Is that accurate?
>>
>>
>>         Yes, I think so.  We need to be able to:
>>         * fetch data (logs for pulling them in Logsene
>>         <http://sematext.com/logsene/>) from S3 periodically (e.g.
>>
>>         every 1 min, every 5 min, etc.)
>>         * fetch data from multiple S3 buckets
>>         * associate an S3 bucket with a user/token/key
>>         * dynamically (i.e. without editing/writing config files
>>         stored on disk) add new S3 buckets from which data should be fetched
>>         * dynamically (i.e. without editing/writing config files
>>         stored on disk) stop fetching data from some S3 buckets
>>
>>
>>             Would you need to be able to pull files from multiple S3
>>             directories with the same source?
>>
>>
>>         I think the above addresses this question.
>>
>>             Thanks,
>>             Natty
>>
>>
>>         Thanks!
>>
>>         Otis
>>         --
>>         Performance Monitoring * Log Analytics * Search Analytics
>>         Solr & Elasticsearch Support * http://sematext.com/
>>
>>
>>
>>             On Thu, Jul 31, 2014 at 4:58 PM, Otis Gospodnetic
>>             <otis.gospodnetic@gmail.com
>>             <ma...@gmail.com>> wrote:
>>
>>                 +1 for seeing S3Source, starting with a JIRA issue.
>>
>>                 But being able to dynamically add/remove S3 buckets
>>                 from which to pull data seems important.
>>
>>                 Any suggestions for how to approach that?
>>
>>                 Otis
>>                 --
>>                 Performance Monitoring * Log Analytics * Search Analytics
>>                 Solr & Elasticsearch Support * http://sematext.com/
>>
>>
>>                 On Thu, Jul 31, 2014 at 9:14 PM, Hari Shreedharan
>>                 <hshreedharan@cloudera.com
>>                 <ma...@cloudera.com>> wrote:
>>
>>                     Please go ahead and file a jira. If you are
>>                     willing to submit a patch, you can post it on the
>>                     jira.
>>
>>                     Viral Bajaria wrote:
>>
>>
>>
>>                     I have a similar use case that cropped up
>>                     yesterday. I saw the archive
>>                     and found that there was a recommendation to
>>                     build it as Sharninder
>>                     suggested.
>>
>>                     For now, I went down the route of writing a
>>                     python script which
>>                     downloads from S3 and puts the files in a
>>                     directory which is
>>                     configured to be picked up via a spooldir.
>>
>>                     I would prefer to get a direct S3 source, and
>>                     maybe we could
>>                     collaborate on it and open-source it. Let me know
>>                     if you prefer that
>>                     and we can work directly on it by creating a JIRA.
>>
>>                     Thanks,
>>                     Viral
>>
>>
>>
>>                     On Thu, Jul 31, 2014 at 10:26 AM, Hari Shreedharan
>>                     <hshreedharan@cloudera.com
>>                     <ma...@cloudera.com>
>>                     <mailto:hshreedharan@cloudera.com
>>
>>                     <ma...@cloudera.com>>> wrote:
>>
>>                         In both cases, Sharninder is right :)
>>
>>                         Sharninder wrote:
>>
>>
>>
>>
>>                         As far as I know, there is no (open source)
>>                     implementation of an S3
>>                         source, so yes, you'll have to implement
>>                     your own. You'll have to
>>                         implement a Pollable source and the dev
>>                     documentation has an outline
>>                         that you can use. You can also look at the
>>                     existing Execsource and
>>                         work your way up.
>>
>>                         As far as I know, there is no way to
>>                     configure flume without
>>                         using the
>>                         configuration file.
>>
>>
>>
>>                         On Thu, Jul 31, 2014 at 7:57 PM, Paweł
>>                     <prog88@gmail.com <ma...@gmail.com>
>>                     <mailto:prog88@gmail.com <ma...@gmail.com>>
>>                     <mailto:prog88@gmail.com
>>                     <ma...@gmail.com>
>>                     <mailto:prog88@gmail.com
>>                     <ma...@gmail.com>>>> wrote:
>>
>>                             Hi,
>>                             I'm wondering if Flume is able to read
>>                     directly from S3.
>>
>>                             I'll describe my case. I have log files
>>                     stored in AWS S3. I have
>>                             to fetch periodically new S3 objects and
>>                     read log lines from it.
>>                             Then the log lines (events) are
>>                     processed in the standard Flume way
>>                             (as with other sources).
>>
>>                             *1) Is there any way to fetch S3 objects
>>                     or I have to write
>>                         my own
>>                             Source?*
>>
>>
>>                             There is also second case. I want to
>>                     have flume configuration
>>                             dynamic. Flume sources can change in
>>                     time. New AWS key and S3
>>                             bucket can be added or deleted.
>>
>>                             *2) Is there any other way to configure
>>                     Flume than by static
>>                             configuration file?*
>>
>>                             --
>>                             Paweł Róg
>>
>>
>>
>>
>>
>>
>>
>>
>>
>

Re: AWS S3 flume source

Posted by Paweł <pr...@gmail.com>.
Hi,
Thanks for the explanation, Jonathan. I think I will also start working on it.
When you have a patch (even a draft), I'd be glad if you could attach it in
JIRA. I'll do the same.
What do you think?

--
Paweł Róg

2014-08-01 20:19 GMT+02:00 Hari Shreedharan <hs...@cloudera.com>:

> +1 on an S3 Source. I would gladly review.
>
> Jonathan Natkins wrote:
>
>
> Hey Pawel,
>
> My intention is to start working on it, but I don't know exactly how
> long it will take, and I'm not a committer, so time estimates would
> have to be taken with a grain of salt regardless. If this is something
> that you need urgently, it may not be ideal to wait for me; you may want
> to start building something yourself.
>
> That said, as mentioned in the other thread, dynamic configuration can
> be done by refreshing the configuration files across the set of Flume
> agents. It's certainly not as great as having a single place to change
> it (e.g. ZooKeeper), but it's a way to get the job done.
>
> Thanks,
> Natty
>
>
> On Fri, Aug 1, 2014 at 1:33 AM, Paweł <prog88@gmail.com
> <ma...@gmail.com>> wrote:
>
>     Hi,
>     Jonathan, how should we interpret your last e-mail? You opened a
>     JIRA issue and want to start implementing this; do you have any
>     estimate of how long it will take?
>
>     I think the biggest challenge here is to have dynamic
>     configuration of Flume. It doesn't seem to be part of FLUME-2437
>     issue. Am I right?
>
>     > Would you need to be able to pull files from multiple S3
>     directories with the same source?
>
>     I think we don't need to track multiple S3 buckets with a single
>     source. I just imagine an approach where each S3 source can be
>     added or deleted on demand and attached to any Channel. I'm only
>     afraid about this dynamic configuration. I'll open a new thread
>     about this. It seems we have two totally separate things:
>     * build S3 source
>     * make flume configurable dynamically
>
>     --
>     Paweł
>
>
>     2014-08-01 9:51 GMT+02:00 Otis Gospodnetic
>     <otis.gospodnetic@gmail.com <ma...@gmail.com>>:
>
>
>         Hi,
>
>         On Fri, Aug 1, 2014 at 4:52 AM, Jonathan Natkins
>         <natty@streamsets.com <ma...@streamsets.com>> wrote:
>
>             Hey all,
>
>             I created a JIRA for this:
>             https://issues.apache.org/jira/browse/FLUME-2437
>
>
>         Thanks!  Should Fix Version be set to the next Flume release
>         version?
>
>             I thought I'd start working on one myself, which can
>             hopefully be contributed back. I'm curious: do you have
>             particular requirements? Based on the emails in this
>             thread, it sounds like the original goal was to have
>             something that's like a SpoolDirectorySource that just
>             picks up new files from S3. Is that accurate?
>
>
>         Yes, I think so.  We need to be able to:
>         * fetch data (logs for pulling them in Logsene
>         <http://sematext.com/logsene/>) from S3 periodically (e.g.
>
>         every 1 min, every 5 min, etc.)
>         * fetch data from multiple S3 buckets
>         * associate an S3 bucket with a user/token/key
>         * dynamically (i.e. without editing/writing config files
>         stored on disk) add new S3 buckets from which data should be fetched
>         * dynamically (i.e. without editing/writing config files
>         stored on disk) stop fetching data from some S3 buckets
>
>
>             Would you need to be able to pull files from multiple S3
>             directories with the same source?
>
>
>         I think the above addresses this question.
>
>             Thanks,
>             Natty
>
>
>         Thanks!
>
>         Otis
>         --
>         Performance Monitoring * Log Analytics * Search Analytics
>         Solr & Elasticsearch Support * http://sematext.com/
>
>
>
>             On Thu, Jul 31, 2014 at 4:58 PM, Otis Gospodnetic

Re: AWS S3 flume source

Posted by Hari Shreedharan <hs...@cloudera.com>.
+1 on an S3 Source. I would gladly review.


Re: AWS S3 flume source

Posted by Jonathan Natkins <na...@streamsets.com>.
Hey Pawel,

My intention is to start working on it, but I don't know exactly how long
it will take, and I'm not a committer, so any time estimate would have to
be taken with a grain of salt. If this is something you need urgently, it
may be better to build something yourself rather than wait for me.

That said, as mentioned in the other thread, dynamic configuration can be
done by refreshing the configuration files across the set of Flume agents.
It's certainly not as great as having a single place to change it (e.g.
ZooKeeper), but it's a way to get the job done.
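
To illustrate the refresh approach: if I remember right, an agent launched
via the flume-ng script polls its properties file (roughly every 30
seconds) and reloads components that changed, so "dynamic" here can mean
an external tool simply rewrites the file. A made-up example, using the
spooldir workaround mentioned earlier in the thread (all names and paths
are invented):

    # flume-conf.properties for a hypothetical agent "a1"
    a1.sources = s3spool
    a1.channels = c1

    a1.sources.s3spool.type = spooldir
    a1.sources.s3spool.spoolDir = /var/flume/s3-staging
    a1.sources.s3spool.channels = c1

    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000

    # "Dynamic" reconfiguration then means a deployment tool rewrites
    # this file (e.g. appends a second source for a new bucket) and the
    # running agent picks the change up on its next config poll.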

Thanks,
Natty



Re: AWS S3 flume source

Posted by Paweł <pr...@gmail.com>.
Hi,
Jonathan, how should we interpret your last e-mail? You opened a JIRA
issue and want to start implementing this; do you have any estimate of
how long it will take?

I think the biggest challenge here is to have dynamic configuration of
Flume. It doesn't seem to be part of the FLUME-2437 issue. Am I right?

> Would you need to be able to pull files from multiple S3 directories with
the same source?

I don't think we need to track multiple S3 buckets with a single source. I
imagine an approach where each S3 source can be added or deleted on demand
and attached to any Channel. My only concern is the dynamic configuration.
I'll open a new thread about this. It seems we have two totally separate
things:
* build an S3 source
* make Flume configurable dynamically (one possible approach is sketched
below)
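
For the second item, one option that has come up is ZooKeeper. A minimal,
untested sketch of what watching a configuration znode could look like
with Curator (the znode path is made up, and the actual reload wiring into
Flume is left open):

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.cache.NodeCache;
    import org.apache.curator.framework.recipes.cache.NodeCacheListener;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    // Illustrative only: watch a znode holding the agent configuration
    // and react when it changes. Nothing here is existing Flume machinery.
    public class ZkConfigWatcher {

        public static void main(String[] args) throws Exception {
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk1:2181", new ExponentialBackoffRetry(1000, 3));
            client.start();

            // NodeCache mirrors the znode and fires a listener on update.
            final NodeCache cache =
                    new NodeCache(client, "/flume/agent1/config");
            cache.getListenable().addListener(new NodeCacheListener() {
                @Override
                public void nodeChanged() throws Exception {
                    byte[] data = cache.getCurrentData().getData();
                    // A real integration would hand this to Flume's
                    // configuration provider to rebuild components.
                    System.out.println("config changed:\n"
                            + new String(data, "UTF-8"));
                }
            });
            cache.start();

            Thread.sleep(Long.MAX_VALUE); // keep watching
        }
    }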

--
Paweł



Re: AWS S3 flume source

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi,

On Fri, Aug 1, 2014 at 4:52 AM, Jonathan Natkins <na...@streamsets.com>
wrote:

> Hey all,
>
> I created a JIRA for this:
> https://issues.apache.org/jira/browse/FLUME-2437
>

Thanks!  Should Fix Version be set to the next Flume release version?

I thought I'd start working on one myself, which can hopefully be
> contributed back. I'm curious: do you have particular requirements? Based
> on the emails in this thread, it sounds like the original goal was to have
> something that's like a SpoolDirectorySource that just picks up new files
> from S3. Is that accurate?
>

Yes, I think so.  We need to be able to:
* fetch data (logs to pull into Logsene
<http://sematext.com/logsene/>) from S3 periodically (e.g. every 1 min,
every 5 min, etc.)
* fetch data from multiple S3 buckets
* associate each S3 bucket with a user/token/key
* dynamically (i.e. without editing/writing config files stored on disk)
add new S3 buckets from which data should be fetched
* dynamically (i.e. without editing/writing config files stored on disk)
stop fetching data from some S3 buckets (a rough sketch of what such a
bucket registry could look like follows)
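
To make the last two points concrete, something like the following could
sit behind the source. Purely illustrative: none of these class or method
names exist in Flume or the AWS SDK, and a real registry would persist its
state rather than keep it in memory:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical registry of S3 buckets to poll; all names made up.
    // The source consults it on every poll cycle, so buckets added or
    // removed at runtime take effect without touching a config file.
    public class S3BucketRegistry {

        // Credentials associated with one bucket.
        public static final class BucketCredentials {
            public final String accessKey;
            public final String secretKey;

            public BucketCredentials(String accessKey, String secretKey) {
                this.accessKey = accessKey;
                this.secretKey = secretKey;
            }
        }

        // Thread-safe so an admin/API thread can mutate it while the
        // source's poll thread reads it.
        private final Map<String, BucketCredentials> buckets =
                new ConcurrentHashMap<String, BucketCredentials>();

        public void addBucket(String name, String accessKey,
                String secretKey) {
            buckets.put(name, new BucketCredentials(accessKey, secretKey));
        }

        public void removeBucket(String name) {
            buckets.remove(name);
        }

        // Snapshot for the poll loop to iterate over safely.
        public Map<String, BucketCredentials> snapshot() {
            return Collections.unmodifiableMap(
                    new HashMap<String, BucketCredentials>(buckets));
        }
    }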


> Would you need to be able to pull files from multiple S3 directories with
> the same source?
>

I think the above addresses this question.


> Thanks,
> Natty
>

Thanks!

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/




Re: AWS S3 flume source

Posted by Jonathan Natkins <na...@streamsets.com>.
Hey all,

I created a JIRA for this: https://issues.apache.org/jira/browse/FLUME-2437

I thought I'd start working on one myself, which can hopefully be
contributed back. I'm curious: do you have particular requirements? Based
on the emails in this thread, it sounds like the original goal was to have
something that's like a SpoolDirectorySource that just picks up new files
from S3. Is that accurate?

Would you need to be able to pull files from multiple S3 directories with
the same source?
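
For what it's worth, here is roughly the shape I'd expect such a source to
take: a minimal, untested sketch assuming Flume 1.5-era interfaces and the
AWS SDK for Java v1. The class name, the bucket/prefix properties, and the
in-memory processed-key set are all invented; a real source would need
durable state, credentials handling, and backoff:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.flume.Context;
    import org.apache.flume.EventDeliveryException;
    import org.apache.flume.PollableSource;
    import org.apache.flume.conf.Configurable;
    import org.apache.flume.event.EventBuilder;
    import org.apache.flume.source.AbstractSource;

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.S3Object;
    import com.amazonaws.services.s3.model.S3ObjectSummary;

    public class S3Source extends AbstractSource
            implements Configurable, PollableSource {

        private AmazonS3 s3;
        private String bucket;
        private String prefix;
        // In-memory only; a real source needs durable tracking of
        // which objects it has already consumed.
        private final Set<String> processed = new HashSet<String>();

        @Override
        public void configure(Context context) {
            bucket = context.getString("bucket");
            prefix = context.getString("prefix", "");
            s3 = new AmazonS3Client(); // credentials from the environment
        }

        @Override
        public Status process() throws EventDeliveryException {
            boolean sawNewData = false;
            try {
                for (S3ObjectSummary summary :
                        s3.listObjects(bucket, prefix).getObjectSummaries()) {
                    if (!processed.add(summary.getKey())) {
                        continue; // already handled
                    }
                    sawNewData = true;
                    S3Object object = s3.getObject(bucket, summary.getKey());
                    BufferedReader reader = new BufferedReader(
                            new InputStreamReader(object.getObjectContent(),
                                    StandardCharsets.UTF_8));
                    try {
                        String line;
                        while ((line = reader.readLine()) != null) {
                            // One Flume event per log line, as with
                            // other line-oriented sources.
                            getChannelProcessor().processEvent(
                                    EventBuilder.withBody(line,
                                            StandardCharsets.UTF_8));
                        }
                    } finally {
                        reader.close();
                    }
                }
            } catch (Exception e) {
                throw new EventDeliveryException("S3 poll failed", e);
            }
            return sawNewData ? Status.READY : Status.BACKOFF;
        }
    }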

Thanks,
Natty

