Posted to dev@flume.apache.org by Jürgen Jakobitsch <j....@semantic-web.at> on 2016/10/11 15:02:07 UTC

SpoolingDirectorySource with parentDirectory header

hi,

for a project using the SpoolingDirectorySource with an HDFS sink i wanted
to have the same (relative) directory structure in HDFS as in the spool
directory, which uses subdirectories.

to achieve this i updated (all in flume-ng-core)

org/apache/flume/client/avro/ReliableSpoolingFileEventReader.java
org/apache/flume/source/SpoolDirectorySource.java
org/apache/flume/source/SpoolDirectorySourceConfigurationConstants.java

to include the following additional (optional) headers (analogous to
basenameHeader):

parentDirectory (the parent directory of the file)

example:
spooldirectory: /var/lib/flume/data/
file: /var/lib/flume/data/some/subdirectory/somefile.log

parentDirectoryHeader = /var/lib/flume/data/some/subdirectory/

relativeParentDirectory (the parent directory of the file relative to the
spool directory)

example:
spooldirectory: /var/lib/flume/data/
file: /var/lib/flume/data/some/subdirectory/somefile.log

relativeParentDirectoryHeader = some/subdirectory/
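To illustrate how the two header values relate to the spool directory, here is a minimal sketch using only java.nio.file. It is not the code from the patches: the class and method names (ParentDirectoryHeaders, headersFor) are made up for this example, and the header keys are simply the ones used in this mail. It assumes Unix-style paths.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class ParentDirectoryHeaders {

    // Hypothetical helper (not part of Flume or the patches): derives the
    // header values described above from the spool directory and one file.
    static Map<String, String> headersFor(Path spoolDir, Path file) {
        Path root = spoolDir.toAbsolutePath().normalize();
        Path parent = file.toAbsolutePath().normalize().getParent();

        // Path of the file's parent directory, relative to the spool directory.
        String relative = root.relativize(parent).toString();

        Map<String, String> headers = new HashMap<>();
        headers.put("basename", file.getFileName().toString());
        headers.put("parentDirectory", parent.toString() + "/");
        // A file directly in the spool directory gets an empty relative path.
        headers.put("relativeParentDirectory", relative.isEmpty() ? "" : relative + "/");
        return headers;
    }

    public static void main(String[] args) {
        Map<String, String> headers = headersFor(
                Paths.get("/var/lib/flume/data/"),
                Paths.get("/var/lib/flume/data/some/subdirectory/somefile.log"));

        // On a Unix-like system this prints:
        //   /var/lib/flume/data/some/subdirectory/
        //   some/subdirectory/
        System.out.println(headers.get("parentDirectory"));
        System.out.println(headers.get("relativeParentDirectory"));
    }
}
```

The trailing slash is appended by hand only to match the examples above; Path.toString() itself never ends with a separator.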


i'm now using the following flume config (excerpt) to get a nice folder
structure in HDFS:

flume.sources.dirSource.spoolDir = /var/lib/flume/data
flume.sources.dirSource.recursiveDirectorySearch = true
flume.sources.dirSource.basenameHeader = true
flume.sources.dirSource.basenameHeaderKey = basename
flume.sources.dirSource.relativeParentDirectoryHeader = true
flume.sources.dirSource.relativeParentDirectoryHeaderKey = relativeParentDirectory
...
flume.sinks.HDFS.type = hdfs
flume.sinks.HDFS.hdfs.path = hdfs://bigdata.example.com:54310/application/root/directory/%{relativeParentDirectory}
flume.sinks.HDFS.hdfs.fileType = DataStream
flume.sinks.HDFS.hdfs.filePrefix = %{basename}

example:

a file : /var/lib/flume/data/some/subdirectory/somefile.log
would now be stored in

hdfs://bigdata.example.com:54310/application/root/directory/some/subdirectory/somefile.log.1476194723885
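To make the %{...} substitution in the sink settings concrete, here is a rough standalone sketch of how the header escapes expand into that path. It is not Flume's own implementation: the class HeaderEscapeSketch and its expand method are invented for illustration, and replacing a missing header with an empty string is an assumption of this sketch. The numeric suffix in the example above is added by the sink when it names the file and is not reproduced here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeaderEscapeSketch {

    // Matches escapes of the form %{headerName}.
    private static final Pattern HEADER = Pattern.compile("%\\{([^}]+)\\}");

    // Hypothetical helper: replaces every %{name} with the matching header
    // value; unknown headers become "" (an assumption of this sketch).
    static String expand(String template, Map<String, String> headers) {
        Matcher m = HEADER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = headers.getOrDefault(m.group(1), "");
            // quoteReplacement keeps '$' and '\' in header values literal.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        headers.put("relativeParentDirectory", "some/subdirectory/");
        headers.put("basename", "somefile.log");

        String dir = expand(
                "hdfs://bigdata.example.com:54310/application/root/directory/%{relativeParentDirectory}",
                headers);
        String prefix = expand("%{basename}", headers);

        // Prints:
        // hdfs://bigdata.example.com:54310/application/root/directory/some/subdirectory/somefile.log
        System.out.println(dir + prefix);
    }
}
```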

i attach three patches in case someone finds this useful
(used: http://git-wip-us.apache.org/repos/asf/flume.git => branch : trunk
and instructions from here [1] to create the patch)

krj

[1]
http://stackoverflow.com/questions/9396240/how-do-i-simply-create-a-patch-from-my-latest-git-commit

*Jürgen Jakobitsch*
Innovation Director
Semantic Web Company GmbH
EU: +43-1-4021235-0
Mobile: +43-676-6212710
http://www.semantic-web.at
http://www.poolparty.biz


PERSONAL INFORMATION
| web       : http://www.turnguard.com
| foaf      : http://www.turnguard.com/turnguard
| g+        : https://plus.google.com/111233759991616358206/posts
| skype     : jakobitsch-punkt
| xmlns:tg  = "http://www.turnguard.com/turnguard#"

Re: SpoolingDirectorySource with parentDirectory header

Posted by Denes Arvay <de...@cloudera.com>.
Hi Jürgen,

Thank you for your patch, I think the proposed changes are valuable
additions to Flume.
Could you please file a Jira at https://issues.apache.org/jira/ and either
attach the patches to it and upload them to Reviewboard (
https://reviews.apache.org) or, if it's easier for you, issue a pull
request on github (start by forking https://github.com/apache/flume/)?

You can find more details on how to contribute on this wiki page:
https://cwiki.apache.org/confluence/display/FLUME/How+to+Contribute
(Note: we are currently experimenting with GitHub pull requests. This is not
mentioned in the doc yet, but feel free to use one if you prefer.)

I skimmed through your patches quickly and have the following comments:
- Could you please add tests? As a start, you can have a look at
TestSpoolDirectorySource, for example.
- Adding the newly added config parameters to the documentation would be
very helpful, too.

Let us know if you have any questions or need assistance on submitting the
patch.

Kind regards,
Denes
