Posted to users@apex.apache.org by Aaron Bossert <aa...@punchcyber.com> on 2018/06/21 19:29:20 UTC
Before I start re-inventing the wheel: AbstractFileInputOperator
Folks,
I have been working with
com.datatorrent.lib.io.fs.AbstractFileInputOperator to accommodate a
slightly different use case, but keep running into inefficiencies. Up
front, before detailing the use case: I do have this working, but it feels
horribly inefficient and definitely far from elegant. My requirements are:
- Scan multiple directories (not one, as is expected)
- Accept changes to the set of scanned directories on the fly
- Accept multiple file types (identified by checking magic bytes/numbers)
- Assume that files may be in any of the following conditions:
  - "Raw"
  - Compressed
  - Archived
  - Compressed and archived
- Associate provenance (e.g. customer and sensor) with events extracted
  from these files
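The first two requirements (multiple directories, changeable at runtime) could be sketched roughly as below. MultiDirScanner and its method names are hypothetical illustrations, not part of the Malhar API; a CopyOnWriteArraySet lets the directory set be updated safely while a scan loop is running:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CopyOnWriteArraySet;

// Hypothetical sketch: a thread-safe directory set that can be modified
// on the fly while scan passes walk every registered directory.
class MultiDirScanner {
    private final Set<Path> dirs = new CopyOnWriteArraySet<>();

    public void addDirectory(Path dir) { dirs.add(dir); }

    public void removeDirectory(Path dir) { dirs.remove(dir); }

    // One scan pass: collect regular files from every registered directory.
    public List<Path> scan() throws IOException {
        List<Path> found = new ArrayList<>();
        for (Path dir : dirs) {
            if (!Files.isDirectory(dir)) continue;
            try (DirectoryStream<Path> ds = Files.newDirectoryStream(dir)) {
                for (Path p : ds) {
                    if (Files.isRegularFile(p)) found.add(p);
                }
            }
        }
        return found;
    }
}
```

A real operator would also need to track already-processed files across scans (as the Malhar DirectoryScanner does), which is omitted here.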
My existing solution was to provide my own implementation of
AbstractFileInputOperator.DirectoryScanner, and also to emit arrays/lists
of events rather than Strings (lines from each file), due to the binary
nature of most of my input file types.
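The magic-byte check from the requirements list could be sketched like this. MagicBytes is a hypothetical class, and only the GZIP and ZIP signatures are shown; a real detector would cover whatever formats actually arrive:

```java
// Hypothetical sketch: classify a file by its leading magic bytes rather
// than by extension. GZIP files begin 0x1F 0x8B; ZIP files begin "PK\3\4".
class MagicBytes {
    enum FileKind { GZIP, ZIP, RAW }

    static FileKind detect(byte[] header) {
        if (header.length >= 2
                && (header[0] & 0xFF) == 0x1F
                && (header[1] & 0xFF) == 0x8B) {
            return FileKind.GZIP;
        }
        if (header.length >= 4
                && header[0] == 'P' && header[1] == 'K'
                && header[2] == 3 && header[3] == 4) {
            return FileKind.RAW == null ? null : FileKind.ZIP; // see note below
        }
        return FileKind.RAW;
    }
}
```

(The detect method only needs the first few bytes of a file, so a reader can peek at the header with mark/reset and then hand the untouched stream to the right decoder.)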
I am seeing several mismatches between my use case and
AbstractFileInputOperator, but I also see a ton of existing work within it
that I would prefer not to redo (partitioning, fault tolerance, etc.). Is
there a more appropriate class/interface I should be looking at, or is it
appropriate to create a new interface for a directory scanner that
accounts for multiple directories and for compressed and archived files
(so that things like openFile would need to output a list of InputStreams,
at a minimum, to accommodate these files)? I just want to make sure I am
not overdoing things in a quest for more efficient and clean code.
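The "openFile yields a list of InputStreams" idea could look roughly like the sketch below, using only java.util.zip from the JDK. StreamExpander is a hypothetical name, and extension-based dispatch stands in for the real magic-byte detection; a zip expands to one stream per entry, a gzip unwraps to one decompressed stream, and anything else passes through:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical sketch of an openFile-style helper: one InputStream per
// logical payload, whether the source file is raw, compressed, or archived.
class StreamExpander {
    static List<InputStream> expand(String name, InputStream raw) throws IOException {
        List<InputStream> out = new ArrayList<>();
        if (name.endsWith(".zip")) {
            ZipInputStream zin = new ZipInputStream(raw);
            ZipEntry e;
            while ((e = zin.getNextEntry()) != null) {
                if (e.isDirectory()) continue;
                // Copy each entry out so the caller owns independent streams.
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = zin.read(chunk)) > 0) buf.write(chunk, 0, n);
                out.add(new ByteArrayInputStream(buf.toByteArray()));
            }
        } else if (name.endsWith(".gz")) {
            out.add(new GZIPInputStream(raw));
        } else {
            out.add(raw);
        }
        return out;
    }
}
```

Buffering whole zip entries in memory is obviously only workable for modest entry sizes; for large archives the entries would need to be streamed one at a time instead.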
--
M. Aaron Bossert
(571) 242-4021
Punch Cyber Analytics Group
Re: Before I start re-inventing the wheel: AbstractFileInputOperator
Posted by Aaron Bossert <aa...@punchcyber.com>.
Thomas,
You know, I did read through the source for that, but I guess my
imagination didn't kick in. Maybe I got stuck on the name being a file
splitter, as opposed to its potential to be re-purposed for archived
files. Thanks for the pointer. I will give that a go before re-inventing
the wheel.
On Fri, Jun 22, 2018 at 2:08 AM Thomas Weise <th...@apache.org> wrote:
> Did you already look at FileSplitter/BlockReader?
>
> https://apex.apache.org/docs/malhar/operators/file_splitter/
>
> Would that better support your customization requirements?
>
>
>
> --
> sent from mobile
>
>
--
M. Aaron Bossert
(571) 242-4021
Punch Cyber Analytics Group
Re: Before I start re-inventing the wheel: AbstractFileInputOperator
Posted by Thomas Weise <th...@apache.org>.
Did you already look at FileSplitter/BlockReader?
https://apex.apache.org/docs/malhar/operators/file_splitter/
Would that better support your customization requirements?
--
sent from mobile