Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2018/07/26 06:54:59 UTC

[GitHub] josephglanville commented on issue #5492: Native parallel batch indexing without shuffle

URL: https://github.com/apache/incubator-druid/pull/5492#issuecomment-407996093
 
 
   @jihoonson this is outside the scope of this PR, but would you be interested in collaborating on making the current layered Firehose abstractions better support non-textual formats?
   
   The backstory is that we mostly use InputRowParsers that operate on ByteBuffers, reading either raw byte messages from Kafka or SequenceFiles from archival storage (GCS in our case).
   
   We would like to modify or extend the current prefetching and iterating abstractions to support iteration over file formats other than newline-delimited text and, most importantly, to support emitting non-string rows for parsing, so that ByteBufferInputRowParsers can be used with native batch ingestion.
   
   In my mind there is a missing abstraction layer: one that creates an iterator from a file and returns rows that can then be passed to InputRowParsers.
   Basically an InputFileFormat interface, where the current implementation would be TextFileInputFormat and we would want a SequenceFileInputFormat, but any iterable file format would be possible.
   This separates the concern of reading rows out of the files from the Firehose, which should only be responsible for connecting to storage and fetching files.
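   
   A minimal sketch of what such a layer could look like, to make the idea concrete. Everything here is hypothetical and not existing Druid API: the `InputFileFormat` interface, its `read` method, and both implementations are illustrative names only. `LengthPrefixedInputFormat` is a simplified stand-in for a binary format such as SequenceFile (it assumes each record is a 4-byte big-endian length followed by the payload), not a real SequenceFile reader.

```java
import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical interface: turns one already-fetched file into an iterator of raw rows.
// T is whatever the downstream parser consumes (String, ByteBuffer, ...).
interface InputFileFormat<T> {
  Iterator<T> read(InputStream in);
}

// Roughly the behaviour described as the current one: one String row per line.
final class TextFileInputFormat implements InputFileFormat<String> {
  @Override
  public Iterator<String> read(InputStream in) {
    BufferedReader reader =
        new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    return reader.lines().iterator();
  }
}

// Stand-in for a binary format (assumed layout: 4-byte big-endian length, then payload).
// Emits ByteBuffer rows, so a ByteBuffer-based parser could consume them directly.
final class LengthPrefixedInputFormat implements InputFileFormat<ByteBuffer> {
  @Override
  public Iterator<ByteBuffer> read(InputStream in) {
    final DataInputStream data = new DataInputStream(in);
    return new Iterator<ByteBuffer>() {
      // Row read ahead of the caller; null once the stream is exhausted.
      private ByteBuffer pending = advance();

      private ByteBuffer advance() {
        try {
          int length;
          try {
            length = data.readInt();
          } catch (EOFException e) {
            return null; // no further length header: treat as end of file
          }
          byte[] payload = new byte[length];
          data.readFully(payload); // throws EOFException on a truncated record
          return ByteBuffer.wrap(payload);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      }

      @Override
      public boolean hasNext() {
        return pending != null;
      }

      @Override
      public ByteBuffer next() {
        if (pending == null) {
          throw new NoSuchElementException();
        }
        ByteBuffer row = pending;
        pending = advance();
        return row;
      }
    };
  }
}
```

   With something along these lines, the Firehose would only hand an InputStream for each fetched file to the configured format, and the resulting rows would go to the matching InputRowParser regardless of whether they are Strings or ByteBuffers.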
   
   We could of course sidestep this by creating a custom Firehose that implements exactly the logic we want and handles prefetching etc. without using the existing interfaces, but we would much prefer to upstream an approach that enables batch processing for all users who want to process non-textual formats.
