Posted to user@drill.apache.org by WeiWan <we...@sunteng.com> on 2017/01/03 08:35:57 UTC

IndexR, a new storage plugin for Drill

IndexR is a distributed, columnar storage system based on HDFS that focuses on fast analysis of both massive static (historical) data and rapidly ingested realtime data. IndexR is designed for OLAP.

Fast analysis on large datasets
Realtime ingestion with zero delay for queries
Deep integration with the Hadoop ecosystem
Hardware efficiency
Highly available, scalable, manageable and simple
Adapted to popular query engines such as Apache Drill and Apache Hive

And now it is open source.

Project: https://github.com/shunfei/indexr
Wiki: https://github.com/shunfei/indexr/wiki

IndexR was originally developed by Sunteng Tech. The project started a year ago and has since been deployed to several production environments in our company. The whole cluster consumes over 30 billion events each day in realtime from Kafka. The largest table contains over 10 billion rows (after rollup) and is growing rapidly. Most statistical/analytical queries have a latency of less than 3 seconds in real-world production environments.

Currently it is mainly used as a Drill and Hive storage plugin. It should be quite easy to master.

We hope IndexR will be useful to you, and that you will help make it better.

Regards
Flow Wei

Re: IndexR, a new storage plugin for Drill

Posted by WeiWan <we...@sunteng.com>.

Hi Robin,

That is a very good question. Our team has known about Apache Kudu since the very beginning of the IndexR project.
They do have some things in common:
* streaming ingestion
* both claim fast analytics ability
* columnar storage
* Hadoop integration

And there are many differences:
* Currently IndexR is not a standalone database; it needs a query engine like Drill to provide query ability. Kudu can be used independently.
* According to its documentation, Kudu provides “competitive random access performance”. IndexR makes no such promise. It doesn’t have a primary index, and the rough set index is not designed for that. IndexR is not designed for OLTP.
* IndexR does not support updates, while Kudu does. During realtime ingestion, IndexR accepts row events and simply appends them to the latest segments, possibly rearranging or merging them for performance. This greatly reduces system complexity and lets IndexR focus on achieving as much performance as possible. IndexR offers very high ingestion and scan speed with relatively small memory/CPU/IO cost.
* IndexR stores data on HDFS, and that data is queryable and manageable by Hive. This means we can do online and offline analysis on the same data.
* etc.
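
The append-only ingestion described in the list above can be sketched roughly as follows. This is only an illustration in Python, not IndexR's actual code: the names `Segment`, `AppendOnlyTable`, `ingest` and `merge`, and the tiny segment size, are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

Row = Dict[str, int]


@dataclass
class Segment:
    """An append-only batch of rows; rows are never updated in place."""
    rows: List[Row] = field(default_factory=list)


class AppendOnlyTable:
    """Hypothetical table made of append-only segments (illustration only)."""

    def __init__(self, max_rows_per_segment: int = 3) -> None:
        self.max_rows = max_rows_per_segment
        self.segments: List[Segment] = [Segment()]

    def ingest(self, row: Row) -> None:
        # New events only ever go to the latest segment; a full segment is
        # left untouched and a fresh one is started.
        if len(self.segments[-1].rows) >= self.max_rows:
            self.segments.append(Segment())
        self.segments[-1].rows.append(row)

    def merge(self, sort_key: str) -> None:
        # Rearrangement for performance: combine all segments into a single
        # segment whose rows are sorted by sort_key.
        merged = sorted(
            (row for seg in self.segments for row in seg.rows),
            key=lambda r: r[sort_key],
        )
        self.segments = [Segment(merged)]

    def scan(self) -> List[Row]:
        return [row for seg in self.segments for row in seg.rows]


table = AppendOnlyTable(max_rows_per_segment=2)
for ts in (5, 3, 9, 1, 7):
    table.ingest({"ts": ts, "clicks": 1})
segments_before_merge = len(table.segments)
table.merge(sort_key="ts")
```

Because no row is ever modified after it is written, ingestion reduces to an append and a merge reduces to rewriting whole segments, which is the kind of simplification the comparison above refers to.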

I have heard that some teams use Kudu as a write-optimized format and Parquet as a read-optimized format, combine them with Impala, and constantly transform data from one to the other with MapReduce. This process and mechanism are already implemented in IndexR.

Regards
Flow Wei




Re: IndexR, a new storage plugin for Drill

Posted by Robin Moffatt <ro...@rittmanmead.com>.

Hi Flow Wei,
This looks pretty interesting. Any comments on comparison of indexR with
Apache Kudu?

thanks, Robin.


Re: IndexR, a new storage plugin for Drill

Posted by WeiWan <we...@sunteng.com>.

Hi Nicolas,

> 1) Do both Drill and Hive support predicate pushdown with IndexR? I mean
> using the indexes and not scanning the table.

Of course, we support predicate pushdown.
IndexR implements a special index called the Rough Set Index, which is well suited to statistical queries. It can effectively filter out irrelevant data chunks and costs very little compared with other forms of index. The idea originally comes from Infobright (ICE). I'm sure you can find many useful links by googling “infobright rough set”. In some respects you can think of IndexR as another Infobright that is open source, distributed, runs on Hadoop, and supports realtime ingestion.
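
To illustrate how a rough index can filter out irrelevant chunks cheaply, here is a minimal Python sketch. It assumes each chunk keeps only the minimum and maximum of a column, which is one common form of coarse per-chunk metadata; the names `Chunk`, `classify` and `count_matches` are hypothetical and this is not IndexR's actual implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    values: List[int]

    def __post_init__(self) -> None:
        # Coarse per-chunk metadata, computed once when the chunk is written.
        self.min = min(self.values)
        self.max = max(self.values)


def classify(chunk: Chunk, low: int, high: int) -> str:
    """Classify a chunk against the predicate low <= value <= high."""
    if chunk.max < low or chunk.min > high:
        return "irrelevant"  # no row can match: skip without reading the data
    if low <= chunk.min and chunk.max <= high:
        return "relevant"    # every row matches: no per-row check is needed
    return "suspect"         # some rows may match: the chunk must be scanned


def count_matches(chunks: List[Chunk], low: int, high: int) -> int:
    total = 0
    for chunk in chunks:
        kind = classify(chunk, low, high)
        if kind == "relevant":
            total += len(chunk.values)
        elif kind == "suspect":
            total += sum(low <= v <= high for v in chunk.values)
    return total


chunks = [
    Chunk([1, 4, 7]),
    Chunk([10, 12, 15]),
    Chunk([18, 25, 30]),
    Chunk([40, 41, 45]),
]
labels = [classify(c, 10, 20) for c in chunks]
matches = count_matches(chunks, 10, 20)
```

In this sketch only the "suspect" chunk is actually scanned; the index itself is just two numbers per chunk, which is why such an index costs little to build and store.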


> 2) Does it support join pushdown, sort, etc.?

It does not. Those jobs should be done by the query layer, i.e. Drill.
But we do hope Drill can support aggregation pushdown, which can really speed up queries in cases like “select count(*), sum(a), max(b) from table”.
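
As a rough illustration of what aggregation pushdown would mean for the query quoted above, the following Python sketch computes partial aggregates per chunk on the storage side and merges them on the query side. The function names and the in-memory data layout are hypothetical; this does not reflect Drill's planner or any existing IndexR API.

```python
from typing import Dict, List, Tuple

Row = Dict[str, int]
Partial = Tuple[int, int, int]  # (count, sum of a, max of b)


def partial_aggregate(chunk: List[Row]) -> Partial:
    # Would run in the storage layer, next to the data, so only three
    # numbers per chunk have to be sent to the query engine.
    return (
        len(chunk),
        sum(r["a"] for r in chunk),
        max(r["b"] for r in chunk),
    )


def merge_partials(partials: List[Partial]) -> Partial:
    # Would run in the query layer: count and sum are added up,
    # max is taken over the per-chunk maxima.
    return (
        sum(p[0] for p in partials),
        sum(p[1] for p in partials),
        max(p[2] for p in partials),
    )


chunks = [
    [{"a": 1, "b": 10}, {"a": 2, "b": 30}],
    [{"a": 3, "b": 20}],
    [{"a": 4, "b": 5}, {"a": 5, "b": 25}],
]
result = merge_partials([partial_aggregate(c) for c in chunks])
```

Without pushdown, every row would have to be handed to the query engine before it could be counted, summed, or compared.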

> 3) Can you elaborate on why your team chose Drill over equivalents (Impala,
> Presto…)?

We are not very familiar with Impala or Presto, but we did try Spark. We didn’t choose Spark because at that time, early 2016, Spark’s scanner API was not stable enough, and we needed the processes to run on local machines instead of on YARN. And most of all, we love Drill for its stability, efficiency, simplicity, and its nice storage plugin interface.

Regards
Flow Wei




Re: IndexR, a new storage plugin for Drill

Posted by Nicolas Paris <ni...@gmail.com>.

Hi Weiwan,

1) Do both Drill and Hive support predicate pushdown with IndexR? I mean
using the indexes and not scanning the table.
2) Does it support join pushdown, sort, etc.?
3) Can you elaborate on why your team chose Drill over equivalents (Impala,
Presto...)?

Thanks!




Re: IndexR, a new storage plugin for Drill

Posted by WeiWan <we...@sunteng.com>.

Hi,

It will take some time for the IndexR plugin to be merged into Drill, but you can already try it out by following these documents.

Compilation: https://github.com/shunfei/indexr/wiki/Compilation
Deployment: https://github.com/shunfei/indexr/wiki/Deployment
User Guide: https://github.com/shunfei/indexr/wiki/User-Guide
Regards
Flow Wei




Re: IndexR, a new storage plugin for Drill

Posted by Chunhui Shi <cs...@maprtech.com>.

Congratulations to the IndexR team! What is your plan for adding the IndexR storage
plugin to Drill? I cannot wait to try it out with Drill.




Re: IndexR, a new storage plugin for Drill

Posted by Jinfeng Ni <jn...@apache.org>.

Forwarding to the Drill dev list. People on the dev list might be interested in
this as well.




Re: IndexR, a new storage plugin for Drill

Posted by Jinfeng Ni <jn...@apache.org>.

Looks like IndexR is a very interesting storage plugin. Although I have
not looked into the details, I'm looking forward to seeing the PR and
hopefully getting this into Drill!

Thanks,

Jinfeng


On Tue, Jan 3, 2017 at 7:30 AM, WeiWan <we...@sunteng.com> wrote:
> Hi Charles,
>
> It would be great if the IndexR plugin could be merged into the official Drill project. I will run some more tests against the latest Drill version and submit a PR.
>
> Regards
> Flow Wei

Re: IndexR, a new storage plugin for Drill

Posted by WeiWan <we...@sunteng.com>.

Hi Charles, 

It would be great if the IndexR plugin could be merged into the official Drill project. I will run some more tests against the latest Drill version and submit a PR.

Regards
Flow Wei



> On Jan 3, 2017, at 23:18, Charles Givre <cg...@gmail.com> wrote:
> 
> This sounds really interesting.  Will you be submitting a PR to integrate this into the main Drill codebase?
> — C

Re: IndexR, a new storage plugin for Drill

Posted by Charles Givre <cg...@gmail.com>.

This sounds really interesting.  Will you be submitting a PR to integrate this into the main Drill codebase?
— C


Re: IndexR, a new storage plugin for Drill

Posted by John Omernik <jo...@omernik.com>.

This looks very interesting! Can't wait to see some how-tos for getting the
server nodes and Kafka pipelines set up. I'd be very interested in
trying this once it's set up.

Thanks!


