Posted to dev@carbondata.apache.org by jackylk <gi...@git.apache.org> on 2016/10/03 16:31:11 UTC

[GitHub] incubator-carbondata pull request #208: [CARBONDATA-284][WIP] Abstracting in...

GitHub user jackylk opened a pull request:

    https://github.com/apache/incubator-carbondata/pull/208

    [CARBONDATA-284][WIP] Abstracting index and segment interface

    This PR adds a new User API and Dev API for the carbon-hadoop module:
    
    ### User API
    - `CarbonColumnarInputFormat/OutputFormat`: uses the current `CarbonInputFormat` as its internal implementation.
    - `CarbonRowInputFormat/OutputFormat`: still to be implemented
    - `CarbonOutputCommitter`: used for managing segment commits
    
    All of them are based on `CarbonInputFormatBase/OutputFormatBase`.
    
    ### Dev API
    - `Segment`: an abstract class representing a single load of data. It is used by `CarbonInputFormatBase` to get all `InputSplit`s matching the `QueryModel`, and by `CarbonOutputCommitter` to prepare the segment for reading. Example implementations are `IndexedSegment` and `StreamingSegment`.
    - `SegmentManager`: an interface to manage segments. The current implementation is `ZkSegmentManager`, which still needs to be mapped to the existing logic.
    - `Index`: an interface used by `IndexedSegment` to filter `InputSplit`s. The current implementation is `InMemoryBTreeIndex`, which loads the index into the driver's memory.
    
    `CarbonInputFormatUtil` is modified so that it can also be used by `CarbonColumnarInputFormat`. A rough sketch of the Dev API interfaces is shown below.
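
    For illustration, the Dev API described above could be sketched roughly as follows. The method names and signatures are assumptions inferred from this description (each type would live in its own file); they are not the exact code in this PR:

    import java.io.IOException;
    import java.util.List;
    import org.apache.carbondata.hadoop.internal.index.Block;
    import org.apache.carbondata.scan.filter.resolver.FilterResolverIntf;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;

    // a single load of data; concrete examples are IndexedSegment and StreamingSegment
    public abstract class Segment {
      protected String name;   // segment (load) name
      protected String path;   // location of the segment on storage

      public Segment(String name, String path) {
        this.name = name;
        this.path = path;
      }

      // return the InputSplits of this segment that match the given filter
      public abstract List<InputSplit> getSplits(JobContext job,
          FilterResolverIntf filterResolver) throws IOException;
    }

    // manages segments; the actual method set of ZkSegmentManager may differ
    public interface SegmentManager {
      List<Segment> getAllValidSegments();                      // segments visible for reading
      void commitSegment(Segment segment) throws IOException;   // used by CarbonOutputCommitter
    }

    // prunes data of one segment; the current implementation is InMemoryBTreeIndex
    public interface Index {
      String getName();
      List<Block> filter(JobContext job, FilterResolverIntf filter) throws IOException;
    }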

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jackylk/incubator-carbondata index-interface

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-carbondata/pull/208.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #208
    
----
commit 398d2ec3e6706c615918a734a90f9dc4111067d8
Author: jackylk <ja...@huawei.com>
Date:   2016-10-03T16:01:48Z

    add User API

commit 1d92a00403faeebc09bf595ba11b3e55d4c997f2
Author: jackylk <ja...@huawei.com>
Date:   2016-10-03T16:02:04Z

    add Developer API

commit 1812a0a68b53ba5d48fc030e2a59329b0e827b05
Author: jackylk <ja...@huawei.com>
Date:   2016-10-03T16:02:49Z

    refactor existing code

commit 430e7710b88725b587c1f3542d4d66ab02958cbc
Author: jackylk <ja...@huawei.com>
Date:   2016-10-03T16:27:10Z

    change Index interface

----



[GitHub] incubator-carbondata pull request #208: [CARBONDATA-284] Abstracting index a...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/208#discussion_r85470293
  
    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/internal/segment/impl/IndexedSegment.java ---
    @@ -0,0 +1,73 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +
    +package org.apache.carbondata.hadoop.internal.segment.impl;
    +
    +import java.io.IOException;
    +import java.util.LinkedList;
    +import java.util.List;
    +
    +import org.apache.carbondata.hadoop.CarbonInputSplit;
    +import org.apache.carbondata.hadoop.internal.index.Block;
    +import org.apache.carbondata.hadoop.internal.segment.Segment;
    +import org.apache.carbondata.hadoop.internal.index.Index;
    +import org.apache.carbondata.hadoop.internal.index.IndexLoader;
    +import org.apache.carbondata.scan.filter.resolver.FilterResolverIntf;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.JobContext;
    +
    +/**
    + * This segment is backed by index, thus getSplits can use the index to do file pruning.
    + */
    +public class IndexedSegment extends Segment {
    +
    +  private IndexLoader loader;
    +
    +  public IndexedSegment(String name, String path, IndexLoader loader) {
    +    super(name, path);
    +    this.loader = loader;
    +  }
    +
    +  @Override
    +  public List<InputSplit> getSplits(JobContext job, FilterResolverIntf filterResolver)
    +      throws IOException {
    +    // do as following
    +    // 1. create the index or get from cache by the filter name in the configuration
    +    // 2. filter by index to get the filtered block
    +    // 3. create input split from filtered block
    +
    +    List<InputSplit> output = new LinkedList<>();
    +    Index index = loader.load(job.getConfiguration());
    --- End diff --
    
    If the loader internally implements a cache, then we can keep it as `IndexLoader` only.
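
    For example, a loader with an internal cache might look roughly like this (a hypothetical sketch, not code from this PR; it assumes `IndexLoader` is an interface with a `load(Configuration)` method, and the configuration key is made up for illustration):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.hadoop.conf.Configuration;

    class CachingIndexLoader implements IndexLoader {
      // cache of already-loaded indexes, keyed by the index name in the configuration
      private final Map<String, Index> cache = new ConcurrentHashMap<>();

      @Override
      public Index load(Configuration conf) {
        String key = conf.get("carbon.index.name", "default");   // hypothetical property
        return cache.computeIfAbsent(key, k -> createIndex(conf));
      }

      private Index createIndex(Configuration conf) {
        // build the concrete index here, e.g. an InMemoryBTreeIndex for the segment
        throw new UnsupportedOperationException("index construction omitted in this sketch");
      }
    }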



[GitHub] incubator-carbondata pull request #208: [CARBONDATA-284][WIP] Abstracting in...

Posted by JihongMA <gi...@git.apache.org>.
Github user JihongMA commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/208#discussion_r82487685
  
    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/api/columnar/CarbonColumnarInputFormat.java ---
    @@ -0,0 +1,56 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +
    +package org.apache.carbondata.hadoop.api.columnar;
    +
    +import java.io.IOException;
    +
    +import org.apache.carbondata.hadoop.CarbonInputFormat;
    +import org.apache.carbondata.hadoop.CarbonProjection;
    +import org.apache.carbondata.hadoop.api.CarbonInputFormatBase;
    +import org.apache.carbondata.hadoop.internal.segment.SegmentManager;
    +import org.apache.carbondata.hadoop.util.CarbonInputFormatUtil;
    +import org.apache.carbondata.scan.expression.Expression;
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +public class CarbonColumnarInputFormat<T> extends CarbonInputFormatBase<T> {
    +  private CarbonInputFormat<T> wrap;
    --- End diff --
    
    Suggest changing to a meaningful name: wrap --> carbonInputFormat



[GitHub] incubator-carbondata pull request #208: [CARBONDATA-284][WIP] Abstracting in...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/208#discussion_r82447352
  
    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/internal/index/memory/InMemoryBTreeIndex.java ---
    @@ -0,0 +1,214 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +
    +package org.apache.carbondata.hadoop.internal.index.memory;
    +
    +import java.io.IOException;
    +import java.util.ArrayList;
    +import java.util.HashMap;
    +import java.util.LinkedList;
    +import java.util.List;
    +import java.util.Map;
    +
    +import org.apache.carbondata.core.carbon.AbsoluteTableIdentifier;
    +import org.apache.carbondata.core.carbon.datastore.DataRefNode;
    +import org.apache.carbondata.core.carbon.datastore.DataRefNodeFinder;
    +import org.apache.carbondata.core.carbon.datastore.IndexKey;
    +import org.apache.carbondata.core.carbon.datastore.SegmentTaskIndexStore;
    +import org.apache.carbondata.core.carbon.datastore.block.AbstractIndex;
    +import org.apache.carbondata.core.carbon.datastore.block.BlockletInfos;
    +import org.apache.carbondata.core.carbon.datastore.block.SegmentProperties;
    +import org.apache.carbondata.core.carbon.datastore.block.TableBlockInfo;
    +import org.apache.carbondata.core.carbon.datastore.exception.IndexBuilderException;
    +import org.apache.carbondata.core.carbon.datastore.impl.btree.BTreeDataRefNodeFinder;
    +import org.apache.carbondata.core.carbon.datastore.impl.btree.BlockBTreeLeafNode;
    +import org.apache.carbondata.core.carbon.querystatistics.QueryStatistic;
    +import org.apache.carbondata.core.carbon.querystatistics.QueryStatisticsConstants;
    +import org.apache.carbondata.core.carbon.querystatistics.QueryStatisticsRecorder;
    +import org.apache.carbondata.core.keygenerator.KeyGenException;
    +import org.apache.carbondata.hadoop.CarbonInputSplit;
    +import org.apache.carbondata.hadoop.internal.index.Index;
    +import org.apache.carbondata.hadoop.internal.segment.Segment;
    +import org.apache.carbondata.hadoop.util.CarbonInputFormatUtil;
    +import org.apache.carbondata.scan.executor.exception.QueryExecutionException;
    +import org.apache.carbondata.scan.filter.FilterExpressionProcessor;
    +import org.apache.carbondata.scan.filter.FilterUtil;
    +import org.apache.carbondata.scan.filter.resolver.FilterResolverIntf;
    +import org.apache.commons.logging.Log;
    +import org.apache.commons.logging.LogFactory;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.JobContext;
    +
    +class InMemoryBTreeIndex implements Index {
    --- End diff --
    
    Is this class used for pruning blocks based on a secondary index?



[GitHub] incubator-carbondata pull request #208: [CARBONDATA-284] Abstracting index a...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/208#discussion_r85457078
  
    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java ---
    @@ -0,0 +1,169 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +
    +package org.apache.carbondata.hadoop.api;
    +
    +import java.io.IOException;
    +import java.util.LinkedList;
    +import java.util.List;
    +
    +import org.apache.carbondata.hadoop.CarbonProjection;
    +import org.apache.carbondata.hadoop.internal.CarbonInputSplit;
    +import org.apache.carbondata.hadoop.internal.segment.Segment;
    +import org.apache.carbondata.hadoop.internal.segment.SegmentManager;
    +import org.apache.carbondata.hadoop.util.CarbonInputFormatUtil;
    +import org.apache.carbondata.hadoop.util.ObjectSerializationUtil;
    +import org.apache.carbondata.scan.expression.Expression;
    +import org.apache.carbondata.scan.filter.resolver.FilterResolverIntf;
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.JobContext;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    +
    +/**
    + * Input format of CarbonData file.
    + * @param <T>
    + */
    +public class CarbonTableInputFormat<T> extends FileInputFormat<Void, T> {
    +
    +  private static final String FILTER_PREDICATE =
    +      "mapreduce.input.carboninputformat.filter.predicate";
    +
    +  private SegmentManager segmentManager;
    +
    +  public CarbonTableInputFormat(SegmentManager segmentManager) {
    --- End diff --
    
    accept



[GitHub] incubator-carbondata pull request #208: [CARBONDATA-284] Abstracting index a...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/208#discussion_r85340545
  
    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/internal/index/impl/InMemoryBTreeIndex.java ---
    @@ -0,0 +1,215 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +
    +package org.apache.carbondata.hadoop.internal.index.impl;
    +
    +import java.io.IOException;
    +import java.util.ArrayList;
    +import java.util.HashMap;
    +import java.util.LinkedList;
    +import java.util.List;
    +import java.util.Map;
    +
    +import org.apache.carbondata.core.carbon.AbsoluteTableIdentifier;
    +import org.apache.carbondata.core.carbon.datastore.DataRefNode;
    +import org.apache.carbondata.core.carbon.datastore.DataRefNodeFinder;
    +import org.apache.carbondata.core.carbon.datastore.IndexKey;
    +import org.apache.carbondata.core.carbon.datastore.SegmentTaskIndexStore;
    +import org.apache.carbondata.core.carbon.datastore.block.AbstractIndex;
    +import org.apache.carbondata.core.carbon.datastore.block.BlockletInfos;
    +import org.apache.carbondata.core.carbon.datastore.block.SegmentProperties;
    +import org.apache.carbondata.core.carbon.datastore.block.TableBlockInfo;
    +import org.apache.carbondata.core.carbon.datastore.exception.IndexBuilderException;
    +import org.apache.carbondata.core.carbon.datastore.impl.btree.BTreeDataRefNodeFinder;
    +import org.apache.carbondata.core.carbon.datastore.impl.btree.BlockBTreeLeafNode;
    +import org.apache.carbondata.core.carbon.querystatistics.QueryStatistic;
    +import org.apache.carbondata.core.carbon.querystatistics.QueryStatisticsConstants;
    +import org.apache.carbondata.core.carbon.querystatistics.QueryStatisticsRecorder;
    +import org.apache.carbondata.core.keygenerator.KeyGenException;
    +import org.apache.carbondata.core.util.CarbonTimeStatisticsFactory;
    +import org.apache.carbondata.hadoop.CarbonInputSplit;
    +import org.apache.carbondata.hadoop.internal.index.Index;
    +import org.apache.carbondata.hadoop.internal.segment.Segment;
    +import org.apache.carbondata.scan.executor.exception.QueryExecutionException;
    +import org.apache.carbondata.scan.filter.FilterExpressionProcessor;
    +import org.apache.carbondata.scan.filter.FilterUtil;
    +import org.apache.carbondata.scan.filter.resolver.FilterResolverIntf;
    +import org.apache.commons.logging.Log;
    +import org.apache.commons.logging.LogFactory;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.JobContext;
    +
    +class InMemoryBTreeIndex implements Index {
    +
    +  private static final Log LOG = LogFactory.getLog(InMemoryBTreeIndex.class);
    +  private Segment segment;
    +
    +  InMemoryBTreeIndex(Segment segment) {
    +    this.segment = segment;
    +  }
    +
    +  @Override
    +  public String getName() {
    +    return null;
    +  }
    +
    +  @Override
    +  public List<InputSplit> filter(JobContext job, FilterResolverIntf filter)
    --- End diff --
    
    It seems the method return type is incompatible.



[GitHub] incubator-carbondata pull request #208: [CARBONDATA-284] Abstracting index a...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/208#discussion_r85346673
  
    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/internal/segment/impl/IndexedSegment.java ---
    @@ -0,0 +1,73 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +
    +package org.apache.carbondata.hadoop.internal.segment.impl;
    +
    +import java.io.IOException;
    +import java.util.LinkedList;
    +import java.util.List;
    +
    +import org.apache.carbondata.hadoop.CarbonInputSplit;
    +import org.apache.carbondata.hadoop.internal.index.Block;
    +import org.apache.carbondata.hadoop.internal.segment.Segment;
    +import org.apache.carbondata.hadoop.internal.index.Index;
    +import org.apache.carbondata.hadoop.internal.index.IndexLoader;
    +import org.apache.carbondata.scan.filter.resolver.FilterResolverIntf;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.JobContext;
    +
    +/**
    + * This segment is backed by index, thus getSplits can use the index to do file pruning.
    + */
    +public class IndexedSegment extends Segment {
    +
    +  private IndexLoader loader;
    +
    +  public IndexedSegment(String name, String path, IndexLoader loader) {
    +    super(name, path);
    +    this.loader = loader;
    +  }
    +
    +  @Override
    +  public List<InputSplit> getSplits(JobContext job, FilterResolverIntf filterResolver)
    +      throws IOException {
    +    // do as following
    +    // 1. create the index or get from cache by the filter name in the configuration
    +    // 2. filter by index to get the filtered block
    +    // 3. create input split from filtered block
    +
    +    List<InputSplit> output = new LinkedList<>();
    +    Index index = loader.load(job.getConfiguration());
    --- End diff --
    
    Is it required to load the index every time?
    I guess we are just creating an instance of the index here, so why not use a factory here?



[GitHub] incubator-carbondata pull request #208: [CARBONDATA-284] Abstracting index a...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/208#discussion_r85457738
  
    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/internal/segment/impl/IndexedSegment.java ---
    @@ -0,0 +1,73 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +
    +package org.apache.carbondata.hadoop.internal.segment.impl;
    +
    +import java.io.IOException;
    +import java.util.LinkedList;
    +import java.util.List;
    +
    +import org.apache.carbondata.hadoop.CarbonInputSplit;
    +import org.apache.carbondata.hadoop.internal.index.Block;
    +import org.apache.carbondata.hadoop.internal.segment.Segment;
    +import org.apache.carbondata.hadoop.internal.index.Index;
    +import org.apache.carbondata.hadoop.internal.index.IndexLoader;
    +import org.apache.carbondata.scan.filter.resolver.FilterResolverIntf;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.JobContext;
    +
    +/**
    + * This segment is backed by index, thus getSplits can use the index to do file pruning.
    + */
    +public class IndexedSegment extends Segment {
    +
    +  private IndexLoader loader;
    +
    +  public IndexedSegment(String name, String path, IndexLoader loader) {
    +    super(name, path);
    +    this.loader = loader;
    +  }
    +
    +  @Override
    +  public List<InputSplit> getSplits(JobContext job, FilterResolverIntf filterResolver)
    +      throws IOException {
    +    // do as following
    +    // 1. create the index or get from cache by the filter name in the configuration
    +    // 2. filter by index to get the filtered block
    +    // 3. create input split from filtered block
    +
    +    List<InputSplit> output = new LinkedList<>();
    +    Index index = loader.load(job.getConfiguration());
    --- End diff --
    
    Yes, actually the IndexLoader is the factory; the `load` function has the same functionality as `getInstance` in a factory. So shall I keep the IndexLoader name?



[GitHub] incubator-carbondata pull request #208: [CARBONDATA-284] Abstracting index a...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/208#discussion_r85339106
  
    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/internal/CarbonFormat.java ---
    @@ -0,0 +1,24 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +
    +package org.apache.carbondata.hadoop.internal;
    +
    +public enum CarbonFormat {
    +  COLUMNR
    --- End diff --
    
    typo: should be COLUMNAR



[GitHub] incubator-carbondata pull request #208: [CARBONDATA-284] Abstracting index a...

Posted by QiangCai <gi...@git.apache.org>.
Github user QiangCai commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/208#discussion_r85464310
  
    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/internal/segment/impl/IndexedSegment.java ---
    @@ -0,0 +1,73 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +
    +package org.apache.carbondata.hadoop.internal.segment.impl;
    +
    +import java.io.IOException;
    +import java.util.LinkedList;
    +import java.util.List;
    +
    +import org.apache.carbondata.hadoop.CarbonInputSplit;
    --- End diff --
    
    please use internal.CarbonInputSplit



[GitHub] incubator-carbondata pull request #208: [CARBONDATA-284][WIP] Abstracting in...

Posted by QiangCai <gi...@git.apache.org>.
Github user QiangCai commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/208#discussion_r85061184
  
    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/internal/index/memory/InMemoryBTreeIndex.java ---
    @@ -0,0 +1,220 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +
    +package org.apache.carbondata.hadoop.internal.index.memory;
    +
    +import java.io.IOException;
    +import java.util.ArrayList;
    +import java.util.HashMap;
    +import java.util.LinkedList;
    +import java.util.List;
    +import java.util.Map;
    +
    +import org.apache.carbondata.core.carbon.AbsoluteTableIdentifier;
    +import org.apache.carbondata.core.carbon.datastore.DataRefNode;
    +import org.apache.carbondata.core.carbon.datastore.DataRefNodeFinder;
    +import org.apache.carbondata.core.carbon.datastore.IndexKey;
    +import org.apache.carbondata.core.carbon.datastore.SegmentTaskIndexStore;
    +import org.apache.carbondata.core.carbon.datastore.block.AbstractIndex;
    +import org.apache.carbondata.core.carbon.datastore.block.BlockletInfos;
    +import org.apache.carbondata.core.carbon.datastore.block.SegmentProperties;
    +import org.apache.carbondata.core.carbon.datastore.block.TableBlockInfo;
    +import org.apache.carbondata.core.carbon.datastore.exception.IndexBuilderException;
    +import org.apache.carbondata.core.carbon.datastore.impl.btree.BTreeDataRefNodeFinder;
    +import org.apache.carbondata.core.carbon.datastore.impl.btree.BlockBTreeLeafNode;
    +import org.apache.carbondata.core.carbon.querystatistics.QueryStatistic;
    +import org.apache.carbondata.core.carbon.querystatistics.QueryStatisticsConstants;
    +import org.apache.carbondata.core.carbon.querystatistics.QueryStatisticsRecorder;
    +import org.apache.carbondata.core.keygenerator.KeyGenException;
    +import org.apache.carbondata.core.util.CarbonTimeStatisticsFactory;
    +import org.apache.carbondata.hadoop.CarbonInputSplit;
    +import org.apache.carbondata.hadoop.internal.index.Index;
    +import org.apache.carbondata.hadoop.internal.segment.Segment;
    +import org.apache.carbondata.hadoop.util.CarbonInputFormatUtil;
    +import org.apache.carbondata.scan.executor.exception.QueryExecutionException;
    +import org.apache.carbondata.scan.filter.FilterExpressionProcessor;
    +import org.apache.carbondata.scan.filter.FilterUtil;
    +import org.apache.carbondata.scan.filter.resolver.FilterResolverIntf;
    +import org.apache.commons.logging.Log;
    +import org.apache.commons.logging.LogFactory;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.JobContext;
    +
    +class InMemoryBTreeIndex implements Index {
    +
    +  private static final Log LOG = LogFactory.getLog(InMemoryBTreeIndex.class);
    +  private Segment segment;
    +
    +  InMemoryBTreeIndex(Segment segment) {
    +    this.segment = segment;
    +  }
    +
    +  @Override
    +  public String getName() {
    +    return null;
    +  }
    +
    +  @Override
    +  public List<InputSplit> filter(JobContext job, FilterResolverIntf filter)
    +      throws IOException {
    +
    +    List<InputSplit> result = new LinkedList<InputSplit>();
    +
    +    FilterExpressionProcessor filterExpressionProcessor = new FilterExpressionProcessor();
    +
    +    AbsoluteTableIdentifier absoluteTableIdentifier = null;
    +        //CarbonInputFormatUtil.getAbsoluteTableIdentifier(job.getConfiguration());
    +
    +    //for this segment fetch blocks matching filter in BTree
    +    List<DataRefNode> dataRefNodes = null;
    +    try {
    +      dataRefNodes = getDataBlocksOfSegment(job, filterExpressionProcessor, absoluteTableIdentifier,
    +          filter, segment.getId());
    +    } catch (IndexBuilderException e) {
    +      throw new IOException(e.getMessage());
    +    }
    +    for (DataRefNode dataRefNode : dataRefNodes) {
    +      BlockBTreeLeafNode leafNode = (BlockBTreeLeafNode) dataRefNode;
    +      TableBlockInfo tableBlockInfo = leafNode.getTableBlockInfo();
    +      result.add(new CarbonInputSplit(segment.getId(), new Path(tableBlockInfo.getFilePath()),
    +          tableBlockInfo.getBlockOffset(), tableBlockInfo.getBlockLength(),
    +          tableBlockInfo.getLocations(), tableBlockInfo.getBlockletInfos().getNoOfBlockLets()));
    +    }
    +    return result;
    +  }
    +
    +  private Map<String, AbstractIndex> getSegmentAbstractIndexs(JobContext job,
    +      AbsoluteTableIdentifier identifier, String segmentId)
    --- End diff --
    
    Remove the segmentId parameter and directly use segment.id in this class.



[GitHub] incubator-carbondata pull request #208: [CARBONDATA-284][WIP] Abstracting in...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/208#discussion_r82505582
  
    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/internal/segment/StreamingSegment.java ---
    @@ -0,0 +1,45 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +
    +package org.apache.carbondata.hadoop.internal.segment;
    +
    +import java.io.IOException;
    +import java.util.List;
    +
    +import org.apache.carbondata.hadoop.api.CarbonInputFormatBase;
    +import org.apache.carbondata.scan.model.QueryModel;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.JobContext;
    +
    +public class StreamingSegment extends Segment {
    --- End diff --
    
    If I understand the comment correctly, the answer is that all segments are handled uniformly in `CarbonInputFormatBase`; however, the internal read implementation of this segment is different from `IndexedSegment`: it uses the row input format to read. Is this the question?
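
    For illustration, the uniform handling might look roughly like this inside `CarbonInputFormatBase` (a hypothetical sketch based on the description above; `getFilterResolver` and `getAllValidSegments` are assumed helpers, not necessarily the real method names):

    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
      FilterResolverIntf filter = getFilterResolver(job.getConfiguration());
      List<InputSplit> splits = new LinkedList<>();
      // every segment contributes its own splits, whether it is indexed or streaming
      for (Segment segment : segmentManager.getAllValidSegments()) {
        splits.addAll(segment.getSplits(job, filter));
      }
      return splits;
    }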



[GitHub] incubator-carbondata pull request #208: [CARBONDATA-284][WIP] Abstracting in...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/208#discussion_r82505527
  
    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/internal/segment/IndexedSegment.java ---
    @@ -0,0 +1,70 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +
    +package org.apache.carbondata.hadoop.internal.segment;
    +
    +import java.io.IOException;
    +import java.util.LinkedList;
    +import java.util.List;
    +
    +import org.apache.carbondata.hadoop.api.CarbonInputFormatBase;
    +import org.apache.carbondata.hadoop.internal.index.Index;
    +import org.apache.carbondata.hadoop.internal.index.IndexLoader;
    +import org.apache.carbondata.scan.filter.resolver.FilterResolverIntf;
    +import org.apache.carbondata.scan.model.QueryModel;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.JobContext;
    +
    +public class IndexedSegment extends Segment {
    --- End diff --
    
    Do you mean a non-indexed and non-streaming segment? Yes, I think we can have a segment like that; reading such a segment would then be a pure scan without any index.
    But what is the benefit of doing that?
    
    I think the real issue is how to let the optimizer decide when to use the index and when not to. I think I will create another `Estimator` interface to expose cost information to the optimizer, as sketched below. What do you think?
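
    A possible shape for such an interface (purely hypothetical, since it is only being proposed in this comment) could be:

    // hypothetical cost interface a Segment or Index could expose to the optimizer
    public interface Estimator {
      // estimated number of blocks that survive index pruning for the given filter
      long estimateSelectedBlocks(FilterResolverIntf filter);

      // estimated cost of using the index, relative to a full scan of the segment
      double estimateIndexCost(FilterResolverIntf filter);
    }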



[GitHub] incubator-carbondata pull request #208: [CARBONDATA-284][WIP] Abstracting in...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/208#discussion_r82505613
  
    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonInputFormatBase.java ---
    @@ -0,0 +1,69 @@
    +/*
    --- End diff --
    
    Yes, it is exactly like that.



[GitHub] incubator-carbondata pull request #208: [CARBONDATA-284] Abstracting index a...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-carbondata/pull/208



[GitHub] incubator-carbondata pull request #208: [CARBONDATA-284] Abstracting index a...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/208#discussion_r85480562
  
    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/internal/segment/impl/IndexedSegment.java ---
    @@ -0,0 +1,73 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +
    +package org.apache.carbondata.hadoop.internal.segment.impl;
    +
    +import java.io.IOException;
    +import java.util.LinkedList;
    +import java.util.List;
    +
    +import org.apache.carbondata.hadoop.CarbonInputSplit;
    +import org.apache.carbondata.hadoop.internal.index.Block;
    +import org.apache.carbondata.hadoop.internal.segment.Segment;
    +import org.apache.carbondata.hadoop.internal.index.Index;
    +import org.apache.carbondata.hadoop.internal.index.IndexLoader;
    +import org.apache.carbondata.scan.filter.resolver.FilterResolverIntf;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.JobContext;
    +
    +/**
    + * This segment is backed by index, thus getSplits can use the index to do file pruning.
    + */
    +public class IndexedSegment extends Segment {
    +
    +  private IndexLoader loader;
    +
    +  public IndexedSegment(String name, String path, IndexLoader loader) {
    +    super(name, path);
    +    this.loader = loader;
    +  }
    +
    +  @Override
    +  public List<InputSplit> getSplits(JobContext job, FilterResolverIntf filterResolver)
    +      throws IOException {
    +    // do as following
    +    // 1. create the index or get from cache by the filter name in the configuration
    +    // 2. filter by index to get the filtered block
    +    // 3. create input split from filtered block
    +
    +    List<InputSplit> output = new LinkedList<>();
    +    Index index = loader.load(job.getConfiguration());
    +    List<Block> blocks = index.filter(job, filterResolver);
    --- End diff --
    
    You are right, I will modify it.



[GitHub] incubator-carbondata pull request #208: [CARBONDATA-284] Abstracting index a...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/208#discussion_r85480694
  
    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/internal/segment/impl/IndexedSegment.java ---
    @@ -0,0 +1,73 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +
    +package org.apache.carbondata.hadoop.internal.segment.impl;
    +
    +import java.io.IOException;
    +import java.util.LinkedList;
    +import java.util.List;
    +
    +import org.apache.carbondata.hadoop.CarbonInputSplit;
    --- End diff --
    
    ok



[GitHub] incubator-carbondata pull request #208: [CARBONDATA-284][WIP] Abstracting in...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/208#discussion_r82446169
  
    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/api/row/CarbonRowInputFormat.java ---
    @@ -0,0 +1,40 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +
    +package org.apache.carbondata.hadoop.api.row;
    +
    +import java.io.IOException;
    +
    +import org.apache.carbondata.hadoop.api.CarbonInputFormatBase;
    +import org.apache.carbondata.hadoop.internal.segment.SegmentManager;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +public class CarbonRowInputFormat<T> extends CarbonInputFormatBase<T> {
    --- End diff --
    
    Is it for loading data row-wise, like the old MOLAP data? If we don't have a plan to implement it now, it is better to remove it.



[GitHub] incubator-carbondata pull request #208: [CARBONDATA-284] Abstracting index a...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/208#discussion_r85470691
  
    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java ---
    @@ -0,0 +1,169 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +
    +package org.apache.carbondata.hadoop.api;
    +
    +import java.io.IOException;
    +import java.util.LinkedList;
    +import java.util.List;
    +
    +import org.apache.carbondata.hadoop.CarbonProjection;
    +import org.apache.carbondata.hadoop.internal.CarbonInputSplit;
    +import org.apache.carbondata.hadoop.internal.segment.Segment;
    +import org.apache.carbondata.hadoop.internal.segment.SegmentManager;
    +import org.apache.carbondata.hadoop.util.CarbonInputFormatUtil;
    +import org.apache.carbondata.hadoop.util.ObjectSerializationUtil;
    +import org.apache.carbondata.scan.expression.Expression;
    +import org.apache.carbondata.scan.filter.resolver.FilterResolverIntf;
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.JobContext;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    +
    +/**
    + * Input format of CarbonData file.
    + * @param <T>
    + */
    +public class CarbonTableInputFormat<T> extends FileInputFormat<Void, T> {
    +
    +  private static final String FILTER_PREDICATE =
    +      "mapreduce.input.carboninputformat.filter.predicate";
    +
    +  private SegmentManager segmentManager;
    +
    +  public CarbonTableInputFormat(SegmentManager segmentManager) {
    +    this.segmentManager = segmentManager;
    +  }
    +
    +  @Override
    +  public RecordReader<Void, T> createRecordReader(InputSplit split,
    +      TaskAttemptContext context) throws IOException, InterruptedException {
    +    switch (((CarbonInputSplit)split).formatType()) {
    --- End diff --
    
    ok



[GitHub] incubator-carbondata pull request #208: [CARBONDATA-284] Abstracting index a...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/208#discussion_r85337928
  
    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java ---
    @@ -0,0 +1,169 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +
    +package org.apache.carbondata.hadoop.api;
    +
    +import java.io.IOException;
    +import java.util.LinkedList;
    +import java.util.List;
    +
    +import org.apache.carbondata.hadoop.CarbonProjection;
    +import org.apache.carbondata.hadoop.internal.CarbonInputSplit;
    +import org.apache.carbondata.hadoop.internal.segment.Segment;
    +import org.apache.carbondata.hadoop.internal.segment.SegmentManager;
    +import org.apache.carbondata.hadoop.util.CarbonInputFormatUtil;
    +import org.apache.carbondata.hadoop.util.ObjectSerializationUtil;
    +import org.apache.carbondata.scan.expression.Expression;
    +import org.apache.carbondata.scan.filter.resolver.FilterResolverIntf;
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.JobContext;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    +
    +/**
    + * Input format of CarbonData file.
    + * @param <T>
    + */
    +public class CarbonTableInputFormat<T> extends FileInputFormat<Void, T> {
    +
    +  private static final String FILTER_PREDICATE =
    +      "mapreduce.input.carboninputformat.filter.predicate";
    +
    +  private SegmentManager segmentManager;
    +
    +  public CarbonTableInputFormat(SegmentManager segmentManager) {
    +    this.segmentManager = segmentManager;
    +  }
    +
    +  @Override
    +  public RecordReader<Void, T> createRecordReader(InputSplit split,
    +      TaskAttemptContext context) throws IOException, InterruptedException {
    +    switch (((CarbonInputSplit)split).formatType()) {
    --- End diff --
    
    Why don't you take the formatType from the job conf? It is better not to touch the InputSplit, as it comes from outside.
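
    A minimal sketch of that suggestion, assuming a hypothetical configuration
    key (the PR does not define this property): the format type is carried in
    the job conf instead of in the InputSplit.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.mapreduce.TaskAttemptContext;

        public final class FormatTypeConf {
          // hypothetical key, named here only for illustration
          private static final String FORMAT_TYPE =
              "mapreduce.input.carboninputformat.format.type";

          private FormatTypeConf() { }

          // the driver sets the type once when the job is configured
          public static void setFormatType(Configuration conf, String type) {
            conf.set(FORMAT_TYPE, type);
          }

          // createRecordReader can then read it from the task context,
          // defaulting to "columnar" when the job does not set it
          public static String getFormatType(TaskAttemptContext context) {
            return context.getConfiguration().get(FORMAT_TYPE, "columnar");
          }
        }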



[GitHub] incubator-carbondata pull request #208: [CARBONDATA-284][WIP] Abstracting in...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/208#discussion_r82505442
  
    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/api/row/CarbonRowInputFormat.java ---
    @@ -0,0 +1,40 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +
    +package org.apache.carbondata.hadoop.api.row;
    +
    +import java.io.IOException;
    +
    +import org.apache.carbondata.hadoop.api.CarbonInputFormatBase;
    +import org.apache.carbondata.hadoop.internal.segment.SegmentManager;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.RecordReader;
    +import org.apache.hadoop.mapreduce.TaskAttemptContext;
    +
    +public class CarbonRowInputFormat<T> extends CarbonInputFormatBase<T> {
    --- End diff --
    
    Row format is for streaming ingest. I can remove it now and submit it later as part of the streaming ingest feature.



[GitHub] incubator-carbondata pull request #208: [CARBONDATA-284][WIP] Abstracting in...

Posted by JihongMA <gi...@git.apache.org>.
Github user JihongMA commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/208#discussion_r82487219
  
    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonInputFormatBase.java ---
    @@ -0,0 +1,69 @@
    +/*
    --- End diff --
    
    Should we consolidate the interfaces as follows,
    CarbonInputFormatBase
          --> CarbonInputFormat
                      --> CarbonColumnarInputFormat
                      --> CarbonRowInputFormat
    or even further, merge CarbonInputFormat into CarbonInputFormatBase?
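
    A minimal sketch of the proposed hierarchy (class bodies are placeholders,
    not the actual PR code; the real record readers are omitted):

        import java.io.IOException;
        import org.apache.hadoop.mapreduce.InputSplit;
        import org.apache.hadoop.mapreduce.RecordReader;
        import org.apache.hadoop.mapreduce.TaskAttemptContext;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

        // common split planning and filter pushdown would live here
        public abstract class CarbonInputFormatBase<T> extends FileInputFormat<Void, T> {
        }

        // shared carbon record-reader plumbing
        abstract class CarbonInputFormat<T> extends CarbonInputFormatBase<T> {
        }

        // columnar read path
        class CarbonColumnarInputFormat<T> extends CarbonInputFormat<T> {
          @Override public RecordReader<Void, T> createRecordReader(InputSplit split,
              TaskAttemptContext context) throws IOException, InterruptedException {
            throw new UnsupportedOperationException("reader omitted in this sketch");
          }
        }

        // row (streaming ingest) read path
        class CarbonRowInputFormat<T> extends CarbonInputFormat<T> {
          @Override public RecordReader<Void, T> createRecordReader(InputSplit split,
              TaskAttemptContext context) throws IOException, InterruptedException {
            throw new UnsupportedOperationException("reader omitted in this sketch");
          }
        }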



[GitHub] incubator-carbondata pull request #208: [CARBONDATA-284][WIP] Abstracting in...

Posted by QiangCai <gi...@git.apache.org>.
Github user QiangCai commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/208#discussion_r85061025
  
    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/internal/index/memory/InMemoryBTreeIndex.java ---
    @@ -0,0 +1,214 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +
    +package org.apache.carbondata.hadoop.internal.index.memory;
    +
    +import java.io.IOException;
    +import java.util.ArrayList;
    +import java.util.HashMap;
    +import java.util.LinkedList;
    +import java.util.List;
    +import java.util.Map;
    +
    +import org.apache.carbondata.core.carbon.AbsoluteTableIdentifier;
    +import org.apache.carbondata.core.carbon.datastore.DataRefNode;
    +import org.apache.carbondata.core.carbon.datastore.DataRefNodeFinder;
    +import org.apache.carbondata.core.carbon.datastore.IndexKey;
    +import org.apache.carbondata.core.carbon.datastore.SegmentTaskIndexStore;
    +import org.apache.carbondata.core.carbon.datastore.block.AbstractIndex;
    +import org.apache.carbondata.core.carbon.datastore.block.BlockletInfos;
    +import org.apache.carbondata.core.carbon.datastore.block.SegmentProperties;
    +import org.apache.carbondata.core.carbon.datastore.block.TableBlockInfo;
    +import org.apache.carbondata.core.carbon.datastore.exception.IndexBuilderException;
    +import org.apache.carbondata.core.carbon.datastore.impl.btree.BTreeDataRefNodeFinder;
    +import org.apache.carbondata.core.carbon.datastore.impl.btree.BlockBTreeLeafNode;
    +import org.apache.carbondata.core.carbon.querystatistics.QueryStatistic;
    +import org.apache.carbondata.core.carbon.querystatistics.QueryStatisticsConstants;
    +import org.apache.carbondata.core.carbon.querystatistics.QueryStatisticsRecorder;
    +import org.apache.carbondata.core.keygenerator.KeyGenException;
    +import org.apache.carbondata.hadoop.CarbonInputSplit;
    +import org.apache.carbondata.hadoop.internal.index.Index;
    +import org.apache.carbondata.hadoop.internal.segment.Segment;
    +import org.apache.carbondata.hadoop.util.CarbonInputFormatUtil;
    +import org.apache.carbondata.scan.executor.exception.QueryExecutionException;
    +import org.apache.carbondata.scan.filter.FilterExpressionProcessor;
    +import org.apache.carbondata.scan.filter.FilterUtil;
    +import org.apache.carbondata.scan.filter.resolver.FilterResolverIntf;
    +import org.apache.commons.logging.Log;
    +import org.apache.commons.logging.LogFactory;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.JobContext;
    +
    +class InMemoryBTreeIndex implements Index {
    --- End diff --
    
    I understand InMemoryBTreeIndex is a segment-level index.
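
    A minimal sketch of that reading: one index instance is built in driver
    memory per segment, and the driver unions the surviving splits. The
    filter(...) and getAllValidSegments(...) calls below are assumptions for
    illustration, not the signatures in this PR:

        // hypothetical driver-side pruning loop
        List<InputSplit> splits = new ArrayList<>();
        for (Segment segment : segmentManager.getAllValidSegments()) {
          // one segment-level index, loaded into driver memory for that segment
          Index index = new InMemoryBTreeIndex(segment);
          // keep only the blocks that can match the filter
          splits.addAll(index.filter(job, filterResolver));
        }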



[GitHub] incubator-carbondata pull request #208: [CARBONDATA-284] Abstracting index a...

Posted by QiangCai <gi...@git.apache.org>.
Github user QiangCai commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/208#discussion_r85463723
  
    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/internal/segment/Segment.java ---
    @@ -0,0 +1,94 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +
    +package org.apache.carbondata.hadoop.internal.segment;
    +
    +import java.io.IOException;
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +import org.apache.carbondata.scan.filter.resolver.FilterResolverIntf;
    +import org.apache.hadoop.fs.FileStatus;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.JobContext;
    +
    +/**
    + * Within a carbon table, each data load becomes one Segment, which stores all data files belong to this load in
    + * the segment folder.
    + */
    +public abstract class Segment {
    +
    +  protected String id;
    +
    +  /**
    +   * Path of the segment folder
    +   */
    +  private String path;
    +
    +  public Segment(String id, String path) {
    +    this.id = id;
    +    this.path = path;
    +  }
    +
    +  public String getId() {
    +    return id;
    +  }
    +
    +  public String getPath() {
    +    return path;
    +  }
    +
    +  /**
    +   * return all InputSplit of this segment, each file is a InputSplit
    +   * @param job job context
    +   * @return all InputSplit
    +   * @throws IOException
    +   */
    +  public List<InputSplit> getAllSplits(JobContext job) throws IOException {
    --- End diff --
    
    I suggest returning List<CarbonInputSplit> instead.
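
    A minimal sketch of the suggested signature change (the body is a
    placeholder; the real file-enumeration logic stays as in the diff):

        /**
         * return all splits of this segment, each data file becomes one split
         */
        public List<CarbonInputSplit> getAllSplits(JobContext job) throws IOException {
          List<CarbonInputSplit> splits = new ArrayList<>();
          // walk the files under getPath() and wrap each one in a CarbonInputSplit,
          // as the current implementation does for the generic InputSplit
          return splits;
        }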



[GitHub] incubator-carbondata pull request #208: [CARBONDATA-284] Abstracting index a...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/208#discussion_r85343636
  
    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/internal/index/impl/InMemoryBTreeIndex.java ---
    @@ -0,0 +1,215 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +
    +package org.apache.carbondata.hadoop.internal.index.impl;
    +
    +import java.io.IOException;
    +import java.util.ArrayList;
    +import java.util.HashMap;
    +import java.util.LinkedList;
    +import java.util.List;
    +import java.util.Map;
    +
    +import org.apache.carbondata.core.carbon.AbsoluteTableIdentifier;
    +import org.apache.carbondata.core.carbon.datastore.DataRefNode;
    +import org.apache.carbondata.core.carbon.datastore.DataRefNodeFinder;
    +import org.apache.carbondata.core.carbon.datastore.IndexKey;
    +import org.apache.carbondata.core.carbon.datastore.SegmentTaskIndexStore;
    +import org.apache.carbondata.core.carbon.datastore.block.AbstractIndex;
    +import org.apache.carbondata.core.carbon.datastore.block.BlockletInfos;
    +import org.apache.carbondata.core.carbon.datastore.block.SegmentProperties;
    +import org.apache.carbondata.core.carbon.datastore.block.TableBlockInfo;
    +import org.apache.carbondata.core.carbon.datastore.exception.IndexBuilderException;
    +import org.apache.carbondata.core.carbon.datastore.impl.btree.BTreeDataRefNodeFinder;
    +import org.apache.carbondata.core.carbon.datastore.impl.btree.BlockBTreeLeafNode;
    +import org.apache.carbondata.core.carbon.querystatistics.QueryStatistic;
    +import org.apache.carbondata.core.carbon.querystatistics.QueryStatisticsConstants;
    +import org.apache.carbondata.core.carbon.querystatistics.QueryStatisticsRecorder;
    +import org.apache.carbondata.core.keygenerator.KeyGenException;
    +import org.apache.carbondata.core.util.CarbonTimeStatisticsFactory;
    +import org.apache.carbondata.hadoop.CarbonInputSplit;
    +import org.apache.carbondata.hadoop.internal.index.Index;
    +import org.apache.carbondata.hadoop.internal.segment.Segment;
    +import org.apache.carbondata.scan.executor.exception.QueryExecutionException;
    +import org.apache.carbondata.scan.filter.FilterExpressionProcessor;
    +import org.apache.carbondata.scan.filter.FilterUtil;
    +import org.apache.carbondata.scan.filter.resolver.FilterResolverIntf;
    +import org.apache.commons.logging.Log;
    +import org.apache.commons.logging.LogFactory;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.JobContext;
    +
    +class InMemoryBTreeIndex implements Index {
    +
    +  private static final Log LOG = LogFactory.getLog(InMemoryBTreeIndex.class);
    +  private Segment segment;
    +
    +  InMemoryBTreeIndex(Segment segment) {
    --- End diff --
    
    I guess we are supposed to pass the list of valid segments here.
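
    A minimal sketch of that alternative, with the constructor shape assumed
    for illustration (it is not part of the PR):

        class InMemoryBTreeIndex /* implements Index */ {
          private final List<Segment> validSegments;

          // hypothetical constructor: take all valid segments instead of one
          InMemoryBTreeIndex(List<Segment> validSegments) {
            this.validSegments = validSegments;
          }

          // block pruning would then iterate over validSegments and union the result
        }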



[GitHub] incubator-carbondata pull request #208: [CARBONDATA-284][WIP] Abstracting in...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/208#discussion_r82445888
  
    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/internal/segment/StreamingSegment.java ---
    @@ -0,0 +1,45 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +
    +package org.apache.carbondata.hadoop.internal.segment;
    +
    +import java.io.IOException;
    +import java.util.List;
    +
    +import org.apache.carbondata.hadoop.api.CarbonInputFormatBase;
    +import org.apache.carbondata.scan.model.QueryModel;
    +import org.apache.hadoop.mapreduce.InputSplit;
    +import org.apache.hadoop.mapreduce.JobContext;
    +
    +public class StreamingSegment extends Segment {
    --- End diff --
    
    This segment is for appending data from a stream, right? Why should we handle it separately while reading? Can't we handle it as a normal segment outside, but handle reading differently in the carbon layer?

