Posted to dev@carbondata.apache.org by jackylk <gi...@git.apache.org> on 2016/08/29 09:20:09 UTC

[GitHub] incubator-carbondata pull request #104: [CARBONDATA-188] Compress CSV file b...

GitHub user jackylk opened a pull request:

    https://github.com/apache/incubator-carbondata/pull/104

    [CARBONDATA-188] Compress CSV file before loading

    Currently, when loading data into CarbonData using the Spark DataFrame API, the data is first saved as a CSV file and then loaded into CarbonData files.
    
    Since this intermediate CSV can require a lot of disk space, this PR saves a gzip-compressed CSV file instead of a plain text one, then loads it into CarbonData.
    
    On my laptop, when loading 1 million records, the disk space required for the CSV file is reduced by a factor of 4 to 5.
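The size reduction claimed above can be illustrated with a self-contained JDK sketch (a hypothetical demo class, not part of the PR): gzip-compressing repetitive CSV rows with java.util.zip.GZIPOutputStream typically shrinks them several-fold, which is the saving the PR exploits.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class CsvGzipDemo {
    // Returns {plainBytes, gzipBytes} for a synthetic, repetitive CSV payload.
    static int[] sizes() {
        // Build CSV rows similar in spirit to fact-table data: an id,
        // a low-cardinality dimension, and a constant date column.
        StringBuilder csv = new StringBuilder();
        for (int i = 0; i < 100000; i++) {
            csv.append(i).append(",city_").append(i % 100).append(",2016-08-29\n");
        }
        byte[] plain = csv.toString().getBytes(StandardCharsets.UTF_8);
        try {
            // Compress the same bytes with GZIP and compare sizes.
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
                gz.write(plain);
            }
            return new int[] {plain.length, buf.size()};
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        int[] s = sizes();
        System.out.println("plain=" + s[0] + " bytes, gzip=" + s[1] + " bytes");
    }
}
```

The exact ratio depends on how repetitive the data is; the 4 to 5 times figure above is for the author's 1-million-record test.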

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jackylk/incubator-carbondata compress

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-carbondata/pull/104.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #104
    
----
commit ddeaecb9dad1b51be85302d0ff7ee9c31c1b13d7
Author: jackylk <ja...@huawei.com>
Date:   2016-08-29T08:41:38Z

    compress CSV file using GZIP while loading

commit 1bfc8c3bcb9a3809580386c16b5fe94b2c6b6943
Author: jackylk <ja...@huawei.com>
Date:   2016-08-29T09:05:17Z

    fix checkstyle

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---


Posted by QiangCai <gi...@git.apache.org>.
Github user QiangCai commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/104#discussion_r76723873
  
    --- Diff: integration/spark/src/main/scala/org/apache/carbondata/spark/csv/CarbonTextFile.scala ---
    @@ -36,6 +36,8 @@ private[csv] object CarbonTextFile {
         val hadoopConfiguration = new Configuration(sc.hadoopConfiguration)
         hadoopConfiguration.setStrings(FileInputFormat.INPUT_DIR, location)
         hadoopConfiguration.setBoolean(FileInputFormat.INPUT_DIR_RECURSIVE, true)
    +    hadoopConfiguration.set("io.compression.codecs", "org.apache.hadoop.io.compress.GzipCodec")
    --- End diff --
    
    Please check whether it is a compressed file.
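The check being requested can be a simple one, assuming the PR keys off the `.gz` extension (the helper name below is hypothetical; Hadoop's CompressionCodecFactory.getCodec would be the more general test): register the codec configuration only when the input actually is compressed.

```java
public class CompressionCheck {
    // Hypothetical helper: decide whether a path points at a gzip-compressed
    // file by its extension, so "io.compression.codecs" is set only when needed.
    static boolean isGzipFile(String path) {
        return path != null && path.toLowerCase().endsWith(".gz");
    }
}
```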


---


Posted by Zhangshunyu <gi...@git.apache.org>.
Github user Zhangshunyu commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/104#discussion_r76580960
  
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/csvreaderstep/UnivocityCsvParser.java ---
    @@ -112,25 +116,29 @@ private void initializeReader() throws IOException {
         // if already one input stream is open first we need to close and then
         // open new stream
         close();
    -    // get the block offset
    -    long startOffset = this.csvParserVo.getBlockDetailsList().get(blockCounter).getBlockOffset();
    -    FileType fileType = FileFactory
    -        .getFileType(this.csvParserVo.getBlockDetailsList().get(blockCounter).getFilePath());
    -    // calculate the end offset the block
    -    long endOffset =
    -        this.csvParserVo.getBlockDetailsList().get(blockCounter).getBlockLength() + startOffset;
    -
    -    // create a input stream for the block
    -    DataInputStream dataInputStream = FileFactory
    -        .getDataInputStream(this.csvParserVo.getBlockDetailsList().get(blockCounter).getFilePath(),
    -            fileType, bufferSize, startOffset);
    -    // if start offset is not 0 then reading then reading and ignoring the extra line
    -    if (startOffset != 0) {
    -      LineReader lineReader = new LineReader(dataInputStream, 1);
    -      startOffset += lineReader.readLine(new Text(), 0);
    +
    +    String path = this.csvParserVo.getBlockDetailsList().get(blockCounter).getFilePath();
    +    FileType fileType = FileFactory.getFileType(path);
    +
    +    if (path.endsWith(".gz")) {
    +      DataInputStream dataInputStream =
    +          FileFactory.getCompressedDataInputStream(path, fileType, bufferSize);
    +      inputStreamReader = new BufferedReader(new InputStreamReader(dataInputStream));
    +    } else {
    +      long startOffset = this.csvParserVo.getBlockDetailsList().get(blockCounter).getBlockOffset();
    +      long blockLength = this.csvParserVo.getBlockDetailsList().get(blockCounter).getBlockLength();
    +      long endOffset = blockLength + startOffset;
    +
    +      DataInputStream dataInputStream = FileFactory.getDataInputStream(path, fileType, bufferSize);
    +
     +      // if start offset is not 0, read and ignore the extra partial line
    +      if (startOffset != 0) {
    +        LineReader lineReader = new LineReader(dataInputStream, 1);
    +        startOffset += lineReader.readLine(new Text(), 0);
    +      }
    +      inputStreamReader = new BufferedReader(new InputStreamReader(
    +          new BoundedDataStream(dataInputStream, endOffset - startOffset)));
    --- End diff --
    
    Cannot find class BoundedDataStream.


---


Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/104#discussion_r76595559
  
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/csvreaderstep/UnivocityCsvParser.java ---
    @@ -112,25 +116,29 @@ private void initializeReader() throws IOException {
         // if already one input stream is open first we need to close and then
         // open new stream
         close();
    -    // get the block offset
    -    long startOffset = this.csvParserVo.getBlockDetailsList().get(blockCounter).getBlockOffset();
    -    FileType fileType = FileFactory
    -        .getFileType(this.csvParserVo.getBlockDetailsList().get(blockCounter).getFilePath());
    -    // calculate the end offset the block
    -    long endOffset =
    -        this.csvParserVo.getBlockDetailsList().get(blockCounter).getBlockLength() + startOffset;
    -
    -    // create a input stream for the block
    -    DataInputStream dataInputStream = FileFactory
    -        .getDataInputStream(this.csvParserVo.getBlockDetailsList().get(blockCounter).getFilePath(),
    -            fileType, bufferSize, startOffset);
    -    // if start offset is not 0 then reading then reading and ignoring the extra line
    -    if (startOffset != 0) {
    -      LineReader lineReader = new LineReader(dataInputStream, 1);
    -      startOffset += lineReader.readLine(new Text(), 0);
    +
    +    String path = this.csvParserVo.getBlockDetailsList().get(blockCounter).getFilePath();
    +    FileType fileType = FileFactory.getFileType(path);
    +
    +    if (path.endsWith(".gz")) {
    +      DataInputStream dataInputStream =
    +          FileFactory.getCompressedDataInputStream(path, fileType, bufferSize);
    +      inputStreamReader = new BufferedReader(new InputStreamReader(dataInputStream));
    +    } else {
    +      long startOffset = this.csvParserVo.getBlockDetailsList().get(blockCounter).getBlockOffset();
    +      long blockLength = this.csvParserVo.getBlockDetailsList().get(blockCounter).getBlockLength();
    +      long endOffset = blockLength + startOffset;
    +
    +      DataInputStream dataInputStream = FileFactory.getDataInputStream(path, fileType, bufferSize);
    +
     +      // if start offset is not 0, read and ignore the extra partial line
    +      if (startOffset != 0) {
    +        LineReader lineReader = new LineReader(dataInputStream, 1);
    +        startOffset += lineReader.readLine(new Text(), 0);
    +      }
    +      inputStreamReader = new BufferedReader(new InputStreamReader(
    +          new BoundedDataStream(dataInputStream, endOffset - startOffset)));
    --- End diff --
    
    Forgot to add it. Added now.
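For readers following along, a minimal version of such a bounding wrapper can be sketched as a FilterInputStream that reports end-of-stream after a fixed byte count. This is a hypothetical reconstruction for illustration, not necessarily the class actually added in the PR:

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch: an InputStream wrapper that reads at most `limit` bytes from the
// underlying stream, then reports end-of-stream. This lets a reader stay
// within one block of an uncompressed CSV file.
public class BoundedDataStream extends FilterInputStream {
    private long remaining;

    public BoundedDataStream(InputStream in, long limit) {
        super(in);
        this.remaining = limit;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) {
            return -1;
        }
        int b = super.read();
        if (b >= 0) {
            remaining--;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (remaining <= 0) {
            return -1;
        }
        int n = super.read(buf, off, (int) Math.min(len, remaining));
        if (n > 0) {
            remaining -= n;
        }
        return n;
    }
}
```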


---


Posted by QiangCai <gi...@git.apache.org>.
Github user QiangCai commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/104#discussion_r76957634
  
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/csvreaderstep/UnivocityCsvParser.java ---
    @@ -112,25 +116,28 @@ private void initializeReader() throws IOException {
         // if already one input stream is open first we need to close and then
         // open new stream
         close();
    -    // get the block offset
    -    long startOffset = this.csvParserVo.getBlockDetailsList().get(blockCounter).getBlockOffset();
    -    FileType fileType = FileFactory
    -        .getFileType(this.csvParserVo.getBlockDetailsList().get(blockCounter).getFilePath());
    -    // calculate the end offset the block
    -    long endOffset =
    -        this.csvParserVo.getBlockDetailsList().get(blockCounter).getBlockLength() + startOffset;
    -
    -    // create a input stream for the block
    -    DataInputStream dataInputStream = FileFactory
    -        .getDataInputStream(this.csvParserVo.getBlockDetailsList().get(blockCounter).getFilePath(),
    -            fileType, bufferSize, startOffset);
    -    // if start offset is not 0 then reading then reading and ignoring the extra line
    -    if (startOffset != 0) {
    -      LineReader lineReader = new LineReader(dataInputStream, 1);
    -      startOffset += lineReader.readLine(new Text(), 0);
    +
    +    String path = this.csvParserVo.getBlockDetailsList().get(blockCounter).getFilePath();
    +    FileType fileType = FileFactory.getFileType(path);
    +
    +    DataInputStream dataInputStream =
    +        FileFactory.getDataInputStream(path, fileType, bufferSize);
    --- End diff --
    
    For an uncompressed CSV file, the DataInputStream needs the startOffset.
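The point is that an uncompressed CSV block must start reading at its offset, and when that offset is non-zero the partial first line belongs to the previous block and is discarded. A hedged sketch of the pattern using only JDK streams (class and method names here are illustrative; the real code goes through FileFactory and Hadoop's LineReader):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class BlockReaderSketch {
    // Open a reader positioned at startOffset. A non-zero offset usually
    // falls mid-line; that partial line is owned by the previous block,
    // so it is read once and discarded.
    static BufferedReader openAtBlock(InputStream in, long startOffset) throws IOException {
        long skipped = 0;
        while (skipped < startOffset) {
            long n = in.skip(startOffset - skipped);
            if (n <= 0) {
                throw new IOException("could not skip to block offset");
            }
            skipped += n;
        }
        BufferedReader reader =
            new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        if (startOffset != 0) {
            reader.readLine(); // discard the partial first line
        }
        return reader;
    }
}
```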


---


Posted by QiangCai <gi...@git.apache.org>.
Github user QiangCai commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/104#discussion_r76723773
  
    --- Diff: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonDataRDDFactory.scala ---
    @@ -657,6 +657,8 @@ object CarbonDataRDDFactory extends Logging {
               val filePaths = carbonLoadModel.getFactFilePath
               hadoopConfiguration.set("mapreduce.input.fileinputformat.inputdir", filePaths)
               hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
    +          hadoopConfiguration.set("io.compression.codecs",
    +            "org.apache.hadoop.io.compress.GzipCodec")
    --- End diff --
    
    This configuration is only for compressed files.
    Please check whether it is a compressed file.



---


Posted by QiangCai <gi...@git.apache.org>.
Github user QiangCai commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/104#discussion_r76723805
  
    --- Diff: integration/spark/src/main/scala/org/apache/carbondata/spark/util/GlobalDictionaryUtil.scala ---
    @@ -364,6 +364,7 @@ object GlobalDictionaryUtil extends Logging {
           .option("escape", carbonLoadModel.getEscapeChar)
           .option("ignoreLeadingWhiteSpace", "false")
           .option("ignoreTrailingWhiteSpace", "false")
    +      .option("codec", "gzip")
    --- End diff --
    
    Please check whether it is a compressed file.


---


Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-carbondata/pull/104


---