You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@carbondata.apache.org by foryou2030 <gi...@git.apache.org> on 2016/08/22 11:41:12 UTC

[GitHub] incubator-carbondata pull request #82: [CARBONDATA-165] Support loading fact...

GitHub user foryou2030 opened a pull request:

    https://github.com/apache/incubator-carbondata/pull/82

    [CARBONDATA-165] Support loading fact file with header for all dictionary

    # Why raise this pr?
    When fact csv already have header and  giving FILEHEADER along with  ALL_DICTIONARY_PATH option ,  header will be considered as data row , which is not correct.
    FILEHEADER option must be given only when CSV do not have header .  We can read the header from fact file when FILEHEADER  is not  given with ALL_DICTIONARY_PATH 
    
    # How to solve?
    Add adapter for loading data with header. While loading fact CSV file with header, get file header from fact file instead of option("FILEHEADER") (the interface "getHeaderFormFactFile" did this)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/foryou2030/incubator-carbondata all_dict_header

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-carbondata/pull/82.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #82
    
----
commit 79d4b0ae87a58cd3437fc6f19bba9865afcac417
Author: foryou2030 <fo...@126.com>
Date:   2016-08-22T10:00:00Z

    adapt data with header for all dictionary

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #82: [CARBONDATA-165] Support loading fact...

Posted by QiangCai <gi...@git.apache.org>.
Github user QiangCai commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/82#discussion_r76737910
  
    --- Diff: integration/spark/src/main/scala/org/apache/carbondata/spark/util/GlobalDictionaryUtil.scala ---
    @@ -650,6 +650,51 @@ object GlobalDictionaryUtil extends Logging {
       }
     
       /**
    +   * get file headers from fact file
    +   *
    +   * @param carbonLoadModel
    +   * @return headers
    +   */
    +  private def getHeaderFormFactFile(carbonLoadModel: CarbonLoadModel): Array[String] = {
    +    var headers: Array[String] = null
    +    var factFile: String = null
    --- End diff --
    
    factFile = arbonLoadModel.getFactFilePath.split(",")(0)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #82: [CARBONDATA-165] Support loading fact...

Posted by QiangCai <gi...@git.apache.org>.
Github user QiangCai commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/82#discussion_r76749797
  
    --- Diff: integration/spark/src/main/scala/org/apache/carbondata/spark/util/GlobalDictionaryUtil.scala ---
    @@ -736,27 +761,20 @@ object GlobalDictionaryUtil extends Logging {
             logInfo("Generate global dictionary from all dictionary files!")
             val isNonempty = validateAllDictionaryPath(allDictionaryPath)
             if(isNonempty) {
    -          // fill the map[columnIndex -> columnName]
    -          var fileHeaders : Array[String] = null
    -          if(!StringUtils.isEmpty(carbonLoadModel.getCsvHeader)) {
    -            val splitColumns = carbonLoadModel.getCsvHeader.split("" + CSVWriter.DEFAULT_SEPARATOR)
    -            val fileHeadersArr = new ArrayBuffer[String]()
    -            for(i <- 0 until splitColumns.length) {
    -              fileHeadersArr += splitColumns(i).trim.toLowerCase()
    -            }
    -            fileHeaders = fileHeadersArr.toArray
    +          var headers = if (StringUtils.isEmpty(carbonLoadModel.getCsvHeader)) {
    +            getHeaderFormFactFile(carbonLoadModel)
               } else {
    -            logError("Not found file header! Please set fileheader")
    -            throw new IOException("Failed to get file header")
    +            carbonLoadModel.getCsvHeader.toLowerCase.split("" + CSVWriter.DEFAULT_SEPARATOR)
               }
    +          headers = headers.map(headerName => headerName.trim)
    --- End diff --
    
    headers = headers.map(_)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #82: [CARBONDATA-165] Support loading fact...

Posted by foryou2030 <gi...@git.apache.org>.
Github user foryou2030 commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/82#discussion_r76726188
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/util/CarbonUtil.java ---
    @@ -1443,5 +1446,32 @@ public static int getDictionaryChunkSize() {
         }
         return dictionaryOneChunkSize;
       }
    +
    +  /**
    +   * @param csvFilePath
    +   * @return
    +   */
    +  public static String readHeader(String csvFilePath) {
    +
    +    DataInputStream fileReader = null;
    +    BufferedReader bufferedReader = null;
    +    String readLine = null;
    +
    +    try {
    +      fileReader =
    +          FileFactory.getDataInputStream(csvFilePath, FileFactory.getFileType(csvFilePath));
    +      bufferedReader =
    +          new BufferedReader(new InputStreamReader(fileReader, Charset.defaultCharset()));
    --- End diff --
    
    ok, handled


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #82: [CARBONDATA-165] Support loading fact...

Posted by foryou2030 <gi...@git.apache.org>.
Github user foryou2030 commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/82#discussion_r76740663
  
    --- Diff: integration/spark/src/main/scala/org/apache/carbondata/spark/util/GlobalDictionaryUtil.scala ---
    @@ -650,6 +650,51 @@ object GlobalDictionaryUtil extends Logging {
       }
     
       /**
    +   * get file headers from fact file
    +   *
    +   * @param carbonLoadModel
    +   * @return headers
    +   */
    +  private def getHeaderFormFactFile(carbonLoadModel: CarbonLoadModel): Array[String] = {
    +    var headers: Array[String] = null
    +    var factFile: String = null
    +    val fileType = FileFactory.getFileType(carbonLoadModel.getFactFilePath)
    +    val filePath = FileFactory.getCarbonFile(carbonLoadModel.getFactFilePath, fileType)
    +    if (filePath.isDirectory) {
    +      val listFiles = filePath.getParentFile.listFiles()
    --- End diff --
    
    ok, thanks @QiangCai @manishgupta88 
    i have handled



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #82: [CARBONDATA-165] Support loading fact...

Posted by manishgupta88 <gi...@git.apache.org>.
Github user manishgupta88 commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/82#discussion_r76736815
  
    --- Diff: integration/spark/src/main/scala/org/apache/carbondata/spark/util/GlobalDictionaryUtil.scala ---
    @@ -650,6 +650,51 @@ object GlobalDictionaryUtil extends Logging {
       }
     
       /**
    +   * get file headers from fact file
    +   *
    +   * @param carbonLoadModel
    +   * @return headers
    +   */
    +  private def getHeaderFormFactFile(carbonLoadModel: CarbonLoadModel): Array[String] = {
    +    var headers: Array[String] = null
    +    var factFile: String = null
    +    val fileType = FileFactory.getFileType(carbonLoadModel.getFactFilePath)
    +    val filePath = FileFactory.getCarbonFile(carbonLoadModel.getFactFilePath, fileType)
    +    if (filePath.isDirectory) {
    +      val listFiles = filePath.getParentFile.listFiles()
    --- End diff --
    
    @foryou2030 ....Please clarify the below points.
    1. Why you are using filePath.getParentFile?
    2. There can be a case where one folder can contain only subFolders. In that case you will not be find a csv file. Ideally to find a file it should be a recursive call.
    
    @QiangCai ...please review this method and correct me if I am wrong


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #82: [CARBONDATA-165] Support loading fact...

Posted by QiangCai <gi...@git.apache.org>.
Github user QiangCai commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/82#discussion_r76739692
  
    --- Diff: integration/spark/src/main/scala/org/apache/carbondata/spark/util/GlobalDictionaryUtil.scala ---
    @@ -650,6 +650,51 @@ object GlobalDictionaryUtil extends Logging {
       }
     
       /**
    +   * get file headers from fact file
    +   *
    +   * @param carbonLoadModel
    +   * @return headers
    +   */
    +  private def getHeaderFormFactFile(carbonLoadModel: CarbonLoadModel): Array[String] = {
    +    var headers: Array[String] = null
    +    var factFile: String = null
    +    val fileType = FileFactory.getFileType(carbonLoadModel.getFactFilePath)
    +    val filePath = FileFactory.getCarbonFile(carbonLoadModel.getFactFilePath, fileType)
    +    if (filePath.isDirectory) {
    +      val listFiles = filePath.getParentFile.listFiles()
    --- End diff --
    
    Because we have inovked  FileUtils.getPaths to convert user's input to file list in LoadTable, so these codes are unnecessary.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #82: [CARBONDATA-165] Support loading fact...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-carbondata/pull/82


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #82: [CARBONDATA-165] Support loading fact...

Posted by manishgupta88 <gi...@git.apache.org>.
Github user manishgupta88 commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/82#discussion_r76625697
  
    --- Diff: core/src/main/java/org/apache/carbondata/core/util/CarbonUtil.java ---
    @@ -1443,5 +1446,32 @@ public static int getDictionaryChunkSize() {
         }
         return dictionaryOneChunkSize;
       }
    +
    +  /**
    +   * @param csvFilePath
    +   * @return
    +   */
    +  public static String readHeader(String csvFilePath) {
    +
    +    DataInputStream fileReader = null;
    +    BufferedReader bufferedReader = null;
    +    String readLine = null;
    +
    +    try {
    +      fileReader =
    +          FileFactory.getDataInputStream(csvFilePath, FileFactory.getFileType(csvFilePath));
    +      bufferedReader =
    +          new BufferedReader(new InputStreamReader(fileReader, Charset.defaultCharset()));
    --- End diff --
    
    @foryou2030 ....instead of using Charset.defaultCharset() use the below line of code.
    Charset.forName( CarbonCommonConstants.DEFAULT_CHARSET)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---