Posted to dev@carbondata.apache.org by Sanoj MG <sa...@gmail.com> on 2017/04/11 02:36:22 UTC

Re: [jira] [Created] (CARBONDATA-836) Error in load using dataframe - columns containing comma

Hi All,

In CarbonDataFrameWriter, there is an option to load via a temporary CSV file.

if (options.tempCSV) {
  loadTempCSV(options)
} else {
  loadDataFrame(options)
}

Why is this choice required? Is there any issue if we load the DataFrame
directly, without going through a CSV file?

I have many dimension tables with commas in their string columns, so I always
use .option("tempCSV", "false"). In CarbonOption, can we set the default value
to "false", as below?

def tempCSV: Boolean = options.getOrElse("tempCSV", "false").toBoolean
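
For reference, the full write call I use looks roughly like the sketch below
(the table name is just the example from the JIRA quoted below, and `df` is an
existing DataFrame):

import org.apache.spark.sql.SaveMode

df.write
  .format("carbondata")
  .option("tableName", "Branch1")   // example table name from the JIRA
  .option("tempCSV", "false")       // write the DataFrame directly, no temp CSV
  .option("compress", "true")
  .mode(SaveMode.Overwrite)
  .save()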

Thanks,
Sanoj


On Thu, Mar 30, 2017 at 12:14 PM, Sanoj MG (JIRA) <ji...@apache.org> wrote:

> Sanoj MG created CARBONDATA-836:
> -----------------------------------
>
>              Summary: Error in load using dataframe  - columns containing
> comma
>                  Key: CARBONDATA-836
>                  URL: https://issues.apache.org/jira/browse/CARBONDATA-836
>              Project: CarbonData
>           Issue Type: Bug
>           Components: spark-integration
>     Affects Versions: 1.1.0-incubating
>          Environment: HDP sandbox 2.5, Spark 1.6.2
>             Reporter: Sanoj MG
>             Priority: Minor
>              Fix For: NONE
>
>
> While trying to load data into a CarbonData table using a DataFrame, the
> columns containing commas are not loaded properly.
>
> Eg:
> scala> df.show(false)
> +-------+------+-----------+----------------+---------+------+
> |Country|Branch|Name       |Address         |ShortName|Status|
> +-------+------+-----------+----------------+---------+------+
> |2      |1     |Main Branch|XXXX, Dubai, UAE|UHO      |256   |
> +-------+------+-----------+----------------+---------+------+
>
>
> scala>  df.write.format("carbondata").option("tableName",
> "Branch1").option("compress", "true").mode(SaveMode.Overwrite).save()
>
>
> scala> cc.sql("select * from branch1").show(false)
>
> +-------+------+-----------+-------+---------+------+
> |country|branch|name       |address|shortname|status|
> +-------+------+-----------+-------+---------+------+
> |2      |1     |Main Branch|XXXX   | Dubai   |null  |
> +-------+------+-----------+-------+---------+------+
>
>
>
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.15#6346)
>

Re: [jira] [Created] (CARBONDATA-836) Error in load using dataframe - columns containing comma

Posted by Sanoj MG <sa...@gmail.com>.
Thanks, Jacky. I have created a JIRA for this:
https://issues.apache.org/jira/browse/CARBONDATA-909



Thanks,
Sanoj


Re: [jira] [Created] (CARBONDATA-836) Error in load using dataframe - columns containing comma

Posted by Jacky Li <ja...@qq.com>.
Hi Sanoj,

This is because the CarbonData loading flow needs to scan the input data twice (once to generate the global dictionary, and once for the actual load). If the user writes a DataFrame to CarbonData and that DataFrame is costly to compute, it is better to save it to a temporary CSV file first and load that into CarbonData, instead of computing the DataFrame twice.
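
As an illustrative sketch only (not how the writer works internally): when
tempCSV is "false", persisting the input DataFrame before the write is one way
to avoid paying that compute cost twice, e.g.

import org.apache.spark.sql.SaveMode
import org.apache.spark.storage.StorageLevel

// Cache the computed DataFrame so both scans during the load reuse it.
df.persist(StorageLevel.MEMORY_AND_DISK)

df.write
  .format("carbondata")
  .option("tableName", "Branch1")   // example table name from the JIRA
  .option("tempCSV", "false")       // skip the temporary CSV materialization
  .mode(SaveMode.Overwrite)
  .save()

df.unpersist()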

However, there is another option that should allow a single-pass data load: .option("single_pass", "true"). In that case the input DataFrame should be computed only once. But when I checked the code just now, it seems this behavior is not implemented. :(
Feel free to create a JIRA ticket for it if you want.
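
For reference, the intended usage would be something like the sketch below
(again, per the check above, the single_pass behavior may not actually take
effect yet):

import org.apache.spark.sql.SaveMode

df.write
  .format("carbondata")
  .option("tableName", "Branch1")   // example table name from the JIRA
  .option("single_pass", "true")    // build the dictionary during the load itself
  .mode(SaveMode.Overwrite)
  .save()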

Regards,
Jacky
