Posted to user@spark.apache.org by Sid <fl...@gmail.com> on 2022/05/25 20:06:51 UTC

Complexity with the data

Hi Experts,

I have the CSV data below, which is generated automatically. I can't
change the data manually.

The data looks like below:

2020-12-12,abc,2000,,INR,
2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
2020-12-09,fgh,,software_developer,I only manage the development part.

Since I don't have much experience with the other domains.

It is handled by the other people.,INR
2020-12-12,abc,2000,,USD,

The third record is a problem: the user inserted newlines into the value
while filling out the form. So, how do I handle this?

There are 6 columns and 4 records in total. These are the sample records.

Should I load it as an RDD and then maybe use a regex to eliminate the
newlines? Or how should it be done? With something like ".\n"?

Any suggestions?

Thanks,
Sid

Re: Complexity with the data

Posted by Sid <fl...@gmail.com>.
Hi Gourav,

Please see the link below for a detailed explanation.

https://stackoverflow.com/questions/72389385/how-to-load-complex-data-using-pyspark/72391090#72391090

@Bjørn Jørgensen <bj...@gmail.com>:

I was able to read this kind of data using the code below:

spark.read.option("header", True) \
    .option("multiline", "true") \
    .option("escape", "\"") \
    .csv("sample1.csv")
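
For reference, a slightly fuller sketch of the same read with an explicit
schema (the column names and types below are placeholders for the six sample
columns, not the file's real headers):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("multiline-csv").getOrCreate()

# Placeholder schema; adjust names and types to the actual dataset.
schema = StructType([
    StructField("date", StringType(), True),
    StructField("name", StringType(), True),
    StructField("amount", StringType(), True),
    StructField("description", StringType(), True),
    StructField("currency", StringType(), True),
    StructField("comment", StringType(), True),
])

df = (spark.read
      .option("header", True)       # first line holds column names
      .option("multiline", "true")  # quoted fields may span newlines
      .option("escape", "\"")       # embedded quotes are escaped by doubling
      .schema(schema)               # explicit schema avoids an inferSchema pass
      .csv("sample1.csv"))

df.show(truncate=False)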


Also, I have a question about one of my columns, which has data like below:


[image: image.png]


Have a look at the second record. Should I mark it as a corrupt record?
Or is there any way to process such records?
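
One way to process rather than drop such rows, sketched under the assumption
that malformed lines should be kept for inspection: Spark's CSV reader can
route records it cannot parse into a dedicated column via PERMISSIVE mode
(the schema below reuses the placeholder column names from above):

from pyspark.sql.types import StructType, StructField, StringType

# The corrupt-record column must be declared in the schema as a StringType.
schema = StructType([
    StructField("date", StringType(), True),
    StructField("name", StringType(), True),
    StructField("amount", StringType(), True),
    StructField("description", StringType(), True),
    StructField("currency", StringType(), True),
    StructField("comment", StringType(), True),
    StructField("_corrupt_record", StringType(), True),
])

df = (spark.read
      .option("header", True)
      .option("multiline", "true")
      .option("escape", "\"")
      .option("mode", "PERMISSIVE")  # keep malformed rows instead of failing
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .schema(schema)
      .csv("sample1.csv"))

df.cache()  # some Spark versions require caching before filtering on this column
bad = df.filter(df["_corrupt_record"].isNotNull())   # inspect these by hand
good = df.filter(df["_corrupt_record"].isNull()).drop("_corrupt_record")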


Thanks,

Sid

Re: Complexity with the data

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,
Can you please give us a simple map of what the input is and what the
output should look like? From your description it is a bit difficult to
figure out exactly how you want the records parsed.


Regards,
Gourav Sengupta


Re: Complexity with the data

Posted by Bjørn Jørgensen <bj...@gmail.com>.
Yes, but how do you read it with Spark?

On Thu, 26 May 2022 at 18:30, Sid <fl...@gmail.com> wrote:

> I am not reading it through pandas. I am using Spark because when I tried
> to use pandas which comes under import pyspark.pandas, it gives me an
> error.
>
> On Thu, May 26, 2022 at 9:52 PM Bjørn Jørgensen <bj...@gmail.com>
> wrote:
>
>> ok, but how do you read it now?
>>
>>
>> https://github.com/apache/spark/blob/8f610d1b4ce532705c528f3c085b0289b2b17a94/python/pyspark/pandas/namespace.py#L216
>> probably have to be updated with the default options. This is so that
>> pandas API on spark will be like pandas.
>>
>> On Thu, 26 May 2022 at 17:38, Sid <fl...@gmail.com> wrote:
>>
>>> I was passing the wrong escape characters due to which I was facing the
>>> issue. I have updated the user's answer on my post. Now I am able to load
>>> the dataset.
>>>
>>> Thank you everyone for your time and help!
>>>
>>> Much appreciated.
>>>
>>> I have more datasets like this. I hope that would be resolved using this
>>> approach :) Fingers crossed.
>>>
>>> Thanks,
>>> Sid
>>>
>>> On Thu, May 26, 2022 at 8:43 PM Apostolos N. Papadopoulos <
>>> papadopo@csd.auth.gr> wrote:
>>>
>>>> Since you cannot create the DF directly, you may try to first create an
>>>> RDD of tuples from the file
>>>>
>>>> and then convert the RDD to a DF by using the toDF() transformation.
>>>>
>>>> Perhaps you may bypass the issue with this.
>>>>
>>>> Another thing that I have seen in the example is that you are using ""
>>>> as an escape character.
>>>>
>>>> Can you check if this may cause any issues?
>>>>
>>>> Regards,
>>>>
>>>> Apostolos
>>>>
>>>>
>>>>
>>>> On 26/5/22 16:31, Sid wrote:
>>>>
>>>> Thanks for opening the issue, Bjorn. However, could you help me to
>>>> address the problem for now with some kind of alternative?
>>>>
>>>> I am actually stuck in this since yesterday.
>>>>
>>>> Thanks,
>>>> Sid
>>>>
>>>> On Thu, 26 May 2022, 18:48 Bjørn Jørgensen, <bj...@gmail.com>
>>>> wrote:
>>>>
>>>>> Yes, it looks like a bug that we also have in pandas API on spark.
>>>>>
>>>>> So I have opened a JIRA
>>>>> <https://issues.apache.org/jira/browse/SPARK-39304> for this.
>>>>>
>>>>> On Thu, 26 May 2022 at 11:09, Sid <fl...@gmail.com> wrote:
>>>>>
>>>>>> Hello Everyone,
>>>>>>
>>>>>> I have posted a question finally with the dataset and the column
>>>>>> names.
>>>>>>
>>>>>> PFB link:
>>>>>>
>>>>>>
>>>>>> https://stackoverflow.com/questions/72389385/how-to-load-complex-data-using-pyspark
>>>>>>
>>>>>> Thanks,
>>>>>> Sid
>>>>>>
>>>>>> On Thu, May 26, 2022 at 2:40 AM Bjørn Jørgensen <
>>>>>> bjornjorgensen@gmail.com> wrote:
>>>>>>
>>>>>>> Sid, dump one of yours files.
>>>>>>>
>>>>>>>
>>>>>>> https://sparkbyexamples.com/pyspark/pyspark-read-csv-file-into-dataframe/
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, 25 May 2022 at 23:04, Sid <fl...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I have 10 columns with me but in the dataset, I observed that some
>>>>>>>> records have 11 columns of data(for the additional column it is marked as
>>>>>>>> null). But, how do I handle this?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Sid
>>>>>>>>
>>>>>>>> On Thu, May 26, 2022 at 2:22 AM Sid <fl...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> How can I do that? Any examples or links, please. This works well
>>>>>>>>> with pandas, I suppose; it's just that I would need to convert back to a
>>>>>>>>> Spark data frame by providing a schema. Since we are using a lower Spark
>>>>>>>>> version, and pandas won't work in a distributed way in the lower versions,
>>>>>>>>> I was wondering if Spark could handle this in a better way.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Sid
>>>>>>>>>
>>>>>>>>> On Thu, May 26, 2022 at 2:19 AM Gavin Ray <ra...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Forgot to reply-all last message, whoops. Not very good at email.
>>>>>>>>>>
>>>>>>>>>> You need to normalize the CSV with a parser that can escape
>>>>>>>>>> commas inside of strings.
>>>>>>>>>> Not sure if Spark has an option for this?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, May 25, 2022 at 4:37 PM Sid <fl...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thank you so much for your time.
>>>>>>>>>>>
>>>>>>>>>>> I have data like below, which I tried to load by setting multiple
>>>>>>>>>>> options while reading the file; however, I am not able to
>>>>>>>>>>> consolidate the 9th column's data within itself.
>>>>>>>>>>>
>>>>>>>>>>> [image: image.png]
>>>>>>>>>>>
>>>>>>>>>>> I tried the below code:
>>>>>>>>>>>
>>>>>>>>>>> df = spark.read.option("header", "true") \
>>>>>>>>>>>     .option("multiline", "true") \
>>>>>>>>>>>     .option("inferSchema", "true") \
>>>>>>>>>>>     .option("quote", '"') \
>>>>>>>>>>>     .option("delimiter", ",") \
>>>>>>>>>>>     .csv("path")
>>>>>>>>>>>
>>>>>>>>>>> What else I can do?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Sid
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, May 26, 2022 at 1:46 AM Apostolos N. Papadopoulos <
>>>>>>>>>>> papadopo@csd.auth.gr> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Dear Sid,
>>>>>>>>>>>>
>>>>>>>>>>>> can you please give us more info? Is it true that every line may
>>>>>>>>>>>> have a different number of columns? Is there any rule followed by
>>>>>>>>>>>> every line of the file? From the information you have sent I cannot
>>>>>>>>>>>> fully understand the "schema" of your data.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>>
>>>>>>>>>>>> Apostolos
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 25/5/22 23:06, Sid wrote:
>>>>>>>>>>>> > Hi Experts,
>>>>>>>>>>>> >
>>>>>>>>>>>> > I have below CSV data that is getting generated
>>>>>>>>>>>> automatically. I can't
>>>>>>>>>>>> > change the data manually.
>>>>>>>>>>>> >
>>>>>>>>>>>> > The data looks like below:
>>>>>>>>>>>> >
>>>>>>>>>>>> > 2020-12-12,abc,2000,,INR,
>>>>>>>>>>>> > 2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
>>>>>>>>>>>> > 2020-12-09,fgh,,software_developer,I only manage the
>>>>>>>>>>>> development part.
>>>>>>>>>>>> >
>>>>>>>>>>>> > Since I don't have much experience with the other domains.
>>>>>>>>>>>> >
>>>>>>>>>>>> > It is handled by the other people.,INR
>>>>>>>>>>>> > 2020-12-12,abc,2000,,USD,
>>>>>>>>>>>> >
>>>>>>>>>>>> > The third record is a problem. Since the value is separated
>>>>>>>>>>>> by the new
>>>>>>>>>>>> > line by the user while filling up the form. So, how do I
>>>>>>>>>>>> handle this?
>>>>>>>>>>>> >
>>>>>>>>>>>> > There are 6 columns and 4 records in total. These are the
>>>>>>>>>>>> sample records.
>>>>>>>>>>>> >
>>>>>>>>>>>> > Should I load it as RDD and then may be using a regex should
>>>>>>>>>>>> eliminate
>>>>>>>>>>>> > the new lines? Or how it should be? with ". /n" ?
>>>>>>>>>>>> >
>>>>>>>>>>>> > Any suggestions?
>>>>>>>>>>>> >
>>>>>>>>>>>> > Thanks,
>>>>>>>>>>>> > Sid
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Apostolos N. Papadopoulos, Associate Professor
>>>>>>>>>>>> Department of Informatics
>>>>>>>>>>>> Aristotle University of Thessaloniki
>>>>>>>>>>>> Thessaloniki, GREECE
>>>>>>>>>>>> tel: ++0030312310991918
>>>>>>>>>>>> email: papadopo@csd.auth.gr
>>>>>>>>>>>> twitter: @papadopoulos_ap
>>>>>>>>>>>> web: http://datalab.csd.auth.gr/~apostol
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>
>>>>> --
>>>>> Bjørn Jørgensen
>>>>> Vestre Aspehaug 4, 6010 Ålesund
>>>>> Norge
>>>>>
>>>>> +47 480 94 297
>>>>>
>>>> --
>>>> Apostolos N. Papadopoulos, Associate Professor
>>>> Department of Informatics
>>>> Aristotle University of Thessaloniki
>>>> Thessaloniki, GREECE
>>>> tel: ++0030312310991918
>>>> email: papadopo@csd.auth.gr
>>>> twitter: @papadopoulos_ap
>>>> web: http://datalab.csd.auth.gr/~apostol
>>>>
>>>>
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>>
>

Re: Complexity with the data

Posted by Sid <fl...@gmail.com>.
I am not reading it through pandas. I am using Spark, because when I tried
to use the pandas API that comes under "import pyspark.pandas", it gave me
an error.


Re: Complexity with the data

Posted by Bjørn Jørgensen <bj...@gmail.com>.
OK, but how do you read it now?

https://github.com/apache/spark/blob/8f610d1b4ce532705c528f3c085b0289b2b17a94/python/pyspark/pandas/namespace.py#L216
probably has to be updated with the default options, so that the pandas API
on Spark behaves like pandas.


-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297

Re: Complexity with the data

Posted by Sid <fl...@gmail.com>.
I was passing the wrong escape characters, which is why I was facing the
issue. I have updated the user's answer on my post. Now I am able to load
the dataset.

Thank you everyone for your time and help!

Much appreciated.

I have more datasets like this. I hope they will be resolved using this
approach :) Fingers crossed.

Thanks,
Sid


Re: Complexity with the data

Posted by "Apostolos N. Papadopoulos" <pa...@csd.auth.gr>.
Since you cannot create the DF directly, you may try to first create an
RDD of tuples from the file and then convert the RDD to a DF by using the
toDF() transformation. Perhaps you may bypass the issue with this.

Another thing that I have seen in the example is that you are using ""
as an escape character. Can you check if this may cause any issues?
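
A rough sketch of that route, under two assumptions that may not hold for
the real file: records start with a YYYY-MM-DD date, and there are six
columns. Note that a plain split(",") still breaks on commas inside
free-text fields, so a real parser (e.g. Python's csv module) may be needed
there:

import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the file as one string and re-join lines that do not start a new
# record. Fine for small files; large files would need a per-partition stitch.
text = spark.sparkContext.wholeTextFiles("sample1.csv").values().first()
stitched = re.sub(r"\n(?!\d{4}-\d{2}-\d{2},)", " ", text)

# Split each record into a tuple, padding short rows to six fields.
rows = [tuple((line.split(",") + [None] * 6)[:6])
        for line in stitched.splitlines() if line.strip()]

df = spark.sparkContext.parallelize(rows).toDF(
    ["date", "name", "amount", "description", "currency", "comment"])
df.show(truncate=False)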

Regards,

Apostolos



-- 
Apostolos N. Papadopoulos, Associate Professor
Department of Informatics
Aristotle University of Thessaloniki
Thessaloniki, GREECE
tel: ++0030312310991918
email: papadopo@csd.auth.gr
twitter: @papadopoulos_ap
web: http://datalab.csd.auth.gr/~apostol

Re: Complexity with the data

Posted by Sid <fl...@gmail.com>.
Thanks for opening the issue, Bjørn. However, could you help me address
the problem for now with some kind of alternative?

I have actually been stuck on this since yesterday.

Thanks,
Sid


Re: Complexity with the data

Posted by Bjørn Jørgensen <bj...@gmail.com>.
Yes, it looks like a bug that we also have in the pandas API on Spark.

So I have opened a JIRA <https://issues.apache.org/jira/browse/SPARK-39304>
for this.


-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297

Re: Complexity with the data

Posted by Sid <fl...@gmail.com>.
Hello Everyone,

I have finally posted a question with the dataset and the column names.

Please find the link below:

https://stackoverflow.com/questions/72389385/how-to-load-complex-data-using-pyspark

Thanks,
Sid


Re: Complexity with the data

Posted by Bjørn Jørgensen <bj...@gmail.com>.
Sid, dump one of your files.

https://sparkbyexamples.com/pyspark/pyspark-read-csv-file-into-dataframe/




Re: Complexity with the data

Posted by Sid <fl...@gmail.com>.
I have 10 columns, but I observed that some records in the dataset have
11 columns of data (for the additional column it is marked as null).
How do I handle this?

Thanks,
Sid
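
If the extra value can simply be discarded, one approach (a sketch; the
column names are hypothetical, since the real header isn't shown) is to
pin an explicit 10-column schema and read in PERMISSIVE mode: with a
user-defined schema, Spark's CSV reader should pad records that have too
few tokens with nulls and drop extra trailing tokens (worth confirming
on your Spark version).

from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical column names -- replace with the real header.
cols = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10"]
schema = StructType([StructField(c, StringType(), True) for c in cols])

df = (spark.read
      .option("header", "true")
      .option("mode", "PERMISSIVE")   # tolerate ragged rows
      .schema(schema)
      .csv("path"))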

On Thu, May 26, 2022 at 2:22 AM Sid <fl...@gmail.com> wrote:

> How can I do that? Any examples or links, please. This works well with
> pandas, I suppose, but then I would need to convert back to a Spark
> DataFrame by providing a schema. Since we are on an older Spark version,
> pandas won't run in a distributed way, so I was wondering whether Spark
> could handle this in a better way.
>
> Thanks,
> Sid

Re: Complexity with the data

Posted by Sid <fl...@gmail.com>.
How can I do that? Any examples or links, please. This works well with
pandas, I suppose, but then I would need to convert back to a Spark
DataFrame by providing a schema. Since we are on an older Spark version,
pandas won't run in a distributed way, so I was wondering whether Spark
could handle this in a better way.

Thanks,
Sid
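
One way to stay inside Spark without pandas (a sketch with a placeholder
path and hypothetical column names; note that wholeTextFiles reads each
file whole, so it assumes the individual files fit in an executor's
memory) is to parse with Python's csv module, which copes with quoted
fields that contain commas and newlines, and then apply a schema:

import csv
import io
from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical 6-column schema -- adjust names/types to the real data.
cols = ["date", "name", "amount", "role", "comments", "currency"]
schema = StructType([StructField(c, StringType(), True) for c in cols])

def parse_file(path_and_content):
    _, content = path_and_content
    # csv.reader yields one list of fields per logical record;
    # a header row, if present, will come through as a record too.
    return csv.reader(io.StringIO(content))

rows = spark.sparkContext.wholeTextFiles("path").flatMap(parse_file)
df = spark.createDataFrame(rows, schema)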

On Thu, May 26, 2022 at 2:19 AM Gavin Ray <ra...@gmail.com> wrote:

> Forgot to reply-all last message, whoops. Not very good at email.
>
> You need to normalize the CSV with a parser that can escape commas inside
> of strings
> Not sure if Spark has an option for this?
>
>
> On Wed, May 25, 2022 at 4:37 PM Sid <fl...@gmail.com> wrote:
>
>> Thank you so much for your time.
>>
>> I have data like below which I tried to load by setting multiple options
>> while reading the file but however, but I am not able to consolidate the
>> 9th column data within itself.
>>
>> [image: image.png]
>>
>> I tried the below code:
>>
>> df = spark.read.option("header", "true").option("multiline",
>> "true").option("inferSchema", "true").option("quote",
>>
>>                                 '"').option(
>>     "delimiter", ",").csv("path")
>>
>> What else I can do?
>>
>> Thanks,
>> Sid
>>
>>
>> On Thu, May 26, 2022 at 1:46 AM Apostolos N. Papadopoulos <
>> papadopo@csd.auth.gr> wrote:
>>
>>> Dear Sid,
>>>
>>> can you please give us more info? Is it true that every line may have a
>>> different number of columns? Is there any rule followed by
>>>
>>> every line of the file? From the information you have sent I cannot
>>> fully understand the "schema" of your data.
>>>
>>> Regards,
>>>
>>> Apostolos
>>>
>>>
>>> On 25/5/22 23:06, Sid wrote:
>>> > Hi Experts,
>>> >
>>> > I have below CSV data that is getting generated automatically. I can't
>>> > change the data manually.
>>> >
>>> > The data looks like below:
>>> >
>>> > 2020-12-12,abc,2000,,INR,
>>> > 2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
>>> > 2020-12-09,fgh,,software_developer,I only manage the development part.
>>> >
>>> > Since I don't have much experience with the other domains.
>>> >
>>> > It is handled by the other people.,INR
>>> > 2020-12-12,abc,2000,,USD,
>>> >
>>> > The third record is a problem. Since the value is separated by the new
>>> > line by the user while filling up the form. So, how do I handle this?
>>> >
>>> > There are 6 columns and 4 records in total. These are the sample
>>> records.
>>> >
>>> > Should I load it as RDD and then may be using a regex should eliminate
>>> > the new lines? Or how it should be? with ". /n" ?
>>> >
>>> > Any suggestions?
>>> >
>>> > Thanks,
>>> > Sid
>>>
>>> --
>>> Apostolos N. Papadopoulos, Associate Professor
>>> Department of Informatics
>>> Aristotle University of Thessaloniki
>>> Thessaloniki, GREECE
>>> tel: ++0030312310991918
>>> email: papadopo@csd.auth.gr
>>> twitter: @papadopoulos_ap
>>> web: http://datalab.csd.auth.gr/~apostol
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>>
>>>

Re: Complexity with the data

Posted by Gavin Ray <ra...@gmail.com>.
Forgot to reply-all on the last message, whoops. Not very good at email.

You need to normalize the CSV with a parser that can escape commas inside
of strings.
Not sure if Spark has an option for this?
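
A minimal normalization sketch along those lines (the filenames are
placeholders): re-parse the raw file with Python's csv module, which
understands quoted fields, and rewrite each record onto a single line so
that downstream readers see exactly one record per row.

import csv

with open("raw.csv", newline="") as src, \
     open("clean.csv", "w", newline="") as dst:
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
    for record in csv.reader(src):
        # Flatten newlines inside fields so each record fits on one line.
        writer.writerow([field.replace("\n", " ") for field in record])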


On Wed, May 25, 2022 at 4:37 PM Sid <fl...@gmail.com> wrote:

> Thank you so much for your time.
>
> I have data like below, which I tried to load by setting multiple options
> while reading the file; however, I am not able to consolidate the 9th
> column's data within itself.
>
> [image: image.png]
>
> I tried the below code:
>
> df = (spark.read.option("header", "true")
>       .option("multiline", "true")
>       .option("inferSchema", "true")
>       .option("quote", '"')
>       .option("delimiter", ",")
>       .csv("path"))
>
> What else can I do?
>
> Thanks,
> Sid

Re: Complexity with the data

Posted by Sid <fl...@gmail.com>.
Thank you so much for your time.

I have data like below, which I tried to load by setting multiple options
while reading the file; however, I am not able to consolidate the 9th
column's data within itself.

[image: image.png]

I tried the below code:

df = (spark.read.option("header", "true")
      .option("multiline", "true")
      .option("inferSchema", "true")
      .option("quote", '"')
      .option("delimiter", ",")
      .csv("path"))

What else can I do?

Thanks,
Sid
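
One thing the snippet above never sets is an escape character. If the
quotes inside the 9th column are doubled (standard CSV escaping), a
sketch of a read that should keep the multi-line field together (the
path is a placeholder; worth verifying against one of the real files):

df = (spark.read
      .option("header", "true")
      .option("multiline", "true")
      .option("quote", '"')
      .option("escape", '"')    # treat "" inside a quoted field as a literal quote
      .option("delimiter", ",")
      .csv("path"))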


On Thu, May 26, 2022 at 1:46 AM Apostolos N. Papadopoulos <
papadopo@csd.auth.gr> wrote:

> Dear Sid,
>
> Can you please give us more info? Is it true that every line may have a
> different number of columns? Is there any rule followed by every line of
> the file? From the information you have sent I cannot fully understand
> the "schema" of your data.
>
> Regards,
>
> Apostolos

Re: Complexity with the data

Posted by "Apostolos N. Papadopoulos" <pa...@csd.auth.gr>.
Dear Sid,

Can you please give us more info? Is it true that every line may have a
different number of columns? Is there any rule followed by every line of
the file? From the information you have sent I cannot fully understand
the "schema" of your data.

Regards,

Apostolos



-- 
Apostolos N. Papadopoulos, Associate Professor
Department of Informatics
Aristotle University of Thessaloniki
Thessaloniki, GREECE
tel: ++0030312310991918
email: papadopo@csd.auth.gr
twitter: @papadopoulos_ap
web: http://datalab.csd.auth.gr/~apostol


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org