Posted to user@spark.apache.org by anbutech <an...@outlook.com> on 2020/01/09 02:20:35 UTC

Merge multiple different s3 logs using pyspark 2.4.3

Hello,

version = spark 2.4.3

I have JSON log data from 3 different sources, all sharing the same schema
(same column order) in the raw data. I want to add a new column,
"src_category", to each of the 3 sources to distinguish the source category,
then merge all 3 sources into a single dataframe and read the JSON data for
processing. What is the best way to handle this case?

df = spark.read.json(merged_3sourcesraw_data)

Input:

s3a://my-bucket/ingestion/source1/y=2019/m=12/d=12/logs1.json
s3a://my-bucket/ingestion/source2/y=2019/m=12/d=12/logs1.json
s3a://my-bucket/ingestion/source3/y=2019/m=12/d=12/logs1.json

output:
s3a://my-bucket/ingestion/processed/y=2019/m=12/d=12/src_category=other
s3a://my-bucket/ingestion/processed/y=2019/m=12/d=12/src_category=windows-new
s3a://my-bucket/ingestion/processed/y=2019/m=12/d=12/src_category=windows


Thanks




--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Merge multiple different s3 logs using pyspark 2.4.3

Posted by Gourav Sengupta <go...@gmail.com>.
Hi Shraddha,

what is interesting to me is that people do not even have the courtesy to
sign their name when they request help from user groups :)

your solution is spot on; there is another option available in Spark SQL
for this, though.
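A minimal sketch of that DataFrame route for Spark 2.4.3, using the paths from the original post. The folder-to-category mapping below is an assumption; the thread never states which source maps to which src_category value.

```python
# Sketch: tag each source with src_category, union, write partitioned.
# The SRC_CATEGORY mapping is hypothetical.
SRC_CATEGORY = {
    "source1": "other",
    "source2": "windows-new",
    "source3": "windows",
}

BASE = "s3a://my-bucket/ingestion"

def source_path(source, y, m, d):
    """Build the raw input path for one source and one day partition."""
    return "{}/{}/y={}/m={}/d={}".format(BASE, source, y, m, d)

def merge_sources(spark, y=2019, m=12, d=12):
    """Read all sources, tag each row, union, and write one output."""
    from pyspark.sql import functions as F

    frames = [
        spark.read.json(source_path(source, y, m, d))
             .withColumn("src_category", F.lit(category))
        for source, category in SRC_CATEGORY.items()
    ]

    merged = frames[0]
    for df in frames[1:]:
        # Schemas are identical, so unionByName (Spark 2.3+) is safe.
        merged = merged.unionByName(df)

    # partitionBy produces .../src_category=<value>/ subdirectories under
    # the processed prefix, matching the layout in the original post.
    (merged.write.mode("overwrite")
           .partitionBy("src_category")
           .json("{}/processed/y={}/m={}/d={}".format(BASE, y, m, d)))
```

Call merge_sources(spark) from a spark-submit job; reading the processed path back then exposes src_category as an ordinary column.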


Regards,
Gourav Sengupta

On Thu, Jan 9, 2020 at 1:19 PM Shraddha Shah <sh...@gmail.com>
wrote:

> Unless I am reading this wrong, this can be achieved with aws sync ?
>
> aws s3 sync
> s3://my-bucket/ingestion/source1/y=2019/m=12/d=12 s3://my-bucket/ingestion/processed/
> *src_category=other*/y=2019/m=12/d=12
>
> Thanks,
> -Shraddha

Re: Merge multiple different s3 logs using pyspark 2.4.3

Posted by Shraddha Shah <sh...@gmail.com>.
Unless I am reading this wrong, this can be achieved with aws s3 sync?

aws s3 sync \
  s3://my-bucket/ingestion/source1/y=2019/m=12/d=12 \
  s3://my-bucket/ingestion/processed/src_category=other/y=2019/m=12/d=12
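Extended to all three sources, the same idea can be scripted. This sketch only prints the commands (a dry run) so the mapping can be reviewed before running anything; the folder-to-category pairs are assumed, and the destination uses the partition layout from the original post (src_category after the date).

```shell
#!/bin/sh
# Dry run: print one aws s3 sync command per source. Swap sync_cmd's
# printf for a real invocation once the mapping is verified.
set -eu

BUCKET="s3://my-bucket/ingestion"
DAY="y=2019/m=12/d=12"

sync_cmd() {
    # $1 = source folder, $2 = src_category value
    printf 'aws s3 sync %s/%s/%s %s/processed/%s/src_category=%s\n' \
        "$BUCKET" "$1" "$DAY" "$BUCKET" "$DAY" "$2"
}

for pair in source1:other source2:windows-new source3:windows; do
    sync_cmd "${pair%%:*}" "${pair#*:}"
done
```

Note that, unlike the Spark route, this copies the files as-is: src_category exists only in the directory name, not inside the JSON records themselves.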

Thanks,
-Shraddha



On Thu, Jan 9, 2020 at 7:05 AM Gourav Sengupta <go...@gmail.com>
wrote:

> why s3a?
>

Re: Merge multiple different s3 logs using pyspark 2.4.3

Posted by Gourav Sengupta <go...@gmail.com>.
why s3a?
