Posted to user@spark.apache.org by Sid <fl...@gmail.com> on 2022/04/26 14:43:14 UTC

Dealing with large number of small files

Hello,

Can somebody help me with the problem below?

https://stackoverflow.com/questions/72015557/dealing-with-large-number-of-small-json-files-using-pyspark


Thanks,
Sid

Re: Dealing with large number of small files

Posted by Bjørn Jørgensen <bj...@gmail.com>.
df = spark.read.json("/*.json")

Use a *.json glob pattern so one read call picks up every file.
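For what it's worth, here is a minimal standard-library sketch (with hypothetical file names and contents) of what the glob pattern does: every matching file is picked up as input, which is how `spark.read.json("<dir>/*.json")` gathers many small files into one DataFrame in a single read.

```python
import glob
import json
import os
import tempfile

# create a few hypothetical small JSON files, one object each
d = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(d, f"part{i}.json"), "w") as f:
        json.dump({"id": i}, f)

# a glob read expands the pattern and parses every matching file,
# analogous to spark.read.json("<dir>/*.json")
records = []
for path in sorted(glob.glob(os.path.join(d, "*.json"))):
    with open(path) as f:
        records.append(json.load(f))

print(records)  # [{'id': 0}, {'id': 1}, {'id': 2}]
```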


tir. 26. apr. 2022 kl. 16:44 skrev Sid <fl...@gmail.com>:

> Hello,
>
> Can somebody help me with the below problem?
>
>
> https://stackoverflow.com/questions/72015557/dealing-with-large-number-of-small-json-files-using-pyspark
>
>
> Thanks,
> Sid
>


-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297

Re: Dealing with large number of small files

Posted by Sid <fl...@gmail.com>.
Yes,


It created a list of records separated by commas (i.e. a valid JSON array), and it
was generated faster as well.

On Wed, 27 Apr 2022, 13:42 Gourav Sengupta, <go...@gmail.com>
wrote:

> Hi,
> did that result in valid JSON in the output file?
>
> Regards,
> Gourav Sengupta
>
> On Tue, Apr 26, 2022 at 8:18 PM Sid <fl...@gmail.com> wrote:
>
>> I have .txt files with JSON inside it. It is generated by some API calls
>> by the Client.
>>
>> On Wed, Apr 27, 2022 at 12:39 AM Bjørn Jørgensen <
>> bjornjorgensen@gmail.com> wrote:
>>
>>> What is that you have? Is it txt files or json files?
>>> Or do you have txt files with JSON inside?
>>>
>>>
>>>
>>> tir. 26. apr. 2022 kl. 20:41 skrev Sid <fl...@gmail.com>:
>>>
>>>> Thanks for your time, everyone :)
>>>>
>>>> Much appreciated.
>>>>
>>>> I solved it using jq utility since I was dealing with JSON. I have
>>>> solved it using below script:
>>>>
>>>> find . -name '*.txt' -exec cat '{}' + | jq -s '.' > output.txt
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Sid
>>>>
>>>>
>>>> On Tue, Apr 26, 2022 at 9:37 PM Bjørn Jørgensen <
>>>> bjornjorgensen@gmail.com> wrote:
>>>>
>>>>> and the bash script seems to read txt files not json
>>>>>
>>>>> for f in Agent/*.txt; do cat ${f} >> merged.json;done;
>>>>>
>>>>>
>>>>>
>>>>> tir. 26. apr. 2022 kl. 18:03 skrev Gourav Sengupta <
>>>>> gourav.sengupta@gmail.com>:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> what is the version of spark are you using? And where is the data
>>>>>> stored.
>>>>>>
>>>>>> I am not quite sure that just using a bash script will help because
>>>>>> concatenating all the files into a single file creates a valid JSON.
>>>>>>
>>>>>> Regards,
>>>>>> Gourav
>>>>>>
>>>>>> On Tue, Apr 26, 2022 at 3:44 PM Sid <fl...@gmail.com> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> Can somebody help me with the below problem?
>>>>>>>
>>>>>>>
>>>>>>> https://stackoverflow.com/questions/72015557/dealing-with-large-number-of-small-json-files-using-pyspark
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Sid
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Bjørn Jørgensen
>>>>> Vestre Aspehaug 4, 6010 Ålesund
>>>>> Norge
>>>>>
>>>>> +47 480 94 297
>>>>>
>>>>
>>>
>>> --
>>> Bjørn Jørgensen
>>> Vestre Aspehaug 4, 6010 Ålesund
>>> Norge
>>>
>>> +47 480 94 297
>>>
>>

Re: Dealing with large number of small files

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,
did that result in valid JSON in the output file?

Regards,
Gourav Sengupta

On Tue, Apr 26, 2022 at 8:18 PM Sid <fl...@gmail.com> wrote:

> I have .txt files with JSON inside it. It is generated by some API calls
> by the Client.
>
> On Wed, Apr 27, 2022 at 12:39 AM Bjørn Jørgensen <bj...@gmail.com>
> wrote:
>
>> What is that you have? Is it txt files or json files?
>> Or do you have txt files with JSON inside?
>>
>>
>>
>> tir. 26. apr. 2022 kl. 20:41 skrev Sid <fl...@gmail.com>:
>>
>>> Thanks for your time, everyone :)
>>>
>>> Much appreciated.
>>>
>>> I solved it using jq utility since I was dealing with JSON. I have
>>> solved it using below script:
>>>
>>> find . -name '*.txt' -exec cat '{}' + | jq -s '.' > output.txt
>>>
>>>
>>> Thanks,
>>>
>>> Sid
>>>
>>>
>>> On Tue, Apr 26, 2022 at 9:37 PM Bjørn Jørgensen <
>>> bjornjorgensen@gmail.com> wrote:
>>>
>>>> and the bash script seems to read txt files not json
>>>>
>>>> for f in Agent/*.txt; do cat ${f} >> merged.json;done;
>>>>
>>>>
>>>>
>>>> tir. 26. apr. 2022 kl. 18:03 skrev Gourav Sengupta <
>>>> gourav.sengupta@gmail.com>:
>>>>
>>>>> Hi,
>>>>>
>>>>> what is the version of spark are you using? And where is the data
>>>>> stored.
>>>>>
>>>>> I am not quite sure that just using a bash script will help because
>>>>> concatenating all the files into a single file creates a valid JSON.
>>>>>
>>>>> Regards,
>>>>> Gourav
>>>>>
>>>>> On Tue, Apr 26, 2022 at 3:44 PM Sid <fl...@gmail.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Can somebody help me with the below problem?
>>>>>>
>>>>>>
>>>>>> https://stackoverflow.com/questions/72015557/dealing-with-large-number-of-small-json-files-using-pyspark
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Sid
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Bjørn Jørgensen
>>>> Vestre Aspehaug 4, 6010 Ålesund
>>>> Norge
>>>>
>>>> +47 480 94 297
>>>>
>>>
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>>
>

Re: Dealing with large number of small files

Posted by Sid <fl...@gmail.com>.
I have .txt files with JSON inside them; they are generated by API calls made
by the client.

On Wed, Apr 27, 2022 at 12:39 AM Bjørn Jørgensen <bj...@gmail.com>
wrote:

> What is that you have? Is it txt files or json files?
> Or do you have txt files with JSON inside?
>
>
>
> tir. 26. apr. 2022 kl. 20:41 skrev Sid <fl...@gmail.com>:
>
>> Thanks for your time, everyone :)
>>
>> Much appreciated.
>>
>> I solved it using jq utility since I was dealing with JSON. I have solved
>> it using below script:
>>
>> find . -name '*.txt' -exec cat '{}' + | jq -s '.' > output.txt
>>
>>
>> Thanks,
>>
>> Sid
>>
>>
>> On Tue, Apr 26, 2022 at 9:37 PM Bjørn Jørgensen <bj...@gmail.com>
>> wrote:
>>
>>> and the bash script seems to read txt files not json
>>>
>>> for f in Agent/*.txt; do cat ${f} >> merged.json;done;
>>>
>>>
>>>
>>> tir. 26. apr. 2022 kl. 18:03 skrev Gourav Sengupta <
>>> gourav.sengupta@gmail.com>:
>>>
>>>> Hi,
>>>>
>>>> what is the version of spark are you using? And where is the data
>>>> stored.
>>>>
>>>> I am not quite sure that just using a bash script will help because
>>>> concatenating all the files into a single file creates a valid JSON.
>>>>
>>>> Regards,
>>>> Gourav
>>>>
>>>> On Tue, Apr 26, 2022 at 3:44 PM Sid <fl...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> Can somebody help me with the below problem?
>>>>>
>>>>>
>>>>> https://stackoverflow.com/questions/72015557/dealing-with-large-number-of-small-json-files-using-pyspark
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Sid
>>>>>
>>>>
>>>
>>> --
>>> Bjørn Jørgensen
>>> Vestre Aspehaug 4, 6010 Ålesund
>>> Norge
>>>
>>> +47 480 94 297
>>>
>>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
>

Re: Dealing with large number of small files

Posted by Bjørn Jørgensen <bj...@gmail.com>.
What is it that you have? Is it txt files or JSON files?
Or do you have txt files with JSON inside?



tir. 26. apr. 2022 kl. 20:41 skrev Sid <fl...@gmail.com>:

> Thanks for your time, everyone :)
>
> Much appreciated.
>
> I solved it using jq utility since I was dealing with JSON. I have solved
> it using below script:
>
> find . -name '*.txt' -exec cat '{}' + | jq -s '.' > output.txt
>
>
> Thanks,
>
> Sid
>
>
> On Tue, Apr 26, 2022 at 9:37 PM Bjørn Jørgensen <bj...@gmail.com>
> wrote:
>
>> and the bash script seems to read txt files not json
>>
>> for f in Agent/*.txt; do cat ${f} >> merged.json;done;
>>
>>
>>
>> tir. 26. apr. 2022 kl. 18:03 skrev Gourav Sengupta <
>> gourav.sengupta@gmail.com>:
>>
>>> Hi,
>>>
>>> what is the version of spark are you using? And where is the data stored.
>>>
>>> I am not quite sure that just using a bash script will help because
>>> concatenating all the files into a single file creates a valid JSON.
>>>
>>> Regards,
>>> Gourav
>>>
>>> On Tue, Apr 26, 2022 at 3:44 PM Sid <fl...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> Can somebody help me with the below problem?
>>>>
>>>>
>>>> https://stackoverflow.com/questions/72015557/dealing-with-large-number-of-small-json-files-using-pyspark
>>>>
>>>>
>>>> Thanks,
>>>> Sid
>>>>
>>>
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>>
>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297

Re: Dealing with large number of small files

Posted by Sid <fl...@gmail.com>.
Thanks for your time, everyone :)

Much appreciated.

I solved it with the jq utility, since I was dealing with JSON, using the
script below:

find . -name '*.txt' -exec cat '{}' + | jq -s '.' > output.txt
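(`jq -s '.'` "slurps" every JSON document from its input into a single JSON array, which is what makes the merged output valid. A pure-Python sketch of the same idea, using hypothetical one-object file contents:)

```python
import json

# hypothetical contents of three small .txt files, each holding one
# JSON object, as produced by the client's API calls
files = [
    '{"id": 1, "status": "ok"}',
    '{"id": 2, "status": "ok"}',
    '{"id": 3, "status": "failed"}',
]

# the equivalent of `jq -s '.'`: parse each document, collect into one array
merged = json.dumps([json.loads(text) for text in files])

print(merged)
# [{"id": 1, "status": "ok"}, {"id": 2, "status": "ok"}, {"id": 3, "status": "failed"}]
```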


Thanks,

Sid


On Tue, Apr 26, 2022 at 9:37 PM Bjørn Jørgensen <bj...@gmail.com>
wrote:

> and the bash script seems to read txt files not json
>
> for f in Agent/*.txt; do cat ${f} >> merged.json;done;
>
>
>
> tir. 26. apr. 2022 kl. 18:03 skrev Gourav Sengupta <
> gourav.sengupta@gmail.com>:
>
>> Hi,
>>
>> what is the version of spark are you using? And where is the data stored.
>>
>> I am not quite sure that just using a bash script will help because
>> concatenating all the files into a single file creates a valid JSON.
>>
>> Regards,
>> Gourav
>>
>> On Tue, Apr 26, 2022 at 3:44 PM Sid <fl...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> Can somebody help me with the below problem?
>>>
>>>
>>> https://stackoverflow.com/questions/72015557/dealing-with-large-number-of-small-json-files-using-pyspark
>>>
>>>
>>> Thanks,
>>> Sid
>>>
>>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
>

Re: Dealing with large number of small files

Posted by Bjørn Jørgensen <bj...@gmail.com>.
And the bash script seems to read txt files, not JSON:

for f in Agent/*.txt; do cat ${f} >> merged.json;done;
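The concern about that script is easy to reproduce: concatenating several top-level JSON objects yields a file that is not itself one valid JSON document. A small standard-library check, with hypothetical contents:

```python
import json

# what `cat file1 file2 >> merged.json` produces: documents back to back
merged = '{"id": 1}\n{"id": 2}\n'

try:
    json.loads(merged)
    is_valid = True
except json.JSONDecodeError:
    is_valid = False  # "Extra data": more than one top-level value

# wrapping the records in a JSON array (what `jq -s '.'` does) fixes it
docs = [line for line in merged.splitlines() if line.strip()]
wrapped = "[" + ",".join(docs) + "]"
assert json.loads(wrapped) == [{"id": 1}, {"id": 2}]
```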



tir. 26. apr. 2022 kl. 18:03 skrev Gourav Sengupta <
gourav.sengupta@gmail.com>:

> Hi,
>
> what is the version of spark are you using? And where is the data stored.
>
> I am not quite sure that just using a bash script will help because
> concatenating all the files into a single file creates a valid JSON.
>
> Regards,
> Gourav
>
> On Tue, Apr 26, 2022 at 3:44 PM Sid <fl...@gmail.com> wrote:
>
>> Hello,
>>
>> Can somebody help me with the below problem?
>>
>>
>> https://stackoverflow.com/questions/72015557/dealing-with-large-number-of-small-json-files-using-pyspark
>>
>>
>> Thanks,
>> Sid
>>
>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297

Re: Dealing with large number of small files

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,

What version of Spark are you using? And where is the data stored?

I am not quite sure that just using a bash script will help, because
concatenating all the files into a single file does not necessarily produce
valid JSON.

Regards,
Gourav

On Tue, Apr 26, 2022 at 3:44 PM Sid <fl...@gmail.com> wrote:

> Hello,
>
> Can somebody help me with the below problem?
>
>
> https://stackoverflow.com/questions/72015557/dealing-with-large-number-of-small-json-files-using-pyspark
>
>
> Thanks,
> Sid
>

Re: Dealing with large number of small files

Posted by Artemis User <ar...@dtechspace.com>.
Most likely your JSON files are not formatted correctly. Please see the
Spark doc on the specific formatting requirements for JSON data:

https://spark.apache.org/docs/latest/sql-data-sources-json.html
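One requirement worth calling out: by default Spark's JSON source expects JSON Lines, i.e. each line is a complete, self-contained JSON object; pretty-printed multi-line documents need the `multiLine` read option instead. A quick standard-library illustration with hypothetical data:

```python
import json

# JSON Lines: one self-contained object per line, Spark's default expectation
good = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'
records = [json.loads(line) for line in good.splitlines() if line.strip()]

# a pretty-printed document spans several lines, so a per-line parse fails;
# files like this need spark.read.option("multiLine", True).json(...)
pretty = '{\n  "id": 1,\n  "name": "a"\n}\n'
try:
    [json.loads(line) for line in pretty.splitlines() if line.strip()]
    line_delimited = True
except json.JSONDecodeError:
    line_delimited = False
```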

On 4/26/22 10:43 AM, Sid wrote:
> Hello,
>
> Can somebody help me with the below problem?
>
> https://stackoverflow.com/questions/72015557/dealing-with-large-number-of-small-json-files-using-pyspark
>
>
> Thanks,
> Sid


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org