Posted to mapreduce-user@hadoop.apache.org by John Lilley <jo...@redpoint.net> on 2013/07/02 18:04:30 UTC

typical JSON data sets

I would like to hear your experiences working with large JSON data sets, specifically:

1)      How large is each JSON document?

2)      Do they tend to be a single JSON doc per file, or multiples per file?

3)      Do the JSON schemas change over time?

4)      Are there interesting public data sets you would recommend for experimentation?
Thanks
John


Re: typical JSON data sets

Posted by Lenin Raj <em...@gmail.com>.
Hi John,

I have just started pulling Twitter conversations using Apache Flume, but I
have not started processing the pulled data yet. My answers are below:

1)      How large is each JSON document?

The files average from 100 KB to 2 MB. Flume rolls a new file every minute
(the interval is configurable), so the size depends on the number of events
that occurred during that interval.
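
For reference, the roll behavior described here is controlled by the roll
properties of Flume's HDFS sink in the agent configuration. A minimal sketch
(the agent/sink names and HDFS path are hypothetical, and the source/channel
wiring is omitted):

    agent.sinks = hdfsSink
    agent.sinks.hdfsSink.type = hdfs
    agent.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/twitter
    # Time-based rolling, in seconds: roll a new file every minute
    agent.sinks.hdfsSink.hdfs.rollInterval = 60
    # A value of 0 disables a roll trigger: turn off size- and
    # event-count-based rolling so only the time interval applies
    agent.sinks.hdfsSink.hdfs.rollSize = 0
    agent.sinks.hdfsSink.hdfs.rollCount = 0

With rollInterval as the only active trigger, file size varies with event
volume, which matches the behavior described above.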

2)      Do they tend to be a single JSON doc per file, or multiples per
file?

Multiples per file - the largest file (3.2 MB) contained about 1,100 JSON docs.
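
One possible way to process such files, assuming the sink writes events as
concatenated or newline-delimited JSON documents and using the Jackson
library (which this thread does not mandate), is to iterate the root-level
documents with a MappingIterator. A rough sketch:

    import java.io.File;
    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.MappingIterator;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class MultiDocReader {
        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            // readValues() iterates over multiple root-level JSON documents
            // in a single file, matching the many-docs-per-file layout above.
            try (MappingIterator<JsonNode> it =
                     mapper.readerFor(JsonNode.class).readValues(new File(args[0]))) {
                int count = 0;
                while (it.hasNext()) {
                    JsonNode doc = it.next();  // one tweet per document
                    count++;
                }
                System.out.println("documents: " + count);
            }
        }
    }

Counting the documents this way is a quick sanity check against the ~1,100
figure above.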

3)      Do the JSON schemas change over time?

Nope, since it's the standard Twitter API.

4)      Are there interesting public data sets you would recommend for
experimentation?

Twitter API


Thanks,
Lenin


On Tue, Jul 2, 2013 at 9:34 PM, John Lilley <jo...@redpoint.net> wrote:

>  I would like to hear your experiences working with large JSON data sets,
> specifically:
>
> 1)      How large is each JSON document?
>
> 2)      Do they tend to be a single JSON doc per file, or multiples
> per file?
>
> 3)      Do the JSON schemas change over time?
>
> 4)      Are there interesting public data sets you would recommend
> for experimentation?
>
> Thanks
>
> John
>
