Posted to user@hive.apache.org by xiaohe lan <zo...@gmail.com> on 2015/04/02 07:26:45 UTC
Dataset for hive
Hi All,
I am new to Hive. I just set up a five-node Hadoop environment and want to
try out HiveQL.
Is there any dataset I can download to play with HiveQL? The dataset should
have several tables so that I can write some complex joins. About 100 GB
should be fine.
Thanks,
Xiaohe
Re: Dataset for hive
Posted by venkatanathen kannan <ve...@yahoo.com>.
Hi Gopal & Xiaohe,
Thanks for sharing.
Thanks,
VK
On Wednesday, April 15, 2015 9:23 AM, xiaohe lan <zo...@gmail.com> wrote:
> I just have time to generate the data a few minutes ago. It can generate 100G data for me in tens of minutes on my 5 nodes cluster.
> Thanks all for helping me.
> Regards,
> Xiaohe
>
> On Fri, Apr 3, 2015 at 9:00 PM, Fabio C. <an...@gmail.com> wrote:
>> Thanks Gopal, but since it was a while ago and I didn't have to generate too much data I just run the tpc-ds generator binaries in parallel and uploaded it manually. Anyway if you want to have a look at the error: http://hortonworks.com/community/forums/topic/hive-testbench-error/
>> Maybe it's trivial and it can help someone else.
>> Regards
>> Fabio
>>
>> On Thu, Apr 2, 2015 at 7:20 PM, Gopal Vijayaraghavan <go...@apache.org> wrote:
>>> > https://github.com/hortonworks/hive-testbench
>>> >
>>> > The official procedure to generate and upload the data has never worked
>>> > for me (and it looks like it's not a supported software), so it could be
>>> > a bit tricky to do it manually and on a single host.
>>> I wrote the MapReduce jobs for that (tpcds-gen/tpch-gen) after waiting a
>>> whole weekend for 1Tb of data to be generated on a single machine.
>>> If you or anyone else has issues with it, I can take a look at it.
>>> Cheers,
>>> Gopal
Re: Dataset for hive
Posted by xiaohe lan <zo...@gmail.com>.
I only found time to generate the data a few minutes ago. It generated
100 GB of data for me in tens of minutes on my five-node cluster.
Thanks, all, for helping me.
Regards,
Xiaohe
Re: Dataset for hive
Posted by "Fabio C." <an...@gmail.com>.
Thanks Gopal, but since it was a while ago and I didn't have to generate
much data, I just ran the TPC-DS generator binaries in parallel and
uploaded the data manually. Anyway, if you want to have a look at the error:
http://hortonworks.com/community/forums/topic/hive-testbench-error/
Maybe it's trivial, and it may help someone else.
Regards
Fabio
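Editor's sketch of the manual approach described above (running the TPC-DS generator in parallel chunks). This is not the poster's actual procedure. It assumes the generator binary is the TPC-DS toolkit's dsdgen and that it is driven with its -scale, -dir, -parallel and -child options; the function name, binary path, and output directory are hypothetical. The code only builds the command lines and does not run anything.

```python
import shlex


def dsdgen_commands(scale_gb, workers, out_dir, binary="./dsdgen"):
    """Build one dsdgen command line per parallel chunk (nothing is executed).

    Assumption: `binary` points at the TPC-DS toolkit's dsdgen; the path and
    the output directory used below are placeholders.
    """
    base = [binary, "-scale", str(scale_gb), "-dir", out_dir]
    if workers < 2:
        # A single worker generates everything; no chunking flags are passed.
        return [base]
    return [
        base + ["-parallel", str(workers), "-child", str(child)]
        for child in range(1, workers + 1)  # child numbers are 1-based
    ]


if __name__ == "__main__":
    # Print the command for each chunk so they can be launched side by side.
    for cmd in dsdgen_commands(100, 4, "/tmp/tpcds"):
        print(shlex.join(cmd))
```

Each printed command could be started as its own process, on one host or spread over several. The generated flat files would then still have to be copied into HDFS by hand, for example with hdfs dfs -put, before Hive tables can be defined over them.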
Re: Dataset for hive
Posted by Gopal Vijayaraghavan <go...@apache.org>.
> https://github.com/hortonworks/hive-testbench
>
> The official procedure to generate and upload the data has never worked
> for me (and it looks like it's not a supported software), so it could be
> a bit tricky to do it manually and on a single host.
I wrote the MapReduce jobs for that (tpcds-gen/tpch-gen) after waiting a
whole weekend for 1 TB of data to be generated on a single machine.
If you or anyone else has issues with it, I can take a look at it.
Cheers,
Gopal
Re: Dataset for hive
Posted by "Fabio C." <an...@gmail.com>.
https://github.com/hortonworks/hive-testbench
The official procedure to generate and upload the data has never worked for
me (and it looks like it's not a supported software), so it could be a bit
tricky to do it manually and on a single host. The good part is that it
already comes with several queries, and you can set the size of the data
you want to generate.
On Thu, Apr 2, 2015 at 8:29 AM, xiaohe lan <zo...@gmail.com> wrote:
> Hi Vivek Veeramani,
>
> Actually, I already have that. But with the wiki dataset, I can only do
> "select *" queries.
>
> Thanks,
> Xiaohe
>
> On Thu, Apr 2, 2015 at 1:44 PM, vivek veeramani <vivek.veeramani87@gmail.com> wrote:
>
>> Hi Xiaohe,
>>
>> If it's data set that you're looking for, you can find wikipedia data
>> dumps @ http://dumps.wikimedia.org/enwiki/. Also documentation on the
>> dumps @ http://meta.wikimedia.org/wiki/Data_dumps.
>>
>> Hope this helps..
>>
>>
>> On Thu, Apr 2, 2015 at 10:56 AM, xiaohe lan <zo...@gmail.com>
>> wrote:
>>
>>> Hi All,
>>>
>>> I am new to Hive. Just set up a 5 nodes Hadoop environment and want to
>>> have a try on HiveQL.
>>> Is there any dataset I can download to play HiveQL. The dataset should
>>> have several tables some I can write some complex join. About 100G should
>>> be fine.
>>>
>>> Thanks,
>>> Xiaohe
>>>
>>
>>
>>
>> --
>> Thanks ,
>> Vivek Veeramani
>>
>>
>> cell : +91-9632 975 975
>> +91-9895 277 101
>>
>
>
Re: Dataset for hive
Posted by xiaohe lan <zo...@gmail.com>.
Hi Vivek Veeramani,
Actually, I already have that. But with the wiki dataset, I can only do
"select *" queries.
Thanks,
Xiaohe
Re: Dataset for hive
Posted by vivek veeramani <vi...@gmail.com>.
Hi Xiaohe,
If it's a dataset that you're looking for, you can find Wikipedia data dumps
at http://dumps.wikimedia.org/enwiki/. Documentation on the dumps is at
http://meta.wikimedia.org/wiki/Data_dumps.
Hope this helps.
--
Thanks,
Vivek Veeramani
cell : +91-9632 975 975
+91-9895 277 101