Posted to user@hive.apache.org by xiaohe lan <zo...@gmail.com> on 2015/04/02 07:26:45 UTC

Dataset for hive

Hi All,

I am new to Hive. I just set up a 5-node Hadoop environment and want to
try out HiveQL.
Is there any dataset I can download to play with HiveQL? The dataset should
have several tables so I can write some complex joins. About 100 GB should
be fine.

Thanks,
Xiaohe

Re: Dataset for hive

Posted by venkatanathen kannan <ve...@yahoo.com>.
Hi Gopal & Xiaohe,

Thanks for sharing.

Thanks,
VK



Re: Dataset for hive

Posted by xiaohe lan <zo...@gmail.com>.
I only found time to generate the data a few minutes ago. It generated
100 GB of data in tens of minutes on my 5-node cluster.

Thanks all for helping me.

Regards,
Xiaohe

On Fri, Apr 3, 2015 at 9:00 PM, Fabio C. <an...@gmail.com> wrote:

> Thanks Gopal, but since it was a while ago and I didn't have to generate
> too much data I just run the tpc-ds generator binaries in parallel and
> uploaded it manually. Anyway if you want to have a look at the error:
> http://hortonworks.com/community/forums/topic/hive-testbench-error/
> Maybe it's trivial and it can help someone else.
>
> Regards
>
> Fabio
>
> On Thu, Apr 2, 2015 at 7:20 PM, Gopal Vijayaraghavan <go...@apache.org>
> wrote:
>
>>
>>
>> > https://github.com/hortonworks/hive-testbench
>> >
>> > The official procedure to generate and upload the data has never worked
>> >for me (and it looks like it's not a supported software), so it could be
>> >a bit tricky to do it manually and on a single host.
>>
>> I wrote the MapReduce jobs for that (tpcds-gen/tpch-gen) after waiting a
>> whole weekend for 1Tb of data to be generated on a single machine.
>>
>> If you or anyone else has issues with it, I can take a look at it.
>>
>> Cheers,
>> Gopal
>>
>>
>>
>

Re: Dataset for hive

Posted by "Fabio C." <an...@gmail.com>.
Thanks Gopal, but since it was a while ago and I didn't have to generate
much data, I just ran the TPC-DS generator binaries in parallel and
uploaded the output manually. Anyway, if you want to have a look at the error:
http://hortonworks.com/community/forums/topic/hive-testbench-error/
Maybe it's trivial, and it can help someone else.
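
For anyone taking the same manual route, a minimal sketch of the parallel
run could look like the following. It is not the hive-testbench procedure.
The binary path, the output directories, and the dsdgen flag names (-SCALE,
-DIR, -PARALLEL, -CHILD) are assumptions to check against the README of
your TPC-DS toolkit version, and the upload step assumes the standard
"hdfs dfs -put" command is on the PATH.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def dsdgen_commands(scale_gb, chunks, out_dir, binary="./dsdgen"):
    """Build one generator command line per chunk.

    The flag names are assumptions about the TPC-DS toolkit's dsdgen;
    verify them against the toolkit documentation before running.
    """
    return [
        [binary, "-SCALE", str(scale_gb), "-DIR", out_dir,
         "-PARALLEL", str(chunks), "-CHILD", str(child)]
        for child in range(1, chunks + 1)
    ]


def upload_command(local_dir, hdfs_dir):
    """Build the command that copies the generated files into HDFS."""
    return ["hdfs", "dfs", "-put", local_dir, hdfs_dir]


def generate_and_upload(scale_gb, chunks, out_dir, hdfs_dir, workers=4):
    """Run all chunks with a bounded number of workers, then upload."""
    commands = dsdgen_commands(scale_gb, chunks, out_dir)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces evaluation so a failing chunk raises here.
        list(pool.map(lambda cmd: subprocess.run(cmd, check=True), commands))
    subprocess.run(upload_command(out_dir, hdfs_dir), check=True)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in dsdgen_commands(100, 4, "/tmp/tpcds"):
        print(" ".join(cmd))
    print(" ".join(upload_command("/tmp/tpcds", "/user/hive/tpcds")))
```

Running the script as-is only prints the commands, which makes it easy to
review them before calling generate_and_upload() on a real host.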

Regards

Fabio

On Thu, Apr 2, 2015 at 7:20 PM, Gopal Vijayaraghavan <go...@apache.org>
wrote:

>
>
> > https://github.com/hortonworks/hive-testbench
> >
> > The official procedure to generate and upload the data has never worked
> >for me (and it looks like it's not a supported software), so it could be
> >a bit tricky to do it manually and on a single host.
>
> I wrote the MapReduce jobs for that (tpcds-gen/tpch-gen) after waiting a
> whole weekend for 1Tb of data to be generated on a single machine.
>
> If you or anyone else has issues with it, I can take a look at it.
>
> Cheers,
> Gopal
>
>
>

Re: Dataset for hive

Posted by Gopal Vijayaraghavan <go...@apache.org>.

> https://github.com/hortonworks/hive-testbench
>
> The official procedure to generate and upload the data has never worked
>for me (and it looks like it's not a supported software), so it could be
>a bit tricky to do it manually and on a single host.

I wrote the MapReduce jobs for that (tpcds-gen/tpch-gen) after waiting a
whole weekend for 1 TB of data to be generated on a single machine.

If you or anyone else has issues with it, I can take a look at it.

Cheers,
Gopal



Re: Dataset for hive

Posted by "Fabio C." <an...@gmail.com>.
https://github.com/hortonworks/hive-testbench
The official procedure to generate and upload the data has never worked for
me (and it doesn't look like supported software), so you may have to do it
manually on a single host, which can be a bit tricky. The upside is that it
already comes with several queries, and you can set the size of the data you
want to generate.

On Thu, Apr 2, 2015 at 8:29 AM, xiaohe lan <zo...@gmail.com> wrote:

> Hi Vivek Veeramani,
>
> Actually, I already have that. But with the wiki dataset, I can only do
> "select *" queries.
>
> Thanks,
> Xiaohe
>
> On Thu, Apr 2, 2015 at 1:44 PM, vivek veeramani <
> vivek.veeramani87@gmail.com> wrote:
>
>> Hi Xiaohe,
>>
>> If it's data set that you're looking for, you can find wikipedia data
>> dumps @ http://dumps.wikimedia.org/enwiki/. Also documentation on the
>> dumps @ http://meta.wikimedia.org/wiki/Data_dumps.
>>
>> Hope this helps..
>>
>>
>> On Thu, Apr 2, 2015 at 10:56 AM, xiaohe lan <zo...@gmail.com>
>> wrote:
>>
>>> Hi All,
>>>
>>> I am new to Hive. Just set up a 5 nodes Hadoop environment and want to
>>> have a try on HiveQL.
>>> Is there any dataset I can download to play HiveQL. The dataset should
>>> have several tables some I can write some complex join. About 100G should
>>> be fine.
>>>
>>> Thanks,
>>> Xiaohe
>>>
>>
>>
>>
>> --
>> Thanks ,
>> Vivek Veeramani
>>
>>
>> cell : +91-9632 975 975
>>         +91-9895 277 101
>>
>
>

Re: Dataset for hive

Posted by xiaohe lan <zo...@gmail.com>.
Hi Vivek Veeramani,

Actually, I already have that. But with the wiki dataset, I can only do
"select *" queries.

Thanks,
Xiaohe

On Thu, Apr 2, 2015 at 1:44 PM, vivek veeramani <vivek.veeramani87@gmail.com> wrote:

> Hi Xiaohe,
>
> If it's data set that you're looking for, you can find wikipedia data
> dumps @ http://dumps.wikimedia.org/enwiki/. Also documentation on the
> dumps @ http://meta.wikimedia.org/wiki/Data_dumps.
>
> Hope this helps..
>
>
> On Thu, Apr 2, 2015 at 10:56 AM, xiaohe lan <zo...@gmail.com>
> wrote:
>
>> Hi All,
>>
>> I am new to Hive. Just set up a 5 nodes Hadoop environment and want to
>> have a try on HiveQL.
>> Is there any dataset I can download to play HiveQL. The dataset should
>> have several tables some I can write some complex join. About 100G should
>> be fine.
>>
>> Thanks,
>> Xiaohe
>>
>
>
>
> --
> Thanks ,
> Vivek Veeramani
>
>
> cell : +91-9632 975 975
>         +91-9895 277 101
>

Re: Dataset for hive

Posted by vivek veeramani <vi...@gmail.com>.
Hi Xiaohe,

If it's a dataset you're looking for, you can find Wikipedia data dumps
at http://dumps.wikimedia.org/enwiki/. There is also documentation on the
dumps at http://meta.wikimedia.org/wiki/Data_dumps.

Hope this helps..


On Thu, Apr 2, 2015 at 10:56 AM, xiaohe lan <zo...@gmail.com> wrote:

> Hi All,
>
> I am new to Hive. Just set up a 5 nodes Hadoop environment and want to
> have a try on HiveQL.
> Is there any dataset I can download to play HiveQL. The dataset should
> have several tables some I can write some complex join. About 100G should
> be fine.
>
> Thanks,
> Xiaohe
>



-- 
Thanks ,
Vivek Veeramani


cell : +91-9632 975 975
        +91-9895 277 101