Posted to hdfs-user@hadoop.apache.org by Henrik Aagaard Jørgensen <BU...@tmf.kk.dk> on 2014/09/04 09:30:55 UTC

Hadoop and Open Data (CKAN.org).

Dear all,

I'm very new to Hadoop and am still trying to grasp its value and purpose. I hope my question is OK for this mailing list.

I manage the open data platform at our municipality, which uses CKAN.org. It works very well for its purpose of showing data and adding APIs to it.

However, I'm very interested in knowing more about Hadoop and whether it would fit into an (open) data platform, as we are getting more and more data to show and to work with internally at our municipality.

I cannot figure out, though, whether this is the right use for Hadoop or whether it would be "overkill".

Could someone elaborate on this topic?

I've Googled around a lot and looked at various videos online, and Hadoop seems to have its place, also in an open data platform environment.

Best regards,
Henrik

Re: Hadoop and Open Data (CKAN.org).

Posted by Mohan Radhakrishnan <ra...@gmail.com>.
I understand that writing MapReduce (MR) jobs in a programming language is
normally required, but if we are just processing large amounts of data
(machine learning, for example) we could use Pig. I recently processed
0.25 TB on AWS clusters in a reasonably short time. In this case the
development effort is very low.


Thanks,
Mohan


On Thu, Sep 4, 2014 at 6:41 PM, Alec Ten Harmsel <al...@alectenharmsel.com>
wrote:

> I would recommend using Hadoop only if you are ingesting a lot of data
> and you need reasonable performance at scale. I would recommend starting
> with <insert language/tool of choice> to ingest and transform data
> until that process starts taking too long.
>
> For example, one of our researchers at the University of Michigan had to
> process ~150GB of data. Using Python, processing that data took about 45
> minutes, so it was not worth spending extra development time to run it on
> Hadoop. This time will change depending on what you need to do and the
> hardware available, naturally.
>
> So until you need to frequently process large amounts of data, I'd stick
> with something you're already familiar with.
>
> Alec Ten Harmsel

Re: Hadoop and Open Data (CKAN.org).

Posted by Alec Ten Harmsel <al...@alectenharmsel.com>.
I would recommend using Hadoop only if you are ingesting a lot of data
and you need reasonable performance at scale. I would recommend starting
with <insert language/tool of choice> to ingest and transform data
until that process starts taking too long.
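
As a rough illustration of what "ingest and transform" on a single
machine can look like, here is a minimal Python sketch that streams a CSV
file row by row and times the run. It is only an assumption about a
typical workflow, not any script discussed in this thread; the names
transform_rows, normalise and run, and the sample data, are hypothetical.

```python
import csv
import io
import time


def transform_rows(reader, writer, transform):
    """Stream rows from reader to writer, applying transform to each one.

    Only one row is held in memory at a time, so the input can be much
    larger than RAM.
    """
    count = 0
    for row in reader:
        writer.writerow(transform(row))
        count += 1
    return count


def normalise(row):
    # Hypothetical transform: trim whitespace and upper-case the first column.
    cleaned = [cell.strip() for cell in row]
    if cleaned:
        cleaned[0] = cleaned[0].upper()
    return cleaned


def run(source, sink):
    """Transform source into sink and return (row count, elapsed seconds)."""
    start = time.perf_counter()
    rows = transform_rows(
        csv.reader(source), csv.writer(sink, lineterminator="\n"), normalise
    )
    return rows, time.perf_counter() - start


if __name__ == "__main__":
    # In-memory sample data; real use would pass open file objects instead.
    source = io.StringIO(" a1 , 10\n b2 ,20 \n")
    sink = io.StringIO()
    rows, elapsed = run(source, sink)
    print(sink.getvalue(), end="")
    print(f"{rows} rows in {elapsed:.3f}s")
```

Timing each run like this gives a simple signal for when the process
"starts taking too long" and a cluster becomes worth considering.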

For example, one of our researchers at the University of Michigan had to
process ~150GB of data. Using Python, processing that data took about 45
minutes, so it was not worth spending extra development time to run it
on Hadoop. This time will change depending on what you need to do and
the hardware available, naturally.
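
To put those figures in perspective, the arithmetic can be sketched in a
few lines of Python. The 150 GB and 45 minutes come from the example
above; the decimal units (1 GB = 1000 MB), the 1 TB dataset, and the
assumption that run time scales linearly with size are illustrative only.

```python
def throughput_mb_per_s(gigabytes, minutes):
    """Average throughput in MB/s, using decimal units (1 GB = 1000 MB)."""
    return gigabytes * 1000 / (minutes * 60)


def estimated_minutes(gigabytes, mb_per_s):
    """Minutes needed to process `gigabytes` at a steady `mb_per_s`."""
    return gigabytes * 1000 / mb_per_s / 60


if __name__ == "__main__":
    rate = throughput_mb_per_s(150, 45)
    print(f"{rate:.1f} MB/s")  # about 55.6 MB/s
    # Hypothetical 1 TB dataset at the same rate (assumes linear scaling).
    print(f"{estimated_minutes(1000, rate):.0f} minutes")  # about 300 minutes
```

If an estimate like this lands in the range of hours for a job that has
to run frequently, that is roughly the point where the extra development
effort may start to pay off.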

So until you need to frequently process large amounts of data, I'd stick
with something you're already familiar with.

Alec Ten Harmsel


