Posted to dev@oodt.apache.org by Keith Bannister <ke...@csiro.au> on 2017/04/04 01:36:46 UTC

OODT Help

Hi,

I'm trying to work out whether OODT is the right framework for me.

I have a radio astronomy application. The data rate is roughly 12 TB/day, 
and the data format is a custom one with all sorts of metadata flying around 
(including sky direction in lat/long coordinates).

The raw data is pretty huge, and I can't store it on an OODT machine. 
The big disk I have access to won't run OODT.

Basically I want to:

1. Save the metadata of the raw data into an index somewhere.
2. Run some GPU codes over the raw data. The GPU code parameters should 
be set based on the metadata.
3. Save the GPU results in an archive, with even more metadata.
4. Copy the raw data to a remote disk with a long-running bbcp task.
5. Delete the raw data, but keep the GPU results and all the metadata.

I'm having trouble finding the right documentation that describes how I 
can do this. Can you give me a top-level page? (I've looked at the wiki, 
but it's a bit tricky to work out where to start.)

K


-- 
KEITH BANNISTER | Principal Research Engineer
CSIRO Astronomy and Space Science
T +61 2 9372 4295
E keith.bannister@csiro.au

Re: OODT Help

Posted by "Mattmann, Chris A (3010)" <ch...@jpl.nasa.gov>.
Also, this presentation is useful for understanding the OODT Workflow Manager:

https://www.slideshare.net/chrismattmann/wengines-workflows-and-2-years-of-advanced-data-processing-in-apache-oodt 

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Principal Data Scientist, Engineering Administrative Office (3010)
Manager, NSF & Open Source Projects Formulation and Development Offices (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-503
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 


Re: OODT Help

Posted by "Mattmann, Chris A (3010)" <ch...@jpl.nasa.gov>.
Hi Keith,

Thanks for contacting us. Yes, this is precisely the type of thing that
OODT can help you with.

As a start, I would recommend reading this guide that shows you
how to use the algorithm wrapper, CAS-PGE. You can build a workflow
of several of these wrappers to push out your production pipeline:

https://cwiki.apache.org/confluence/display/OODT/CAS-PGE+Learn+by+Example
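
To give you a feel for how your five steps would hang together: a workflow in the
Workflow Manager is just an ordered list of tasks described in XML policy files, and
each algorithm-wrapping task points at its own CAS-PGE config. The sketch below is
illustrative only; the task names are invented for your case, and the exact
namespaces and property names vary a little between OODT versions, so copy them from
the sample policy files that ship with RADIX:

<!-- workflows.xml (sketch): ids and names here are invented for your case -->
<cas:workflows xmlns:cas="http://oodt.jpl.nasa.gov/2.0/cas">
  <workflow id="urn:oodt:RawDataPipeline" name="RawDataPipeline">
    <tasks>
      <task id="urn:oodt:GpuProcessTask"/>  <!-- your step 2: GPU codes -->
      <task id="urn:oodt:BbcpCopyTask"/>    <!-- your step 4: bbcp to remote disk -->
      <task id="urn:oodt:CleanupTask"/>     <!-- your step 5: delete raw data -->
    </tasks>
  </workflow>
</cas:workflows>

<!-- tasks.xml (sketch): each algorithm-wrapping task is a CAS-PGE task instance
     pointing at its own pge config file; the property names differ between OODT
     versions, so take the exact ones from the Learn by Example guide above -->
<cas:tasks xmlns:cas="http://oodt.jpl.nasa.gov/2.0/cas">
  <task id="urn:oodt:GpuProcessTask" name="GpuProcess"
        class="org.apache.oodt.cas.pge.StdPGETaskInstance">
    <configuration>
      <property name="PGETask/Name" value="GpuProcess"/>
      <property name="PGETask/ConfigFilePath" value="[PGE_ROOT]/gpu/pge-config.xml"/>
    </configuration>
  </task>
</cas:tasks>

Your steps 1 and 3 (indexing the raw metadata, archiving the GPU products) aren't
workflow tasks as such; they're handled by the File Manager/crawler on ingest and by
CAS-PGE ingesting its outputs, which is what the metadata links further down are about.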

In addition to the above guide, I would start with installing OODT RADIX, the
quick installer:

https://cwiki.apache.org/confluence/display/OODT/RADiX+Powered+By+OODT

Once RADIX is installed, edit your CAS-PGE algorithm wrappers and write
some config files, then test out your production pipeline. If you run into trouble
with CAS-PGE, here’s an FAQ:

https://cwiki.apache.org/confluence/display/OODT/CAS-PGE+Help+and+Documentation
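
To give you a flavour of those config files: the per-task pge config is where your GPU
command line gets built, and anything in square brackets is filled in from the product
metadata at runtime. A rough sketch only (the command, keys and paths are invented
placeholders for your case; the Learn by Example page has the real schema):

<!-- pge-config.xml sketch for the GPU step; everything in [brackets] is
     replaced from metadata at runtime, and the command/keys are placeholders -->
<pgeConfig>
  <exe dir="[JobDir]" shellType="/bin/sh">
    <cmd>run_gpu_search --lat [SkyLat] --long [SkyLong] [RawDataFile]</cmd>
  </exe>
  <!-- anything written here gets picked up and ingested back into the
       archive as a new product (your step 3) -->
  <output>
    <dir path="[JobDir]/output" createBeforeExe="true"/>
  </output>
  <!-- extra metadata to attach to the outputs -->
  <customMetadata>
    <metadata key="ProcessedBy" val="gpu-pipeline-v1"/>
  </customMetadata>
</pgeConfig>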

If you want to understand more about how metadata flows in the system, you can check
this out:

https://cwiki.apache.org/confluence/display/OODT/Understanding+the+flow+of+Metadata+during+PGE+based+Processing

and this:

https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence
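
Under the hood the metadata is just flat key/value pairs, so your custom format isn't a
problem as long as you can write (or plug in) an extractor that emits something like the
following alongside each raw file. The keys here are invented for illustration; the
keyval layout is the standard CAS metadata format the File Manager ingests:

<!-- example .met file for one raw capture; keys are illustrative only -->
<cas:metadata xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas">
  <keyval>
    <key>ProductType</key>
    <val>RawVoltageCapture</val>
  </keyval>
  <keyval>
    <key>SkyLat</key>
    <val>-45.2</val>
  </keyval>
  <keyval>
    <key>SkyLong</key>
    <val>187.5</val>
  </keyval>
</cas:metadata>

Once ingested, the File Manager keeps a searchable catalog of those keys (your step 1),
and the same keys drive the [bracket] substitution in the pge config above, which is how
your GPU parameters get set from the metadata (your step 2).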

Finally, there are two examples of full-up OODT pipelines/deployments. The first is DRAT, which does
large-scale code license analysis via OODT MapReduce (there is a paper in the GitHub repo you can check out):

http://github.com/chrismattmann/drat/

The second, Big Translate, is a large-scale MapReduce machine translation pipeline:

http://github.com/chrismattmann/bigtranslate/

If we can help more, let us know.

Cheers,
Chris



++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Principal Data Scientist, Engineering Administrative Office (3010)
Manager, NSF & Open Source Projects Formulation and Development Offices (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-503
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 
