Posted to commits@uima.apache.org by ea...@apache.org on 2013/11/07 20:19:30 UTC

svn commit: r1539770 - in /uima/site/trunk/uima-website: docs/doc-uimaducc-whatitam.html docs/images/getting-started/jobmodel.odp xdocs/doc-uimaducc-whatitam.xml

Author: eae
Date: Thu Nov  7 19:19:30 2013
New Revision: 1539770

URL: http://svn.apache.org/r1539770
Log:
UIMA-3407 getting started: contrast with Hadoop; add figure 1 source

Added:
    uima/site/trunk/uima-website/docs/images/getting-started/jobmodel.odp   (with props)
Modified:
    uima/site/trunk/uima-website/docs/doc-uimaducc-whatitam.html
    uima/site/trunk/uima-website/xdocs/doc-uimaducc-whatitam.xml

Modified: uima/site/trunk/uima-website/docs/doc-uimaducc-whatitam.html
URL: http://svn.apache.org/viewvc/uima/site/trunk/uima-website/docs/doc-uimaducc-whatitam.html?rev=1539770&r1=1539769&r2=1539770&view=diff
==============================================================================
--- uima/site/trunk/uima-website/docs/doc-uimaducc-whatitam.html (original)
+++ uima/site/trunk/uima-website/docs/doc-uimaducc-whatitam.html Thu Nov  7 19:19:30 2013
@@ -209,12 +209,12 @@ for high throughput collection processin
 real-time applications.
 Building on UIMA-AS, DUCC is particularly well suited to run large memory Java 
 analytics in multiple threads in order to fully utilize multicore machines.
-DUCC manages the life cycle of all processes deployed across the cluster.
-Non-UIMA processes such as tomcat servers or VNC sessions can also be managed.
+DUCC manages the life cycle of all processes deployed across the cluster, including
+non-UIMA processes such as Tomcat servers or VNC sessions.
    </p>
                                                 <p>
-DUCC has an extensive web interface providing details on all user activity
-across the cluster. Because DUCC is built for UIMA-based analytics from the
+DUCC has an extensive web interface providing details on all user activity.
+Because DUCC is built for UIMA-based analytics from the
 ground up, it automatically makes available such details as which annotators are
 currently initializing as well as the timing breakdown for each primitive annotator 
 in a pipeline.
@@ -237,8 +237,9 @@ data are stored in user filesystem space
 user to decide how to manage the metadata associated with work submitted to DUCC.
    </p>
                                                 <p>
-The following sections will describe each of the three type of DUCC managed processes:
-collection processing jobs, services and arbitary processes.
+The following sections describe each of the three types of DUCC-managed processes
+(collection processing jobs, services, and arbitrary processes) and contrast
+DUCC with Hadoop for scaling out UIMA applications.
    </p>
                             </blockquote>
         </td></tr>
@@ -325,12 +326,6 @@ DUCC has a default pinger for UIMA-AS se
 CUSTOM services must register a pinger class.
    </p>
                                                 <p>
-Each service instance runs in a cgroup, and any spawned subprocesses will also run in the same container.
-When a UIMA-AS service instance is stopped DUCC will initiate a quiesce and then 60 seconds later
-do hard kill. CUSTOM services will first receive a SIGTERM and then 60 seconds later SIGKILL.
-All spawned processes are terminated when the container is removed.
-   </p>
-                                                <p>
 Services are tracked on the Services page of the DUCC webserver.
    </p>
                             </blockquote>
@@ -350,14 +345,15 @@ Services are tracked on the Services pag
         <blockquote class="subsectionBody">
                                     <p>
 DUCC can be used to run an arbitrary process on a DUCC worker node. 
-Resources are allocated according to the memory size and scheduling class requested. 
-The process is run in a cgroup.
+Resources are allocated according to the memory size and scheduling class requested,
+and the process is run in a cgroup.
 The allocated resource is freed when the process terminates.
    </p>
                                                 <p>
-A command line script, viaducc, that can be used to launch processes on a DUCC worker node.
-With a symlink named java-viaducc-&gt;$DUCC_HOME/bin/viaducc placed in $JAVA_HOME/bin, viaducc can
-be used to launch arbitrary processes directly from eclipse onto DUCC worker nodes.
+A command-line script, viaducc, can be used to launch processes on a DUCC worker node.
+With a symlink named "java-viaducc" pointing at $DUCC_HOME/bin/viaducc, Java commands
+can be run remotely from the command line. If java-viaducc is placed in $JAVA_HOME/bin,
+Eclipse can be configured to launch processes onto remote machines.
    </p>
                                                 <p>
 Arbitrary processes are tracked on the Reservations page of the DUCC webserver.
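The symlink wiring described in the paragraph above can be sketched as follows. This is an illustrative sandbox only: the stub viaducc script, the sandbox paths standing in for $DUCC_HOME and $JAVA_HOME, and the com.example.Main class are all hypothetical, not the real DUCC launcher.

```shell
# Sketch of the java-viaducc symlink setup, done in a temporary sandbox.
# A real installation would use the actual $DUCC_HOME and $JAVA_HOME.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/ducc/bin" "$sandbox/jdk/bin"

# Stand-in for $DUCC_HOME/bin/viaducc (the real script forwards the
# command to a DUCC worker node; this stub just reports its arguments):
printf '#!/bin/sh\necho "viaducc: would run on a DUCC worker node: $*"\n' \
    > "$sandbox/ducc/bin/viaducc"
chmod +x "$sandbox/ducc/bin/viaducc"

# The symlink java-viaducc -> $DUCC_HOME/bin/viaducc, placed in $JAVA_HOME/bin:
ln -s "$sandbox/ducc/bin/viaducc" "$sandbox/jdk/bin/java-viaducc"

# Invoking java-viaducc now routes the Java command through viaducc
# (class name is illustrative):
"$sandbox/jdk/bin/java-viaducc" -cp myapp.jar com.example.Main
```

With the symlink on the PATH ahead of the real java binary, tools such as Eclipse that let the user choose a JVM executable can be pointed at java-viaducc and will transparently launch onto a worker node.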
@@ -370,6 +366,67 @@ Arbitrary processes are tracked on the R
        
        
        
+          <a name="Scaling UIMA with DUCC vs Hadoop">
+            <h2>Scaling UIMA with DUCC vs Hadoop
+                        </h2>
+          </a>
+      </td></tr>
+      <tr><td>
+        <blockquote class="subsectionBody">
+                                    <p>
+DUCC offers a number of potential advantages over Hadoop for many UIMA applications.
+     <dl>
+	<dt>Threading</dt>
+	<dd>
+Hadoop mapper processes are intended to have a single analytic thread.
+DUCC is designed to run multiple UIMA pipelines in 
+a single job process,
+allowing sharing of static Java objects and yielding significant RAM savings.
+	</dd> <br />
+	<dt>Application Interface</dt>
+	<dd>
+The application interfaces for a UIMA application continue to be UIMA-standard
+components: CollectionReader, CasConsumer, and CasMultiplier.
+Hadoop requires integrating a new set of interface components.
+	</dd> <br />
+	<dt>Collection Processing Errors</dt>
+	<dd>
+If a mapper fails to handle a single work item in a collection, the entire
+collection must be reprocessed after fixing the mapper problem; there is no
+way to make incremental progress.
+DUCC Jobs are designed to preserve previous results when appropriate.
+	</dd> <br />
+	<dt>Performance Reports</dt>
+	<dd>
+For every job, DUCC automatically provides the performance breakdown for every
+UIMA component.
+	</dd> <br />
+	<dt>Other Workloads</dt>
+	<dd>
+DUCC has support for managing a wide range of processes, including non-UIMA
+processes. For example, DUCC could dynamically start a Hadoop instance
+on a subset of DUCC worker machines.
+	</dd> <br />
+	<dt>Debug Support</dt>
+	<dd>
+DUCC offers tight integration with Eclipse debugging. 
+All or part of a UIMA application can be run in the Eclipse debugger
+by adding a single parameter to the Job submission.
+	</dd>
+     </dl>
+   </p>
+                            </blockquote>
+        </td></tr>
+    </table>
+                                                      <table class="subsectionTable">
+        <tr><td>
+       
+       
+       
           <a name="DUCC - What next?">
             <h2>DUCC - What next?
                         </h2>

Added: uima/site/trunk/uima-website/docs/images/getting-started/jobmodel.odp
URL: http://svn.apache.org/viewvc/uima/site/trunk/uima-website/docs/images/getting-started/jobmodel.odp?rev=1539770&view=auto
==============================================================================
Binary file - no diff available.

Propchange: uima/site/trunk/uima-website/docs/images/getting-started/jobmodel.odp
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Modified: uima/site/trunk/uima-website/xdocs/doc-uimaducc-whatitam.xml
URL: http://svn.apache.org/viewvc/uima/site/trunk/uima-website/xdocs/doc-uimaducc-whatitam.xml?rev=1539770&r1=1539769&r2=1539770&view=diff
==============================================================================
--- uima/site/trunk/uima-website/xdocs/doc-uimaducc-whatitam.xml (original)
+++ uima/site/trunk/uima-website/xdocs/doc-uimaducc-whatitam.xml Thu Nov  7 19:19:30 2013
@@ -42,12 +42,12 @@ for high throughput collection processin
 real-time applications.
 Building on UIMA-AS, DUCC is particularly well suited to run large memory Java 
 analytics in multiple threads in order to fully utilize multicore machines.
-DUCC manages the life cycle of all processes deployed across the cluster.
-Non-UIMA processes such as tomcat servers or VNC sessions can also be managed.
+DUCC manages the life cycle of all processes deployed across the cluster, including
+non-UIMA processes such as Tomcat servers or VNC sessions.
    </p>
    <p>
-DUCC has an extensive web interface providing details on all user activity
-across the cluster. Because DUCC is built for UIMA-based analytics from the
+DUCC has an extensive web interface providing details on all user activity.
+Because DUCC is built for UIMA-based analytics from the
 ground up, it automatically makes available such details as which annotators are
 currently initializing as well as the timing breakdown for each primitive annotator 
 in a pipeline.
@@ -70,8 +70,9 @@ data are stored in user filesystem space
 user to decide how to manage the metadata associated with work submitted to DUCC.
    </p>
    <p>
-The following sections will describe each of the three type of DUCC managed processes:
-collection processing jobs, services and arbitary processes.
+The following sections describe each of the three types of DUCC-managed processes
+(collection processing jobs, services, and arbitrary processes) and contrast
+DUCC with Hadoop for scaling out UIMA applications.
    </p>
    </subsection>
   
@@ -134,12 +135,6 @@ DUCC has a default pinger for UIMA-AS se
 CUSTOM services must register a pinger class.
    </p> 
    <p>
-Each service instance runs in a cgroup, and any spawned subprocesses will also run in the same container.
-When a UIMA-AS service instance is stopped DUCC will initiate a quiesce and then 60 seconds later
-do hard kill. CUSTOM services will first receive a SIGTERM and then 60 seconds later SIGKILL.
-All spawned processes are terminated when the container is removed.
-   </p>
-   <p>
 Services are tracked on the Services page of the DUCC webserver.
    </p>
   </subsection>
@@ -147,20 +142,70 @@ Services are tracked on the Services pag
    <subsection name="DUCC Arbitrary Processes">
    <p>
 DUCC can be used to run an arbitrary process on a DUCC worker node. 
-Resources are allocated according to the memory size and scheduling class requested. 
-The process is run in a cgroup.
+Resources are allocated according to the memory size and scheduling class requested,
+and the process is run in a cgroup.
 The allocated resource is freed when the process terminates.
    </p> 
    <p>
-A command line script, viaducc, that can be used to launch processes on a DUCC worker node.
-With a symlink named java-viaducc->$DUCC_HOME/bin/viaducc placed in $JAVA_HOME/bin, viaducc can
-be used to launch arbitrary processes directly from eclipse onto DUCC worker nodes.
+A command-line script, viaducc, can be used to launch processes on a DUCC worker node.
+With a symlink named "java-viaducc" pointing at $DUCC_HOME/bin/viaducc, Java commands
+can be run remotely from the command line. If java-viaducc is placed in $JAVA_HOME/bin,
+Eclipse can be configured to launch processes onto remote machines.
    </p>
    <p>
 Arbitrary processes are tracked on the Reservations page of the DUCC webserver.
    </p>
   </subsection>
   
+   <subsection name="Scaling UIMA with DUCC vs Hadoop">
+   <p>
+DUCC offers a number of potential advantages over Hadoop for many UIMA applications.
+     <dl>
+	<dt>Threading</dt>
+	<dd>
+Hadoop mapper processes are intended to have a single analytic thread.
+DUCC is designed to run multiple UIMA pipelines in 
+a single job process,
+allowing sharing of static Java objects and yielding significant RAM savings.
+	</dd> <br></br>
+	<dt>Application Interface</dt>
+	<dd>
+The application interfaces for a UIMA application continue to be UIMA-standard
+components: CollectionReader, CasConsumer, and CasMultiplier.
+Hadoop requires integrating a new set of interface components.
+	</dd> <br></br>
+	<dt>Collection Processing Errors</dt>
+	<dd>
+If a mapper fails to handle a single work item in a collection, the entire
+collection must be reprocessed after fixing the mapper problem; there is no
+way to make incremental progress.
+DUCC Jobs are designed to preserve previous results when appropriate.
+	</dd> <br></br>
+	<dt>Performance Reports</dt>
+	<dd>
+For every job, DUCC automatically provides the performance breakdown for every
+UIMA component.
+	</dd> <br></br>
+	<dt>Other Workloads</dt>
+	<dd>
+DUCC has support for managing a wide range of processes, including non-UIMA
+processes. For example, DUCC could dynamically start a Hadoop instance
+on a subset of DUCC worker machines.
+	</dd> <br></br>
+	<dt>Debug Support</dt>
+	<dd>
+DUCC offers tight integration with Eclipse debugging. 
+All or part of a UIMA application can be run in the Eclipse debugger
+by adding a single parameter to the Job submission.
+	</dd>
+     </dl>
+   </p>
+  </subsection>
+  
    <subsection name="DUCC - What next?">
    <p>
 Go to ??? and see the full documentation.