Posted to common-dev@hadoop.apache.org by "Ari Rabkin (JIRA)" <ji...@apache.org> on 2009/01/13 03:10:59 UTC

[jira] Created: (HADOOP-5018) Chukwa should support pipelined writers

Chukwa should support pipelined writers
---------------------------------------

                 Key: HADOOP-5018
                 URL: https://issues.apache.org/jira/browse/HADOOP-5018
             Project: Hadoop Core
          Issue Type: New Feature
          Components: contrib/chukwa
            Reporter: Ari Rabkin
            Assignee: Ari Rabkin


We ought to support chaining together writers; this will radically increase flexibility and make it practical to add new features without major surgery by putting them in pass-through or filter classes.
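
For illustration, a pass-through stage might look like the following sketch, written against the ChukwaWriter init()/add() contract discussed later in this thread (the setNextStage() wiring method is an assumption, not something from the patch):

import java.util.List;

public class PassThroughWriter implements ChukwaWriter {
  private ChukwaWriter next;   // downstream stage in the chain

  // hypothetical wiring method; the committed patch may wire stages differently
  public void setNextStage(ChukwaWriter next) {
    this.next = next;
  }

  public void init() throws WriterException {
    // a pure pass-through needs no setup
  }

  public void add(List<Chunk> chunks) throws WriterException {
    // a filter stage would inspect, transform, or drop chunks here
    next.add(chunks);   // forward unchanged to the next writer
  }
}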

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-5018) Chukwa should support pipelined writers

Posted by "Ari Rabkin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ari Rabkin updated HADOOP-5018:
-------------------------------

    Status: Patch Available  (was: Open)

This will be very useful for Berkeley, since we want to do near-real-time collection, which we can do in a pipeline stage.

> Chukwa should support pipelined writers
> ---------------------------------------
>
>                 Key: HADOOP-5018
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5018
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/chukwa
>            Reporter: Ari Rabkin
>            Assignee: Ari Rabkin
>         Attachments: pipeline.patch
>
>
> We ought to support chaining together writers; this will radically increase flexibility and make it practical to add new features without major surgery by putting them in pass-through or filter classes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-5018) Chukwa should support pipelined writers

Posted by "Ari Rabkin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ari Rabkin updated HADOOP-5018:
-------------------------------

    Attachment: pipeline3.patch

No idea what was wrong with the previous patch; try this one.

> Chukwa should support pipelined writers
> ---------------------------------------
>
>                 Key: HADOOP-5018
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5018
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/chukwa
>            Reporter: Ari Rabkin
>            Assignee: Ari Rabkin
>         Attachments: pipeline.patch, pipeline2.patch, pipeline3.patch
>
>
> We ought to support chaining together writers; this will radically increase flexibility and make it practical to add new features without major surgery by putting them in pass-through or filter classes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5018) Chukwa should support pipelined writers

Posted by "Eric Yang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663544#action_12663544 ] 

Eric Yang commented on HADOOP-5018:
-----------------------------------

Wouldn't it be better to decouple the pipeline logic from ServletCollector?  It may be better to have an interface between ServletCollector and the pipeline logic.  That way, the pipeline logic can be implemented as synchronized stages or parallel stages for different use cases, e.g. duplicate data filtering or real-time monitoring alerts.
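
A sketch of that decoupling, using the method names that appear elsewhere in this thread (purely illustrative):

import java.util.List;

// The collector sees only this interface; whether the implementation
// runs its stages synchronously or in parallel is hidden behind it.
public interface ChukwaWriter {
  void init() throws WriterException;
  void add(List<Chunk> chunks) throws WriterException;
}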

> Chukwa should support pipelined writers
> ---------------------------------------
>
>                 Key: HADOOP-5018
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5018
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/chukwa
>            Reporter: Ari Rabkin
>            Assignee: Ari Rabkin
>         Attachments: pipeline.patch
>
>
> We ought to support chaining together writers; this will radically increase flexibility and make it practical to add new features without major surgery by putting them in pass-through or filter classes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5018) Chukwa should support pipelined writers

Posted by "Ari Rabkin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663455#action_12663455 ] 

Ari Rabkin commented on HADOOP-5018:
------------------------------------

Both 1 and 2 are worthy goals.  I think that pipelines are a fairly natural way to accomplish both.  I intended to write a pipeline stage for doing subscriptions for real-time delivery; if you're also working on that, it's pretty awesome, and we should open a JIRA. 

I hadn't thought of log-to-local-storage, but it should be easy to write a pipeline stage that stores everything, passes it through, and also has a worker thread that does the write to HDFS.
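
A rough sketch of such a stage, under the same init()/add() contract (the queue handling is illustrative, the HDFS write is only indicated, and the stage wiring is assumed):

import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class BufferingPassThroughWriter implements ChukwaWriter {
  private final BlockingQueue<Chunk> pending = new LinkedBlockingQueue<Chunk>();
  private ChukwaWriter next;   // downstream stage, wiring assumed

  public void init() throws WriterException {
    Thread worker = new Thread(new Runnable() {
      public void run() {
        try {
          while (true) {
            Chunk c = pending.take();   // blocks until a chunk is queued
            // write c to HDFS here, off the collector's request thread
          }
        } catch (InterruptedException e) {
          return;   // shut down when interrupted
        }
      }
    });
    worker.setDaemon(true);   // don't keep the collector alive on shutdown
    worker.start();
  }

  public void add(List<Chunk> chunks) throws WriterException {
    pending.addAll(chunks);   // stash for the background HDFS write
    next.add(chunks);         // pass through to the next stage immediately
  }
}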

What do you mean by removing the Hadoop dependency?  I assume you don't literally mean breaking all dependence on Hadoop-core. But you can already point the SeqFileWriter at a local filesystem; you don't need an HDFS cluster. 

> Chukwa should support pipelined writers
> ---------------------------------------
>
>                 Key: HADOOP-5018
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5018
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/chukwa
>            Reporter: Ari Rabkin
>            Assignee: Ari Rabkin
>         Attachments: pipeline.patch
>
>
> We ought to support chaining together writers; this will radically increase flexibility and make it practical to add new features without major surgery by putting them in pass-through or filter classes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5018) Chukwa should support pipelined writers

Posted by "Jerome Boulon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663552#action_12663552 ] 

Jerome Boulon commented on HADOOP-5018:
---------------------------------------

My point is that the PipelineWriter should be an implementation of the ChukwaWriter interface, and that's really the only thing the collector should be aware of.
So, to be able to do what you want:

1) The collector should instantiate one writer implementation based on its configuration
2) The writer should be able to get the collector configuration from somewhere (current design), or should have an init method with a Configuration parameter
3) The contract from the collector's point of view stays the same: call one method on the writer class, and the result is success if there's no exception

The delta with your implementation is:

- Remove the code starting at     if (conf.get("chukwaCollector.pipeline") != null) ...
- Replace it with something like:

String writerClassName = conf.get("chukwaCollector.writer",
    "org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter");
Class<?> writerClass = conf.getClassByName(writerClassName);
ChukwaWriter writer = (ChukwaWriter) writerClass.newInstance();
writer.init();

- Remove all writer initialization from CollectorStub.java
- Move all the code that creates the pipeline into the init method of a PipelineWriter class, instead of ServletCollector.java

That way the writer interface stays simple, the collector class stays very simple, and nothing prevents anybody from providing their own writer implementation.
So at the end you have:

public class PipelineWriter implements ChukwaWriter
{
  public void init() throws WriterException
  {
    if (conf.get("chukwaCollector.pipeline") != null) {
      String pipeline = conf.get("chukwaCollector.pipeline");
      try {
        String[] classes = pipeline.split(",");
        ArrayList<PipelineStageWriter> stages = new ArrayList<PipelineStageWriter>();
        [...]
  }

  public void add(List<Chunk> chunks) throws WriterException
  {
    // call all PipelineStageWriters in sequence
  }
}
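
With that shape, assembling a pipeline becomes pure configuration; for example (the filter stage class name here is made up):

<property>
  <name>chukwaCollector.writer</name>
  <value>org.apache.hadoop.chukwa.datacollection.writer.PipelineWriter</value>
</property>
<property>
  <name>chukwaCollector.pipeline</name>
  <value>org.example.DupFilterWriter,org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter</value>
</property>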




> Chukwa should support pipelined writers
> ---------------------------------------
>
>                 Key: HADOOP-5018
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5018
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/chukwa
>            Reporter: Ari Rabkin
>            Assignee: Ari Rabkin
>         Attachments: pipeline.patch
>
>
> We ought to support chaining together writers; this will radically increase flexibility and make it practical to add new features without major surgery by putting them in pass-through or filter classes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5018) Chukwa should support pipelined writers

Posted by "Ari Rabkin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663468#action_12663468 ] 

Ari Rabkin commented on HADOOP-5018:
------------------------------------

The collector doesn't require an HDFS system.  You can point it at a local filesystem and it'll work fine.
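
This works because the writer goes through Hadoop's FileSystem abstraction; a minimal standalone illustration of the idea (not Chukwa's actual setup code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class LocalFsDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "file:///");   // local filesystem, no HDFS cluster
    FileSystem fs = FileSystem.get(conf);      // resolves to LocalFileSystem
    System.out.println(fs.getUri());           // prints file:///
  }
}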

> Chukwa should support pipelined writers
> ---------------------------------------
>
>                 Key: HADOOP-5018
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5018
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/chukwa
>            Reporter: Ari Rabkin
>            Assignee: Ari Rabkin
>         Attachments: pipeline.patch
>
>
> We ought to support chaining together writers; this will radically increase flexibility and make it practical to add new features without major surgery by putting them in pass-through or filter classes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5018) Chukwa should support pipelined writers

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12673967#action_12673967 ] 

Hudson commented on HADOOP-5018:
--------------------------------

Integrated in Hadoop-trunk #756 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/756/])
    

> Chukwa should support pipelined writers
> ---------------------------------------
>
>                 Key: HADOOP-5018
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5018
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/chukwa
>            Reporter: Ari Rabkin
>            Assignee: Ari Rabkin
>             Fix For: 0.21.0
>
>         Attachments: pipeline.patch, pipeline2.patch, pipeline3.patch, pipeline4.patch
>
>
> We ought to support chaining together writers; this will radically increase flexibility and make it practical to add new features without major surgery by putting them in pass-through or filter classes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5018) Chukwa should support pipelined writers

Posted by "Jerome Boulon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663464#action_12663464 ] 

Jerome Boulon commented on HADOOP-5018:
---------------------------------------

>>What do you mean by removing the Hadoop dependency? 
The collector should not require an HDFS system, but it can use and/or take advantage of hadoop-core; this will be pipeline-dependent.


> Chukwa should support pipelined writers
> ---------------------------------------
>
>                 Key: HADOOP-5018
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5018
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/chukwa
>            Reporter: Ari Rabkin
>            Assignee: Ari Rabkin
>         Attachments: pipeline.patch
>
>
> We ought to support chaining together writers; this will radically increase flexibility and make it practical to add new features without major surgery by putting them in pass-through or filter classes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-5018) Chukwa should support pipelined writers

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-5018:
----------------------------------

       Resolution: Fixed
    Fix Version/s: 0.21.0
     Hadoop Flags: [Reviewed]
           Status: Resolved  (was: Patch Available)

I committed this. Thanks, Ari

> Chukwa should support pipelined writers
> ---------------------------------------
>
>                 Key: HADOOP-5018
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5018
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/chukwa
>            Reporter: Ari Rabkin
>            Assignee: Ari Rabkin
>             Fix For: 0.21.0
>
>         Attachments: pipeline.patch, pipeline2.patch, pipeline3.patch, pipeline4.patch
>
>
> We ought to support chaining together writers; this will radically increase flexibility and make it practical to add new features without major surgery by putting them in pass-through or filter classes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-5018) Chukwa should support pipelined writers

Posted by "Ari Rabkin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ari Rabkin updated HADOOP-5018:
-------------------------------

    Attachment: pipeline4.patch

Whoops, I see what I did: a local rename broke things. Let's try this one. Tested on trunk, and it seems to work.

> Chukwa should support pipelined writers
> ---------------------------------------
>
>                 Key: HADOOP-5018
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5018
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/chukwa
>            Reporter: Ari Rabkin
>            Assignee: Ari Rabkin
>         Attachments: pipeline.patch, pipeline2.patch, pipeline3.patch, pipeline4.patch
>
>
> We ought to support chaining together writers; this will radically increase flexibility and make it practical to add new features without major surgery by putting them in pass-through or filter classes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-5018) Chukwa should support pipelined writers

Posted by "Ari Rabkin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ari Rabkin updated HADOOP-5018:
-------------------------------

    Attachment: pipeline2.patch

Revised to take Jerome's feedback into account.  Also added some previously missing Apache license headers.

> Chukwa should support pipelined writers
> ---------------------------------------
>
>                 Key: HADOOP-5018
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5018
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/chukwa
>            Reporter: Ari Rabkin
>            Assignee: Ari Rabkin
>         Attachments: pipeline.patch, pipeline2.patch
>
>
> We ought to support chaining together writers; this will radically increase flexibility and make it practical to add new features without major surgery by putting them in pass-through or filter classes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5018) Chukwa should support pipelined writers

Posted by "Eric Yang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665743#action_12665743 ] 

Eric Yang commented on HADOOP-5018:
-----------------------------------

src/contrib/chukwa/src/java/org/apache/hadoop/chukwa/datacollection/writer/PipelineableWriter.java doesn't exist in the public SVN.  pipeline3.patch does not contain the whole file for PipelineableWriter.java.  Please make sure your patch contains PipelineableWriter.java as a whole file.  Thanks

> Chukwa should support pipelined writers
> ---------------------------------------
>
>                 Key: HADOOP-5018
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5018
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/chukwa
>            Reporter: Ari Rabkin
>            Assignee: Ari Rabkin
>         Attachments: pipeline.patch, pipeline2.patch, pipeline3.patch
>
>
> We ought to support chaining together writers; this will radically increase flexibility and make it practical to add new features without major surgery by putting them in pass-through or filter classes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5018) Chukwa should support pipelined writers

Posted by "Ari Rabkin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664994#action_12664994 ] 

Ari Rabkin commented on HADOOP-5018:
------------------------------------

[The issue with the previous patch was that a non-patch SVN change (a mode change) got rolled in as well.  This has been removed, and the latest patch should be good.]

> Chukwa should support pipelined writers
> ---------------------------------------
>
>                 Key: HADOOP-5018
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5018
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/chukwa
>            Reporter: Ari Rabkin
>            Assignee: Ari Rabkin
>         Attachments: pipeline.patch, pipeline2.patch, pipeline3.patch
>
>
> We ought to support chaining together writers; this will radically increase flexibility and make it practical to add new features without major surgery by putting them in pass-through or filter classes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5018) Chukwa should support pipelined writers

Posted by "Eric Yang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12670216#action_12670216 ] 

Eric Yang commented on HADOOP-5018:
-----------------------------------

+1 for the pipeline writer.  pipeline4.patch is the correct patch.

> Chukwa should support pipelined writers
> ---------------------------------------
>
>                 Key: HADOOP-5018
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5018
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/chukwa
>            Reporter: Ari Rabkin
>            Assignee: Ari Rabkin
>         Attachments: pipeline.patch, pipeline2.patch, pipeline3.patch, pipeline4.patch
>
>
> We ought to support chaining together writers; this will radically increase flexibility and make it practical to add new features without major surgery by putting them in pass-through or filter classes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (HADOOP-5018) Chukwa should support pipelined writers

Posted by "Ari Rabkin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663468#action_12663468 ] 

asrabkin edited comment on HADOOP-5018 at 1/13/09 12:46 PM:
--------------------------------------------------------------

The SeqFileWriter *doesn't* require an HDFS system.  You can point it at a local filesystem and it'll work fine.

      was (Author: asrabkin):
    The collector doesn't require an HDFS system.  You can point it at a local filesystem and it'll work fine.
  
> Chukwa should support pipelined writers
> ---------------------------------------
>
>                 Key: HADOOP-5018
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5018
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/chukwa
>            Reporter: Ari Rabkin
>            Assignee: Ari Rabkin
>         Attachments: pipeline.patch
>
>
> We ought to support chaining together writers; this will radically increase flexibility and make it practical to add new features without major surgery by putting them in pass-through or filter classes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5018) Chukwa should support pipelined writers

Posted by "Jerome Boulon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663405#action_12663405 ] 

Jerome Boulon commented on HADOOP-5018:
---------------------------------------

Hi Ari,
I just want to let you know that I'm planning to remove the HDFS dependency:
1) The collector will first write to the local file system, and then 2) the data will be pushed to a pub/sub framework to be used by real-time components.
Later on, the data will be moved to HDFS in a background thread or process.

Why 1 and 2?

1) Because people may want to use Chukwa only to collect their data, without any Hadoop dependency.
2) To be able to easily extend Chukwa just by listening to an event.

The pub/sub framework will allow filtering by dataType and by tags like source/cluster, for example.

I also want to solve the duplicate removal problem for chunks at the collector level.
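
A subscription API along those lines might look like this sketch (entirely hypothetical; no such framework existed in Chukwa at this point):

import java.util.List;

// A real-time component registers interest in chunks matching a filter.
public interface ChunkListener {
  void chunksArrived(List<Chunk> chunks);
}

// Hypothetical usage: filter by dataType and a source/cluster tag.
// bus.subscribe(new ChunkFilter("HadoopLog", "cluster=alpha"), listener);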

> Chukwa should support pipelined writers
> ---------------------------------------
>
>                 Key: HADOOP-5018
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5018
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/chukwa
>            Reporter: Ari Rabkin
>            Assignee: Ari Rabkin
>         Attachments: pipeline.patch
>
>
> We ought to support chaining together writers; this will radically increase flexibility and make it practical to add new features without major surgery by putting them in pass-through or filter classes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-5018) Chukwa should support pipelined writers

Posted by "Ari Rabkin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ari Rabkin updated HADOOP-5018:
-------------------------------

    Attachment: pipeline.patch

Fairly major surgery on the ChukwaWriter and ServletCollector classes in order to support dynamic creation of a writer pipeline.  Adds some test code.
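
The dynamic creation presumably walks a comma-separated class list, along the lines of the fragments quoted elsewhere in this thread; a sketch (error handling reduced to a thrown exception, and details may differ from the committed patch):

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;

public class PipelineBuilder {
  // Instantiate each named stage via reflection, in the configured order.
  static List<PipelineStageWriter> buildPipeline(Configuration conf) throws Exception {
    String pipeline = conf.get("chukwaCollector.pipeline");
    List<PipelineStageWriter> stages = new ArrayList<PipelineStageWriter>();
    for (String name : pipeline.split(",")) {
      Class<?> stageClass = conf.getClassByName(name.trim());
      stages.add((PipelineStageWriter) stageClass.newInstance());
    }
    // each stage would then be wired to the next so add() calls flow down the chain
    return stages;
  }
}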

> Chukwa should support pipelined writers
> ---------------------------------------
>
>                 Key: HADOOP-5018
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5018
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/chukwa
>            Reporter: Ari Rabkin
>            Assignee: Ari Rabkin
>         Attachments: pipeline.patch
>
>
> We ought to support chaining together writers; this will radically increase flexibility and make it practical to add new features without major surgery by putting them in pass-through or filter classes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5018) Chukwa should support pipelined writers

Posted by "Eric Yang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664772#action_12664772 ] 

Eric Yang commented on HADOOP-5018:
-----------------------------------

Does pipeline2.patch depend on pipeline.patch?  I can't get pipeline2.patch to apply by itself.

> Chukwa should support pipelined writers
> ---------------------------------------
>
>                 Key: HADOOP-5018
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5018
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/chukwa
>            Reporter: Ari Rabkin
>            Assignee: Ari Rabkin
>         Attachments: pipeline.patch, pipeline2.patch
>
>
> We ought to support chaining together writers; this will radically increase flexibility and make it practical to add new features without major surgery by putting them in pass-through or filter classes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.