You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@crunch.apache.org by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org> on 2012/09/10 02:50:07 UTC

[jira] [Created] (CRUNCH-60) Splitting the core crunch module

Vinod Kumar Vavilapalli created CRUNCH-60:
---------------------------------------------

             Summary: Splitting the core crunch module
                 Key: CRUNCH-60
                 URL: https://issues.apache.org/jira/browse/CRUNCH-60
             Project: Crunch
          Issue Type: Bug
            Reporter: Vinod Kumar Vavilapalli


It looks like the api is interspersed with the implementation details and libraries/utils a bit. How about:
 - An api module which only has the APIs that users need to code against
  -- Most of org.apache.crunch
  --  org.apache.crunch.types.*
 - A common/lib module
  -- package org.apache.crunch.fn
  -- some stuff like MapFn, FilterFn from org.apache.crunch package
  -- All of org.apache.crunch.lib.* that is not included in the other modules above and below
  -- org.apache.crunch.util
  -- org.apache.crunch.tool
 - A crunch-impl module where the rest of it resides.
  -- All of *impl* packages
  -- org.apache.crunch.hadoop.mapreduce.lib.jobcontrol
  -- org.apache.crunch.hadoop.mapreduce.lib.output
  -- org.apache.crunch.materialize?

Also move org.apache.crunch.test to src/test/java.

Need help on placing org.apache.crunch.io.* correctly.

Note that despite all this, if necessary, we can choose to have a single artifact (jars etc) to avoid users the onus of importing multiple modules.

Thoughts?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CRUNCH-60) Splitting the core crunch module

Posted by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CRUNCH-60?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13453705#comment-13453705 ] 

Vinod Kumar Vavilapalli commented on CRUNCH-60:
-----------------------------------------------

Posting Matthais's response on the dev-list
{quote}
Splitting Crunch is on my agenda, too, but I haven't been able to come
up with a game plan yet (and I needed a break after all the dependency
cleanup work and the HBase split). I think it's a great idea, we should
definitely do it.

Unfortunately, it's a bit complicated because right now there are lots
of cyclic package dependencies (see [1], the picture there shows Crunch's
dependency graph). Splitting stuff into modules is going to require quite
a bit of refactoring because we have to cut dependencies.

I think we should first draw a high-level package diagram (just the top
packages) that shows which package depends on which. As per Robert C.
Martin's SOLID principles, interface packages should not depend on
implementation packages. Then we can assign the existing classes to
packages and refactor if necessary.

As an example, the "io" package looks to me like it should be an
implementation package; I'd move the interfaces (PathTarget, OutputHandler
etc.) to the client API package ("org.apache.crunch" currently) to separate
them from implementations like From, To, and At.

[1] http://blog.mafr.de/2012/08/26/visualizing-package-dependencies/
{quote}
                
> Splitting the core crunch module
> --------------------------------
>
>                 Key: CRUNCH-60
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-60
>             Project: Crunch
>          Issue Type: Bug
>            Reporter: Vinod Kumar Vavilapalli
>
> It looks like the api is interspersed with the implementation details and libraries/utils a bit. How about:
>  - An api module which only has the APIs that users need to code against
>   -- Most of org.apache.crunch
>   --  org.apache.crunch.types.*
>  - A common/lib module
>   -- package org.apache.crunch.fn
>   -- some stuff like MapFn, FilterFn from org.apache.crunch package
>   -- All of org.apache.crunch.lib.* that is not included in the other modules above and below
>   -- org.apache.crunch.util
>   -- org.apache.crunch.tool
>  - A crunch-impl module where the rest of it resides.
>   -- All of *impl* packages
>   -- org.apache.crunch.hadoop.mapreduce.lib.jobcontrol
>   -- org.apache.crunch.hadoop.mapreduce.lib.output
>   -- org.apache.crunch.materialize?
> Also move org.apache.crunch.test to src/test/java.
> Need help on placing org.apache.crunch.io.* correctly.
> Note that despite all this, if necessary, we can choose to have a single artifact (jars etc) to avoid users the onus of importing multiple modules.
> Thoughts?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CRUNCH-60) Splitting the core crunch module

Posted by "Matthias Friedrich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CRUNCH-60?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454107#comment-13454107 ] 

Matthias Friedrich commented on CRUNCH-60:
------------------------------------------

I think we agree on moving base abstractions to a crunch-api Maven module. Let's start with that, with the benchmark being crunch-examples: A crunch client application should only depend on crunch-api, a pipeline implementation, some PTypes, and optionally extension library stuff (for joining and the like). We could have crunch-hadoop which contains our pipeline implementations, Writables and Avros, and basic I/O (text, sequence files).

That would make it
  - crunch-api
  - crunch-hadoop depends on crunch-api
  - crunch-lib depends on crunch-api and (for now) crunch-hadoop

Once we have this high-level order, we can drill down to package level. What do you think?
                
> Splitting the core crunch module
> --------------------------------
>
>                 Key: CRUNCH-60
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-60
>             Project: Crunch
>          Issue Type: Bug
>            Reporter: Vinod Kumar Vavilapalli
>
> It looks like the api is interspersed with the implementation details and libraries/utils a bit. How about:
>  - An api module which only has the APIs that users need to code against
>   -- Most of org.apache.crunch
>   --  org.apache.crunch.types.*
>  - A common/lib module
>   -- package org.apache.crunch.fn
>   -- some stuff like MapFn, FilterFn from org.apache.crunch package
>   -- All of org.apache.crunch.lib.* that is not included in the other modules above and below
>   -- org.apache.crunch.util
>   -- org.apache.crunch.tool
>  - A crunch-impl module where the rest of it resides.
>   -- All of *impl* packages
>   -- org.apache.crunch.hadoop.mapreduce.lib.jobcontrol
>   -- org.apache.crunch.hadoop.mapreduce.lib.output
>   -- org.apache.crunch.materialize?
> Also move org.apache.crunch.test to src/test/java.
> Need help on placing org.apache.crunch.io.* correctly.
> Note that despite all this, if necessary, we can choose to have a single artifact (jars etc) to avoid users the onus of importing multiple modules.
> Thoughts?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CRUNCH-60) Splitting the core crunch module

Posted by "Matthias Friedrich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CRUNCH-60?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthias Friedrich updated CRUNCH-60:
-------------------------------------

    Attachment: refactor-1.png
                refactor-0.png

The patch helped a little to simplify dependencies of the crunch module. The attached files shows the situation before and after the patch. Red lines are dependencies to cut (according to the tool), number annotations indicate the number of references (import statements).
                
> Splitting the core crunch module
> --------------------------------
>
>                 Key: CRUNCH-60
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-60
>             Project: Crunch
>          Issue Type: Bug
>            Reporter: Vinod Kumar Vavilapalli
>         Attachments: CRUNCH-60-Create-crunch-api-package-mafr-v1.patch, refactor-0.png, refactor-1.png
>
>
> It looks like the api is interspersed with the implementation details and libraries/utils a bit. How about:
>  - An api module which only has the APIs that users need to code against
>   -- Most of org.apache.crunch
>   --  org.apache.crunch.types.*
>  - A common/lib module
>   -- package org.apache.crunch.fn
>   -- some stuff like MapFn, FilterFn from org.apache.crunch package
>   -- All of org.apache.crunch.lib.* that is not included in the other modules above and below
>   -- org.apache.crunch.util
>   -- org.apache.crunch.tool
>  - A crunch-impl module where the rest of it resides.
>   -- All of *impl* packages
>   -- org.apache.crunch.hadoop.mapreduce.lib.jobcontrol
>   -- org.apache.crunch.hadoop.mapreduce.lib.output
>   -- org.apache.crunch.materialize?
> Also move org.apache.crunch.test to src/test/java.
> Need help on placing org.apache.crunch.io.* correctly.
> Note that despite all this, if necessary, we can choose to have a single artifact (jars etc) to avoid users the onus of importing multiple modules.
> Thoughts?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CRUNCH-60) Splitting the core crunch module

Posted by "Josh Wills (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CRUNCH-60?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451843#comment-13451843 ] 

Josh Wills commented on CRUNCH-60:
----------------------------------

I think that from a code organization perspective, it makes sense, with nothing depending on -api, and it would be ideal for -impl to be independent of common/lib, although that wouldn't work right now (e.g., mapside joins.) I think the first step in this direction would be adding API methods that would make that independence feasible.

-io could be its own component that is independent of the -impl and common/lib modules, me thinks.

My only strong objection would be putting o.a.crunch.test into src/test/java, since it's designed to be used by clients who are writing unit tests and I'm against having dependencies on test-jar targets. common/lib makes the most sense logically.
                
> Splitting the core crunch module
> --------------------------------
>
>                 Key: CRUNCH-60
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-60
>             Project: Crunch
>          Issue Type: Bug
>            Reporter: Vinod Kumar Vavilapalli
>
> It looks like the api is interspersed with the implementation details and libraries/utils a bit. How about:
>  - An api module which only has the APIs that users need to code against
>   -- Most of org.apache.crunch
>   --  org.apache.crunch.types.*
>  - A common/lib module
>   -- package org.apache.crunch.fn
>   -- some stuff like MapFn, FilterFn from org.apache.crunch package
>   -- All of org.apache.crunch.lib.* that is not included in the other modules above and below
>   -- org.apache.crunch.util
>   -- org.apache.crunch.tool
>  - A crunch-impl module where the rest of it resides.
>   -- All of *impl* packages
>   -- org.apache.crunch.hadoop.mapreduce.lib.jobcontrol
>   -- org.apache.crunch.hadoop.mapreduce.lib.output
>   -- org.apache.crunch.materialize?
> Also move org.apache.crunch.test to src/test/java.
> Need help on placing org.apache.crunch.io.* correctly.
> Note that despite all this, if necessary, we can choose to have a single artifact (jars etc) to avoid users the onus of importing multiple modules.
> Thoughts?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CRUNCH-60) Splitting the core crunch module

Posted by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CRUNCH-60?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457634#comment-13457634 ] 

Vinod Kumar Vavilapalli commented on CRUNCH-60:
-----------------------------------------------

bq. Almost everything ended up in the base package, I don't think there's a way to split it into subpackages without introducing dependency cycles. 
I've observed this too. There are lots of circular dependencies between the API classes themselves, we need to untangle these.
                
> Splitting the core crunch module
> --------------------------------
>
>                 Key: CRUNCH-60
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-60
>             Project: Crunch
>          Issue Type: Bug
>            Reporter: Vinod Kumar Vavilapalli
>         Attachments: CRUNCH-60-Create-crunch-api-package-mafr-v1.patch, refactor-0.png, refactor-1.png
>
>
> It looks like the api is interspersed with the implementation details and libraries/utils a bit. How about:
>  - An api module which only has the APIs that users need to code against
>   -- Most of org.apache.crunch
>   --  org.apache.crunch.types.*
>  - A common/lib module
>   -- package org.apache.crunch.fn
>   -- some stuff like MapFn, FilterFn from org.apache.crunch package
>   -- All of org.apache.crunch.lib.* that is not included in the other modules above and below
>   -- org.apache.crunch.util
>   -- org.apache.crunch.tool
>  - A crunch-impl module where the rest of it resides.
>   -- All of *impl* packages
>   -- org.apache.crunch.hadoop.mapreduce.lib.jobcontrol
>   -- org.apache.crunch.hadoop.mapreduce.lib.output
>   -- org.apache.crunch.materialize?
> Also move org.apache.crunch.test to src/test/java.
> Need help on placing org.apache.crunch.io.* correctly.
> Note that despite all this, if necessary, we can choose to have a single artifact (jars etc) to avoid users the onus of importing multiple modules.
> Thoughts?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CRUNCH-60) Splitting the core crunch module

Posted by "Matthias Friedrich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CRUNCH-60?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthias Friedrich updated CRUNCH-60:
-------------------------------------

    Attachment: CRUNCH-60-Create-crunch-api-package-mafr-v1.patch

Patch (draft!) to move client-facing core abstractions out of the crunch module into a new crunch-api module. Almost everything ended up in the base package, I don't think there's a way to split it into subpackages without introducing dependency cycles.

Open issues: I'm not quite sure what to do with the test package. There's also a lot of stuff in CombineFn; I think we should move that somewhere else because nothing in crunch-api needs it.


                
> Splitting the core crunch module
> --------------------------------
>
>                 Key: CRUNCH-60
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-60
>             Project: Crunch
>          Issue Type: Bug
>            Reporter: Vinod Kumar Vavilapalli
>         Attachments: CRUNCH-60-Create-crunch-api-package-mafr-v1.patch
>
>
> It looks like the api is interspersed with the implementation details and libraries/utils a bit. How about:
>  - An api module which only has the APIs that users need to code against
>   -- Most of org.apache.crunch
>   --  org.apache.crunch.types.*
>  - A common/lib module
>   -- package org.apache.crunch.fn
>   -- some stuff like MapFn, FilterFn from org.apache.crunch package
>   -- All of org.apache.crunch.lib.* that is not included in the other modules above and below
>   -- org.apache.crunch.util
>   -- org.apache.crunch.tool
>  - A crunch-impl module where the rest of it resides.
>   -- All of *impl* packages
>   -- org.apache.crunch.hadoop.mapreduce.lib.jobcontrol
>   -- org.apache.crunch.hadoop.mapreduce.lib.output
>   -- org.apache.crunch.materialize?
> Also move org.apache.crunch.test to src/test/java.
> Need help on placing org.apache.crunch.io.* correctly.
> Note that despite all this, if necessary, we can choose to have a single artifact (jars etc) to avoid users the onus of importing multiple modules.
> Thoughts?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CRUNCH-60) Splitting the core crunch module

Posted by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CRUNCH-60?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13453717#comment-13453717 ] 

Vinod Kumar Vavilapalli commented on CRUNCH-60:
-----------------------------------------------

bq.  My only strong objection would be putting o.a.crunch.test into src/test/java, since it's designed to be used by clients who are writing unit tests and I'm against having dependencies on test-jar targets. common/lib makes the most sense logically.
Not sure why you don't like test-jar targets, but sure we can put them in common/lib.

bq. Unfortunately, it's a bit complicated because right now there are lots of cyclic package dependencies (see [1], the picture there shows Crunch's dependency graph).
Agreed, we should start somewhere and start fixing them one by one.

bq. I think we should first draw a high-level package diagram (just the top packages) that shows which package depends on which. 
My description of the packages above with the dependencies as follows? 
 - crunch-api
 - crunch-lib depends on crunch-api
 - crunch-impl depends on crunch-api and crunch-lib
                
> Splitting the core crunch module
> --------------------------------
>
>                 Key: CRUNCH-60
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-60
>             Project: Crunch
>          Issue Type: Bug
>            Reporter: Vinod Kumar Vavilapalli
>
> It looks like the api is interspersed with the implementation details and libraries/utils a bit. How about:
>  - An api module which only has the APIs that users need to code against
>   -- Most of org.apache.crunch
>   --  org.apache.crunch.types.*
>  - A common/lib module
>   -- package org.apache.crunch.fn
>   -- some stuff like MapFn, FilterFn from org.apache.crunch package
>   -- All of org.apache.crunch.lib.* that is not included in the other modules above and below
>   -- org.apache.crunch.util
>   -- org.apache.crunch.tool
>  - A crunch-impl module where the rest of it resides.
>   -- All of *impl* packages
>   -- org.apache.crunch.hadoop.mapreduce.lib.jobcontrol
>   -- org.apache.crunch.hadoop.mapreduce.lib.output
>   -- org.apache.crunch.materialize?
> Also move org.apache.crunch.test to src/test/java.
> Need help on placing org.apache.crunch.io.* correctly.
> Note that despite all this, if necessary, we can choose to have a single artifact (jars etc) to avoid users the onus of importing multiple modules.
> Thoughts?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira