Posted to dev@pirk.apache.org by Tim Ellison <t....@gmail.com> on 2016/07/18 17:24:49 UTC

Some code structure thoughts

Breaking out the discussion started by Suneel, I see opportunities for a
bit of judicious refactoring of the codebase.

I'm no expert on Beam etc., so I'll stay out of that decision, other
than to say I agree that it is probably a bad idea to create
submodules too early.

At the moment I'm simply trying to code up the "hello world" of the Pirk
APIs, and I'm struggling.  I'll happily admit that I don't have the
background here to make it simple, but as I wander around the code under
main/java I see
 - core implementation code (e.g. Querier, Response, etc.)
 - performance / test classes (e.g. PaillierBenchmark)
 - examples, drivers, CLIs (e.g. QuerierDriver)
 - providers for Hadoop, standalone, Spark

I'm left somewhat confused about stripping this down to a library of
types that I need to interact with to programmatically use Pirk.  I
think I'm getting far more in the target JAR than I need, and that the
core types are not yet finessed to offer a usable API -- or am I being
unfair?

Does it make sense to move some of this CLI / test material out of the
way?  Should the providers be a bit more pluggable rather than hard-coded?
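
By "pluggable" I'm imagining something very rough like the sketch
below -- the interface and method names are entirely made up on my
part, and Query/Response are just the existing core types:

  public interface ResponderProvider
  {
    // Short name used to select the provider at runtime,
    // e.g. "standalone", "mapreduce", "spark"
    String name();

    // Run the encrypted query over this platform's data source and
    // return the encrypted response.
    Response computeResponse(Query query);
  }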

I will continue with my quest, and will raise usage questions here as I
go if people are prepared to tolerate a PIR newbie!

Regards,
Tim


Re: Some code structure thoughts

Posted by Ellison Anne Williams <ea...@gmail.com>.
Hi,

At the risk of the thread becoming too nested, please find the answers
inline below.


On Wed, Jul 20, 2016 at 6:31 AM, Tim Ellison <t....@gmail.com> wrote:

> On 18/07/16 19:29, Ellison Anne Williams wrote:
> > Good points.
> >
> > Yes, there are currently no examples included in the codebase (other than
> > in a roundabout kind of way via the tests, which doesn't count) and the
> > website doesn't have a step-by-step walkthrough of how to get up and
> > running from a user perspective. We can certainly add these in.
>
> Apologies if I am being a pain here.  I'm just curious about how to use
> Pirk, and come at it with no background -- so I may not be the best
> target user.  I won't be offended if you tell me that it requires a
> smarter bear than me :-)
>


EAW: Don't be silly :) Yes, we can and should have user docs and examples -
I will open two JIRA issues now.



>
> > In terms of what you can look at right now to help get going -- take a
> look
> > at the performQuery() method of the
> > org.apache.pirk.test.distributed.testsuite.DistTestSuite -- it walks you
> > through the basic steps.
>
> Yep, I found that, and it has been useful -- though I am having to step
> through in a debugger to figure it out.
>
> One of the main problems (for me) is that the SystemConfiguration is
> used not simply to set Pirk's *implementation* options as I would have
> expected (such as the defaults of paillier.useGMPForModPow,
> pir.primeCertainty, etc), but it is also used to pass values around
> globally that I would expect to be (only) part of the *usage* API (e.g.
> query.schemas), and unit test data (test.inputJSONFile), and things that
> I would expect to be configuration of Pirk's plug-ins, such as runtime
> values for Hadoop, Elasticsearch, Spark, and ...
>
> It's a real grab-bag of global values.
>
>
EAW: Yes, you nailed it -- it's a grab bag right now (no, not the best
coding practice...). Let's discuss a better model. Any thoughts on this?

Also, we should probably have multiple properties files rather than one
gigantic pirk.properties file. With multiple Responder providers, we should
probably have a separate properties file for each, holding the
provider-specific properties (i.e. specific only to Storm or Spark or
whatever).

Realize too that there are many CLI options for the Responder and Querier
drivers that are not yet in the properties file...
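
As a very rough sketch of the split (file names hypothetical, values
purely illustrative; the keys are ones already in pirk.properties):

  # pirk-core.properties -- implementation defaults for the core library
  paillier.useGMPForModPow=true
  pir.primeCertainty=128

  # pirk-spark.properties -- read only by the Spark responder
  # (Spark-specific runtime settings would live here)

  # pirk-test.properties -- test-only values, kept out of the runtime config
  test.inputJSONFile=...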



> > What are you thinking in terms of making the providers more pluggable?
> > Perhaps Responder/Querier core and Responder/Querier modules/providers?
> >
> > Right now, the 'providers' fall under the algorithm, algorithm ->
> provider
> > -- i.e., org.apache.pirk.responder.wideskies.spark
> > and org.apache.pirk.responder.wideskies.mapreduce under 'wideskies'. This
> > could be changed to provider -> algorithm. Thus, we would have a module
> for
> > spark and then all algorithm implementations for spark would fall under
> the
> > spark module (and leverage the core). Thoughts?
>
> I've not worked my way up to looking at the CLI, Hadoop, Spark
> integration -- I'm still digging through the lower level PIR algorithms
> impl.  They do seem to be kept out of the lower level code, which implies
> they are well factored.
>
> How about pulling the CLI out of the way too, i.e. putting types like
> QuerierDriver and QuerierDriverCLI + friends into their own package
> namespace? [1]
>

EAW: I am in favor of this...


>
> Down at the "lower level", it seems unnatural that types such as
> Querier have the logic for writing themselves to a file using
> serialization and readFromHDFSFile.
>
> I would expect types like that (Query, Querier, Response) to be
> capable of being exchanged in any number of formats.  So if I choose to
> store or transmit it in BSON, or Google Protocol Buffers, or whatever, I
> would not expect to have another set of methods to deal with that on
> each of these classes; and when the version changes these types have to
> deal with backwards compatibility, etc. etc.  So I'd be inclined to move
> the persistence out of these classes.
>
>
EAW: Yes, I agree. The initial use cases were file-based and you see that
reflected in the current code. It needs to evolve to include other
transport formats and mechanisms.



> > Agree with doing some judicious refactoring of the codebase...
> >
> > Thanks!
>
> Hey, it's just my 2c!  I've not lived with this code, I've not even
> written a working example using it, so take what I say with a large
> pinch of salt.  It is certainly not my intention, and I am in no
> position, to critique.
>
> It's an interesting project, and these are stream of consciousness
> thoughts as I wander around finding my bearings.
>
>
EAW: Happy to have stream of consciousness - keep it coming! :)



> [1] Probably not worth considering separate mvn modules, yet.
>
> Regards,
> Tim
>
>

Re: Some code structure thoughts

Posted by Tim Ellison <t....@gmail.com>.
On 18/07/16 19:29, Ellison Anne Williams wrote:
> Good points.
> 
> Yes, there are currently no examples included in the codebase (other than
> in a roundabout kind of way via the tests, which doesn't count) and the
> website doesn't have a step-by-step walkthrough of how to get up and
> running from a user perspective. We can certainly add these in.

Apologies if I am being a pain here.  I'm just curious about how to use
Pirk, and come at it with no background -- so I may not be the best
target user.  I won't be offended if you tell me that it requires a
smarter bear than me :-)

> In terms of what you can look at right now to help get going -- take a look
> at the performQuery() method of the
> org.apache.pirk.test.distributed.testsuite.DistTestSuite -- it walks you
> through the basic steps.

Yep, I found that, and it has been useful -- though I am having to step
through in a debugger to figure it out.

One of the main problems (for me) is that the SystemConfiguration is
used not simply to set Pirk's *implementation* options as I would have
expected (such as the defaults of paillier.useGMPForModPow,
pir.primeCertainty, etc), but it is also used to pass values around
globally that I would expect to be (only) part of the *usage* API (e.g.
query.schemas), and unit test data (test.inputJSONFile), and things that
I would expect to be configuration of Pirk's plug-ins, such as runtime
values for Hadoop, Elasticsearch, Spark, and ...

It's a real grab-bag of global values.
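
Purely to illustrate the kind of separation I'd expect -- this is not
a proposal for the actual API, and the class below is made up -- the
implementation options feel like they belong in a small, typed
settings object handed to the encryption layer, rather than in
globals:

  // Illustrative only -- not Pirk's API.
  public final class EncryptionSettings
  {
    private final boolean useGMPForModPow;  // cf. paillier.useGMPForModPow
    private final int primeCertainty;       // cf. pir.primeCertainty

    public EncryptionSettings(boolean useGMPForModPow, int primeCertainty)
    {
      this.useGMPForModPow = useGMPForModPow;
      this.primeCertainty = primeCertainty;
    }

    public boolean useGMPForModPow() { return useGMPForModPow; }
    public int primeCertainty()      { return primeCertainty; }
  }

...while the usage inputs (query schemas, input files, and so on) get
passed explicitly through the API calls that need them.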

> What are you thinking in terms of making the providers more pluggable?
> Perhaps Responder/Querier core and Responder/Querier modules/providers?
> 
> Right now, the 'providers' fall under the algorithm, algorithm -> provider
> -- i.e., org.apache.pirk.responder.wideskies.spark
> and org.apache.pirk.responder.wideskies.mapreduce under 'wideskies'. This
> could be changed to provider -> algorithm. Thus, we would have a module for
> spark and then all algorithm implementations for spark would fall under the
> spark module (and leverage the core). Thoughts?

I've not worked my way up to looking at the CLI, Hadoop, Spark
integration -- I'm still digging through the lower level PIR algorithms
impl.  They do seem to be kept out of the lower level code, which implies
they are well factored.

How about pulling the CLI out of the way too, i.e. putting types like
QuerierDriver and QuerierDriverCLI + friends into their own package
namespace? [1]
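
Something like the following, purely illustrative -- the package name
is made up, the types are the existing ones:

  org.apache.pirk.cli
      QuerierDriver
      QuerierDriverCLI
      ... and friends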

Down at the "lower level", it seems unnatural that types such as
Querier have the logic for writing themselves to a file using
serialization and readFromHDFSFile.

I would expect types like that (Query, Querier, Response) to be
capable of being exchanged in any number of formats.  So if I choose to
store or transmit it in BSON, or Google Protocol Buffers, or whatever, I
would not expect to have another set of methods to deal with that on
each of these classes; and when the version changes these types have to
deal with backwards compatibility, etc. etc.  So I'd be inclined to move
the persistence out of these classes.
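
As a sketch of what I mean (the names are made up; it's just the shape
of the idea):

  import java.io.IOException;

  // A persistence abstraction that lives outside the core types.
  public interface Storage<T>
  {
    void store(T item) throws IOException;

    T retrieve() throws IOException;
  }

...so a Storage<Querier> could be backed by a local file, HDFS, or any
other transport/format, and Querier itself never needs to know about
files or serialization at all.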

> Agree with doing some judicious refactoring of the codebase...
> 
> Thanks!

Hey, it's just my 2c!  I've not lived with this code, I've not even
written a working example using it, so take what I say with a large
pinch of salt.  It is certainly not my intention, and I am in no
position, to critique.

It's an interesting project, and these are stream of consciousness
thoughts as I wander around finding my bearings.

[1] Probably not worth considering separate mvn modules, yet.

Regards,
Tim


Re: Some code structure thoughts

Posted by Ellison Anne Williams <ea...@gmail.com>.
Hi Tim,

Good points.

Yes, there are currently no examples included in the codebase (other than
in a roundabout kind of way via the tests, which doesn't count) and the
website doesn't have a step-by-step walkthrough of how to get up and
running from a user perspective. We can certainly add these in.

In terms of what you can look at right now to help get going -- take a look
at the performQuery() method of the
org.apache.pirk.test.distributed.testsuite.DistTestSuite -- it walks you
through the basic steps.

What are you thinking in terms of making the providers more pluggable?
Perhaps Responder/Querier core and Responder/Querier modules/providers?

Right now, the 'providers' fall under the algorithm, algorithm -> provider
-- i.e., org.apache.pirk.responder.wideskies.spark
and org.apache.pirk.responder.wideskies.mapreduce under 'wideskies'. This
could be changed to provider -> algorithm. Thus, we would have a module for
spark and then all algorithm implementations for spark would fall under the
spark module (and leverage the core). Thoughts?
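
Roughly, with package names that are illustrative only, that would
look like:

  org.apache.pirk.responder.spark.wideskies       <- Spark module
  org.apache.pirk.responder.mapreduce.wideskies   <- MapReduce module
  org.apache.pirk.responder.standalone.wideskies  <- standalone module

with each provider module depending on the core, and free to add
further algorithm implementations under its own namespace.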

Agree with doing some judicious refactoring of the codebase...

Thanks!

Ellison Anne


On Mon, Jul 18, 2016 at 1:24 PM, Tim Ellison <t....@gmail.com> wrote:

> Breaking out the discussion started by Suneel, I see opportunities for a
> bit of judicious refactoring of the codebase.
>
> I'm no expert on Beam etc., so I'll stay out of that decision, other
> than to say I agree that it is probably a bad idea to create
> submodules too early.
>
> At the moment I'm simply trying to code up the "hello world" of the Pirk
> APIs, and I'm struggling.  I'll happily admit that I don't have the
> background here to make it simple, but as I wander around the code under
> main/java I see
>  - core implementation code (e.g. Querier, Response, etc.)
>  - performance / test classes (e.g. PaillierBenchmark)
>  - examples, drivers, CLIs (e.g. QuerierDriver)
>  - providers for Hadoop, standalone, Spark
>
> I'm left somewhat confused about stripping this down to a library of
> types that I need to interact with to programmatically use Pirk.  I
> think I'm getting far more in the target JAR than I need, and that the
> core types are not yet finessed to offer a usable API -- or am I being
> unfair?
>
> Does it make sense to move some of this CLI / test material out of the
> way?  Should the providers be a bit more pluggable rather than hard-coded?
>
> I will continue with my quest, and will raise usage questions here as I
> go if people are prepared to tolerate a PIR newbie!
>
> Regards,
> Tim
>
>