Posted to dev@openoffice.apache.org by "Dennis E. Hamilton" <or...@apache.org> on 2015/11/22 21:16:49 UTC

RE: [QUESTIONS] How Is Apache OpenOffice Used - Instrumentation

I suppose a course estimate is also a coarse one [;<).

If we need to go beyond analyzing what is revealed in reports to the project, there remains the prospect for instrumentation.

I am not certain that we have the resources to do that.  So this is a thought-experiment.

INSTRUMENTATION

There is no instrumentation of Apache OpenOffice at present.  There is an existing path to doing so.  It already provides a crude measurement, if used.  There are ways that, with adjustment of the software, more useful data could be obtained.  Producing and capturing the information involves development work.  And any collection of such data must be kept anonymous, while recognizing data from the same installation.

Instrumentation can require considerable work and large databases for the captured information.  There might not be sufficient capacity to undertake any degree of instrumentation in the face of higher-priority needs.

The following note is a bit over-engineered.  It is simpler if we do not need to differentiate data sources at all, but that might not get us what we need.

How important is knowing what we could find out about usage patterns this way?

  1. Privacy of Data Collection
     It is possible to instrument the software to collect certain data, such as the numbers and formats of files opened and saved-as since the previous collection of data from a source.  This requires additions to the software to accumulate such information and to the servers receiving the request for capturing the information.
     Some data might need to be longitudinal, with data captured at different times from the same source recognized and combined in some way.  This allows quite different patterns of usage to be distinguished and not lumped together in a single mass, if that becomes important.
     This means that the source of the data must be anonymized in some manner that still allows data from the same copy of Apache OpenOffice to be recognized, but without recording anything that allows the captured data to be traced back to the originating source.
     All of this involves substantial careful development.  The means for prevention of identifying sources must be carefully managed.  It must also be possible to protect the data collection procedure from exploits and denial of service attacks.

  2. Update Checking as a Data Source.  When installed copies of Apache OpenOffice conduct an automatic or manual check for updates, that is a source of information.  Unqualified, it is an indication that an installed copy of the software is being used in some manner.  
     Update checks are only useful, however, if pings estimated to be from the same installation are distinguishable.  The crudest measure is simply the date and time of the latest ping from the same (estimated) source, along with the version of Apache OpenOffice being used.  This could be captured without any modification of the existing software package. 
     To distinguish sources, it may be necessary to keep a database with up to 50-100 million records that identifies information from each source without revealing that source.  The same principle is needed if additional data is provided as part of check-for-update requests from the software.
     Information in the currently-implemented HTTP request can be used to estimate when requests are from the same source.  To preserve anonymity of the source, that information can be transformed into a cryptographic hash that cannot be used to determine the original source but can be used to determine a match with a previously-captured ping.  This is a coarse arrangement.

  3. Specific Instrumentation.  If future releases were modified to collect and report usage data (with appropriate opt-in as part of the configuration set-up), that data could be attached to checks-for-updates when allowed.  To accumulate patterns over time, accumulation of data is best tied to user profiles.  A statistically-unique, cryptographically-random identifier generated when each user profile is initialized can be used to recognize instrumentation from the same profile.  When the data is collected, the identifier is used in making the cryptographic hash in (2) and then discarded.
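
As a rough illustration of (2) and (3) only, here is a minimal Python sketch.  It is not existing Apache OpenOffice code.  Every name in it (SERVER_KEY, new_profile_id, source_token, record_ping, the record fields) is hypothetical.  It also assumes a keyed hash (HMAC-SHA256 with a secret held only by the collecting server) where the note above says just "cryptographic hash"; that substitution is my assumption, made because an unkeyed hash of guessable input could be reversed by trial.

```python
import hashlib
import hmac
import secrets
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical secret known only to the collecting server (an assumption,
# not part of the original note).
SERVER_KEY = secrets.token_bytes(32)


def new_profile_id() -> str:
    """Random 128-bit identifier, created once when a user profile is initialized."""
    return secrets.token_hex(16)


def source_token(profile_id: str, key: bytes = SERVER_KEY) -> str:
    """Keyed hash of the identifier; only this token is stored, never the identifier."""
    return hmac.new(key, profile_id.encode("utf-8"), hashlib.sha256).hexdigest()


@dataclass
class SourceRecord:
    last_seen: str                      # timestamp of the latest ping
    version: str                        # version reported by that ping
    formats_saved: Counter = field(default_factory=Counter)


# Stand-in for the server-side database, keyed by the anonymized token.
records = {}


def record_ping(profile_id, when, version, formats_saved=None):
    """Fold one check-for-update ping into the record for its (anonymized) source."""
    token = source_token(profile_id)
    rec = records.get(token)
    if rec is None:
        rec = records[token] = SourceRecord(when, version)
    rec.last_seen = when
    rec.version = version
    rec.formats_saved.update(formats_saved or {})
    return token  # the profile_id itself is discarded after this call
```

Two pings carrying the same profile identifier land in the same record, so counts accumulate over time, while the stored key is only the token.  Whether the per-format counts shown here are the right data to collect is a separate question.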



> -----Original Message-----
> From: Dennis E. Hamilton [mailto:orcmid@apache.org]
> Sent: Sunday, November 22, 2015 11:50
> To: dev@openoffice.apache.org
> Subject: [QUESTIONS] How Is Apache OpenOffice Used (was Apache
> OpenOffice ODF in the Marketplace ...)
> 
> I have changed the topic because Marketplace is misleading -- the AOO
> Project is not so much a participant in a market system.  Yet it is
> useful to determine who our public community is and what the adopters of
> Apache OpenOffice are doing with it.
> 
> We have the statistics below as a course estimate of the size of the
> active AOO community, our public.
> 
> The original question was, how important is ODF to those adopters?
> 
> That's an answer that is more likely to be found by asking "What are the
> adopters doing with their copies of Apache OpenOffice?  In particular,
> what document formats are they using and to what relative degree?"
> 
> We have no way to know that directly at the moment.
> 
> There is one immediately-available source.
> 
> REPORTS TO US
> 
> What we know the most about what folks are doing with Apache OpenOffice
> comes from what the patterns of complaints are.  These can arise in
> questions to lists dev@ and users@, in filing of Bugzilla reports (or
> commenting on existing ones), and in comments on the Community Forums.
> 
> We can use those to determine more narrowly which users on which
> platforms are reporting and what they are reporting about.  This
> provides evidence of what is found to be important enough to make the
> effort to report.  That is important all by itself.  It is a clue to
> what others may be experiencing and do not choose or know to report.
> 
> A subset of these reports may hinge on particular document formats and
> interchange/interoperability experiences with document formats.  My
> unqualified impression is that interchange via Microsoft Office formats
> will dominate, just as Microsoft Windows users are predominant among the
> population of AOO adopters.  It will be interesting to identify the ODF-
> related matters that also come up and what the balance is.
> 
> It is not easy to analyze this source mechanically but it is possible to
> do some manual "analytics" of various kinds.
> 
> Is this worth doing?
> 
> Of what value would digging this information out at an initial level of
> detail be?
> 
> We could probably look at a couple of months' data for clues and then
> examine a longer period if it seems profitable.
> 
> 
> > -----Original Message-----
> > From: Dennis E. Hamilton [mailto:dennis.hamilton@acm.org]
> > Sent: Sunday, November 8, 2015 22:19
> > To: dev@openoffice.apache.org
> > Subject: [REPORT] Apache OpenOffice ODF in the Marketplace - AOO 4.1.1
> > downloads
> >
> > Here are updates of the downloads for Apache OpenOffice 4.1.1, now that
> > 4.1.2 is being distributed by the mirror system.
> >
> > From Sourceforge,
> >
> > <http://sourceforge.net/projects/openofficeorg.mirror/files/4.1.1/stats/os?dates=2014-08-01+to+2015-11-08>
> >
> > Just shy of 50,000,000 downloads.  This number will be exceeded as older
> > versions will still continue downloading, although at an ever-decreasing
> > rate.
> >
> >    87.7% for Windows
> >     9.0% for Macintosh (0.1% small drop from end of August)
> >     3.3% for everything else, including Linux
> >
> > For the different countries in the same period (53.6 million for all
> > distributions, not just 4.1.1), the breakdown can be found here:
> >
> > <http://sourceforge.net/projects/openofficeorg.mirror/files/stats/map?dates=2014-08-01+to+2015-11-09>.
> >
> > It is cool that there were 3 to Antarctica: 2 for Windows, 1 for
> > Macintosh.
> >
> >  - Dennis
> >
> >
> >
> > > -----Original Message-----
> > > From: Dennis E. Hamilton [mailto:dennis.hamilton@acm.org]
> > > Sent: Wednesday, September 23, 2015 18:38
> > > To: dev@openoffice.apache.org
> > > Subject: RE: [DISCUSS] Apache OpenOffice ODF in the Marketplace -
> > > Downloading
> > >
> [ ... ]
> > > What is more difficult to determine is what folks are actually doing
> > > with Apache OpenOffice.  There may be ways to learn more.
> > >
> > >  - Dennis
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Dennis E. Hamilton [mailto:dennis.hamilton@acm.org]
> > > Sent: Friday, September 4, 2015 20:01
> > > To: dev@openoffice.apache.org
> > > Subject: [DISCUSS] Apache OpenOffice ODF in the Marketplace
> > >
> > > > I had not encountered the topic of "ODF in the market place" with regard
> > > > to status of Apache OpenOffice.  Perhaps I have not been paying
> > > > attention.
> > > >
> > > > I am curious how we might characterize how support for ODF matters to
> > > > Apache OpenOffice users and various institutions that value support for
> > > > ODF in their reliance on Apache OpenOffice and related software.
> > > >
> > > > How can we determine what the influence of ODF is with respect to Apache
> > > > OpenOffice?
> > >
> > > It strikes me there are two parts to this question.
> > >
> > >  1. Who are the users of Apache OpenOffice?
> > >
> > > >  2. What are the ways ODF is (comparatively) significant to those users?
> > >
> > > [ ... ]
> > >
> > > WHO ARE THE USERS?
> > >
> > > > Although there are now over 150 million downloads of Apache OpenOffice,
> > > > that does not tell us how many individual users are involved.
> > > >
> > > > Perhaps the download counts just for AOO 4.1.1 would be a representative
> > > > sample of a particularly-active segment of the user base, even though
> > > > that would be underestimated a couple of ways.  But that, and the
> > > > average weekly rate would be useful as "at least" figures.
> > > >
> > > > The mix of platforms for those downloads is also important, reflecting
> > > > the context in which those installed downloads are used by new users and
> > > > those who are keeping their configurations current.
> > >
> > >
> > > [ ... ]
> > >
> > >
> > > --------------------------------------------------------------------
> -
> > > To unsubscribe, e-mail: dev-unsubscribe@openoffice.apache.org
> > > For additional commands, e-mail: dev-help@openoffice.apache.org


Re: [QUESTIONS] How Is Apache OpenOffice Used - Instrumentation

Posted by Andrea Pescetti <pe...@apache.org>.
Dennis E. Hamilton wrote:
> I am not certain that we have the resources to do that.  So this is a thought-experiment.

It is a thought-experiment, but it is code we (probably) already have.
Usage tracking existed before OpenOffice came to Apache (always opt-in,
never silently enabled by default); it is just disabled now. You can
still find some data on the Wiki. We (probably) don't have the
server-side processing in the code that is in SVN.

I'm not saying that I consider this to be a priority for development. 
But if one is curious, everything is likely available in our source code 
and on the Wiki somewhere.

Regards,
   Andrea.
