%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{diffpriv}

\documentclass[twoside,11pt]{article}

% Any additional packages needed should be included after jmlr2e.
% Note that jmlr2e.sty includes epsfig, amssymb, natbib and graphicx,
% and defines many common macros, such as 'proof' and 'example'.
%
% It also sets the bibliographystyle to plainnat; for more information on
% natbib citation styles, see the natbib documentation, a copy of which
% is archived at http://www.jmlr.org/format/natbib.pdf

\usepackage{jmlr2e}
\usepackage{xspace}

% Definitions of handy macros can go here

\newcommand{\pckPlain}{diffpriv\xspace}
\newcommand{\pck}{\textsf{\pckPlain}\xspace}
\newcommand{\R}{\textsf{R}\xspace}
\newcommand{\term}[1]{\emph{#1}\xspace}
\newcommand{\cf}{\emph{cf.}\xspace}
\newcommand{\eg}{\emph{e.g.},\xspace}
\newcommand{\ie}{\emph{i.e.},\xspace}
\newcommand{\cB}{\ensuremath{\mathcal{B}}\xspace}
\newcommand{\cD}{\ensuremath{\mathcal{D}}\xspace}
\newcommand{\cR}{\ensuremath{\mathcal{R}}\xspace}
\newcommand{\reals}{\ensuremath{\mathbb{R}}\xspace}
\renewcommand{\v}[1]{\ensuremath{\mathbf{#1}}}
\renewcommand{\Pr}[1]{\ensuremath{\mathrm{Pr}\left(#1\right)}}
\newcommand{\Exp}[1]{\ensuremath{\mathbb{E}\left[#1\right]}}
\newcommand{\indic}[1]{\ensuremath{\mathbf{1}\left[#1\right]}}
\newcommand{\code}[1]{\texttt{#1}\xspace}

% Heading arguments are {volume}{year}{pages}{submitted}{published}{author-full-names}

\jmlrheading{18}{2017}{1--5}{7/17}{}{}{Benjamin I. P. Rubinstein and Francesco Ald\`{a}}

% Short headings should be running head and authors last names

\ShortHeadings{\pckPlain: An R Package for Practical Differential Privacy}{Rubinstein and Ald\`{a}}
\firstpageno{1}

\begin{document}
%%\SweaveOpts{concordance=TRUE}

\title{\pckPlain: An R Package for Easy Differential Privacy}

\author{\name Benjamin I. P. Rubinstein \email brubinstein@unimelb.edu.au \\
       \addr School of Computing and Information Systems\\
       The University of Melbourne,
       Parkville, VIC 3010, Australia
       \AND
       \name Francesco Ald\`{a} \email francesco.alda@rub.de \\
       \addr Horst G\"ortz Institute for IT Security and Faculty of Mathematics\\
       Ruhr-Universit\"at Bochum,
       D-44801 Bochum, Germany}

\editor{TBD}

\maketitle

\begin{abstract}%   <- trailing '%' for backward compatibility of .sty file
The \R package \pck provides tools for statistics and machine
learning under differential privacy. Suitable for releasing analyses
on privacy-sensitive data to untrusted third parties, differential
privacy has become the framework of choice for privacy-preserving
learning. \pck delivers: (a) implementations of generic
mechanisms for privatizing non-private target functions, including the
Laplace, Gaussian, exponential and Bernstein mechanisms; (b) a recent
sensitivity sampler due to \citet{rubinstein2017sampler} that empirically
estimates the sensitivity of non-private targets---obviating mathematical
analysis for exact sensitivity bounds needed for most generic mechanisms;
(c) an extensible framework for implementing differentially-private mechanisms.
Together, the components of \pck permit easy high-utility privatization
of complex analyses, learners and even black-box software programs.
\end{abstract}

\begin{keywords}
differential privacy, empirical process theory, R, open-source software
\end{keywords}

\section{Introduction}

<<knitr_options, include=FALSE>>=
library(knitr)
opts_chunk$set(fig.width=12, fig.height=4, fig.path='', comment="#>",
               warning=FALSE, message=FALSE, tidy=FALSE, size="small")
options(width=60)
set.seed(53079239)
# install package if necessary:
#if(!require("qtl")) install.packages("qtl", repos="http://cran.us.r-project.org")
@

Differential privacy~\citep{dwork2006calibrating} has quickly become a
key framework for semantic guarantees of data privacy when releasing
analysis on privacy-sensitive data to untrusted third parties.
A great deal of its popularity is owed to a suite of generic mechanisms
for privatizing non-private target functions of data \ie statistics,
estimation procedures, and learners. Common to these generic mechanisms
is the requirement that the non-private target's sensitivity to dataset
perturbation is known and bounded. In all except the most trivial
analyses, bounding sensitivity is prohibitively involved. This paper
describes the \pck \R package that implements generic
mechanisms for differential privacy, along with our recent sensitivity
sampler that replaces exact sensitivity bounds with empirical
estimates~\citep{rubinstein2017sampler}. As a result, \pck privatizes a wide
range of procedures under random differential privacy~\citep{hall2012random},
automatically without mathematical analysis and in many cases achieving high
utility. \pck is available from \texttt{https://github.com/brubinstein/diffpriv} under an open-source license.

\section{Generic Mechanisms for Differential Privacy}
\label{sec:generic}

Fundamental to differential privacy is a privacy-sensitive
\term{dataset} (or \term{database}) $D\in\cD^n$ on \term{domain} \cD.
In \pck a dataset can be a \texttt{list}, \texttt{matrix},
\texttt{data.frame}, \texttt{vector}. We say that a pair of
databases $D,D'\in\cD^n$ is \term{neighboring} if they differ on
exactly one record. While individual records should not
be revealed, we aim to release aggregate information
on $D$ with a mechanism. A \term{mechanism} $M: \cD^n \to \cR$ is a
random-valued function of databases taking values in a response set \cR;
implemented in \pck as virtual class \code{DPMech}.

\begin{definition}[\citealp{dwork2006calibrating}]
For $\epsilon>0$, mechanism $M: \cD^n \to \cR$ preserves
\term{$\epsilon$-differential privacy} if for all neighboring pairs
$D,D'\in\cD^n$, measurable response sets
$R\subseteq\cR$, $\Pr{M(D)\in R}\leq \exp(\epsilon)\cdot\Pr{M(D')\in R}$.
Alternatively for $\delta\in(0,1)$, relaxed
\term{$(\epsilon,\delta)$-differential privacy} holds if
$\Pr{M(D)\in R}\leq \exp(\epsilon)\cdot\Pr{M(D')\in R}+\delta$.
\end{definition}

Mechanism $M$ preserves differential privacy if its response distributions are
close on neighboring pairs: an adversary cannot determine an
unknown record from $M$ responses, even with knowledge of the other $n-1$
records.
Privacy parameters $\epsilon$ ($\epsilon,\delta$) are encapsulated in
class \code{DPParamsEps} (\code{DPParamsDel} respectively).
Virtual \code{DPMech} generic method
\texttt{releaseResponse(mechanism, privacyParams, X)} takes privacy
parameters along with sensitive dataset. Most generic mechanisms in
differential privacy share a number of properties leveraged by the \pck package
as follows.

\paragraph{Privatizing a target function.} Many mechanisms $M: \cD^n\to\cR$
seek to privatize a non-private \term{target} function $f: \cD^n\to\cB$, with
range \cB often (but not always!) coinciding with \cR. Accordingly
\code{DPMech} objects are initialized with \code{target} slot of type
\code{function}. A mechanism's \code{releaseResponse()} method calls
\code{target} to form privacy-preserving responses.

\paragraph{Normed target range space.} Target $f$'s output space \cB
is typically imbued with a norm, denoted $\|\cdot\|_{\cB}$, needed
for measuring the sensitivity of $f$'s outputs to input perturbation. \pck flexibly
represents this norm within \code{DPMech} objects as described next.

\paragraph{Sensitivity-induced privacy.} Many mechanisms achieve differential
privacy by calibrating randomization to the sensitivity of the target function
$f$. Targets that are relatively insensitive (sensitive) to perturbing
input $D$ to neighboring $D'$ need relatively little (large) response
randomization. On a pair of neighboring databases $D,D'\in\cD^n$ the
\term{sensitivity} of $f$ is measured as $\Delta(D,D')= \|f(D) - f(D')\|_{\cB}$.
\term{Global sensitivity} is the largest such value
$\overline{\Delta}=\sup_{D,D'}\|f(D)=f(D')\|_{\cB}$ over all possible
neighboring pairs.
As we discuss in \citep{rubinstein2017sampler}, a broad class of generic
mechanisms, taking sensitivity $\Delta$ as a parameter, are
\term{sensitivity-induced private}: for any neighboring pair $D,D'\in\cD^n$ if
$\Delta(D,D')\leq\Delta$ then the mechanism $M_{\Delta}$ run with parameter
$\Delta$ achieves
$\Pr{M_{\Delta}(D)\in R}\leq\exp(\epsilon)\cdot\Pr{M_{\Delta}(D')\in R}$
for all $R\subseteq\cR$. When run with $\Delta=\overline{\Delta}$,
this condition holds for all neighboring pairs: $M_{\overline{\Delta}}$
satisfies $\epsilon$-differential privacy. Similarly for
$(\epsilon,\delta)$-differential privacy. This is essentially how differential
privacy is proved for most generic mechanisms, included those in \pck.
\code{DPMech} objects can be initialized with a \code{sensitivity} argument,
stored in an S4 slot of the same name. If the user provides a manually-derived
global sensitivity bound $\overline{\Delta}$, then \code{releaseResponse()}
responses preserve $\epsilon$- or ($\epsilon,\delta$)-DP (depending on
the specific mechanism). We now demonstrate this use case.

\paragraph{Example: Laplace mechanism.}
\pck implements Laplace~\citep{dwork2006calibrating},
Gaussian~\citep{dwork2014algorithmic},
exponential~\citep{mcsherry2007mechanism}, and
Bernstein~\citep{alda2017bernstein} mechanisms as \code{DPMech} subclasses
\code{DPMechLaplace}, \code{DPMechGaussian}, \code{DPMechExponential}, and
\code{DPMechBernstein}. An exponential example is included below.
\code{DPMechLaplace} releases numeric vectors, and requires that \cB be numeric
$\reals^d$ for some
$d$ and adopts the $L_1$ norm (sum of absolutes) for $\|\cdot\|_{\cB}$. The
mechanism releases vectors in $\cR=\cB=\reals^d$ by adding an i.i.d. sample of
$d$ Laplace-distributed random variables with means 0 and scales
$\overline{\Delta}/\epsilon$ to $f(D)$ to achieve $\epsilon$-differential
privacy. \code{DPMechGaussian} also privatizes numeric responses but under
$L_2$-sensitivity and weaker $(\epsilon,\delta)$-DP; \code{DPMechBernstein}
leverages the Laplace mechanism to release multivariate real-valued functions.

We next demonstrate Laplace privatization of the sample mean on
bounded data in $\cD^n=[0,1]^n$, for which \cB dimension \code{dims} $d$ is
one. Global sensitivity is readily bounded as $1/n$: For any
neighboring pair $D,D'\in[0,1]^n$,
$\Delta(D,D')=n^{-1}\left|\sum_{i=1}^n D_i - \sum_{i=1}^n D'_i\right|$.
Since $n-1$ records are identical, and all records are in $[0,1]$, this is
$|D_n - D'_n|/n \leq 1/n$.

<<genericLaplace, include=TRUE, echo=TRUE, results='markup'>>=
library(diffpriv)
f <- function(X) mean(X) ## target function
n <- 100 ## dataset size
mechanism <- DPMechLaplace(target = f, sensitivity = 1/n, dims = 1)
D <- runif(n, min = 0, max = 1) ## the sensitive database in [0,1]^n
pparams <- DPParamsEps(epsilon = 1) ## desired privacy budget
r <- releaseResponse(mechanism, privacyParams = pparams, X = D)
cat("Private response r$response:", r$response,
  "\nNon-private response f(D):  ", f(D))
@

\vspace{-1em}

\section{Sensitivity Sampling for Random Differential Privacy}
\label{sec:sampler}

When \code{target} global sensitivity is supplied as \code{sensitivity} within
\code{DPMech} construction, responses are differentially private.
Global sensitivity is known for \emph{idealizations} of \eg coefficients for
regularized logistic regression~\citep{chaudhuri2009logistic} and the
SVM~\citep{rubinstein2012svm,chaudhuri2011erm}. In complex applications
such as privatizing a software function, however, \code{target}'s global
sensitivity may not be readily available. For such cases, \pck implements the
sensitivity sampler of \citet{rubinstein2017sampler} which forms a
high-probability estimate of \code{target} sensitivity by repeated probing of
sensitivity on random neighboring database pairs, leveraging tools from
empirical process theory. Like sensitivity estimates, resulting privacy holds
with high probability.

\begin{definition}[\citealp{hall2012random}]
A mechanism $M$ preserves \term{$(\epsilon,\gamma)$-random differential privacy}
(with a corresponding form for $\epsilon,\delta,\gamma$) if
$\forall R\subseteq\cR, \Pr{M(D)\in R}\leq\exp(\epsilon)\cdot\Pr{M(D')\in R}$ holds
with probability at least $1-\gamma$ over random database pairs $D,D'$.
\end{definition}

While weaker than $\epsilon$-DP, RDP is arguably more natural than
$(\epsilon,\delta)$-DP: The later safeguards all databases but not
unlikely responses, while RDP protects against all responses but not
pathological databases (as defined by the database sampling distribution). The
sampling distribution can be anything meaningful \eg uniform, a Bayesian prior,
a density from data privately fit by the Bernstein
mechanism~\citep{alda2017bernstein}, etc.

The \code{DPMech} method \code{sensitivitySampler(object, oracle, n, m, gamma)}
requires: a mechanism \code{object}; a function \code{oracle} which outputs
i.i.d. random databases of given size \code{n} which should match the
size of input data supplied later in calls to \code{releaseResponse()}; a
sensitivity sample size \code{m}; and desired privacy confidence \code{gamma}.
Either (but not both) of \code{m}, \code{gamma} can be omitted: the omitted
resource will be optimized automatically. For example \code{m} taken small
(few hundred) reflects limited sampling time; small \code{gamma} (\eg 0.05)
prioritizes privacy. The sensitivity sampler calls appropriate \code{DPMech}
\code{sensitivityNorm()} which implements $\Delta(D,D')$ for the mechanism's
norm and stored \code{target}. New subclasses of \code{DPMech} need only
implement this method in order to take advantage of the sensitivity sampler.
After \code{sensitivitySampler()}, subsequent \code{releaseResponse()} results
have a privacy parameter slot of type \code{DPParamsGam} indicating response
RDP.

\paragraph{Example: Exponential mechanism.} All \code{DPMech} subclasses are
sensitivity-induced private and can be sensitivity sampled; we demonstrate the
exponential mechanism here. Exponential privately optimizes an
application-specific objective (or score, utility) $s(r)$ of candidate
response $r\in\cR$, with response distribution proportional to
$\exp(\epsilon \cdot s(r) / (2\Delta))$. Typically $s$ depends on
input $D$, and so \code{DPMechExponential} is initialized with \code{target}
that takes $D$ and outputs a score function. That is, $\cB=\reals^{\cR}$ is a
real-valued function space on \cR and the class's \code{sensitivityNorm()}
implements the sup-norm (\cf \citealp{rubinstein2017sampler}). In practice,
users supply \code{target} as an \R closure as demonstrated below. Given
global \code{sensitivity} of \code{target}, the mechanism preserves $\epsilon$-DP;
if \code{sensitivitySampler()} estimates sensitivity with some \code{gamma},
then RDP is preserved at confidence $\gamma=$\code{gamma}.

Applying these ideas to find the most frequent a--z character within a
dataset of top-10 computer scientists from Semantic Scholar, subject to privacy of
individuals. The exponential mechanism privately maximizes total frequency. But
without bounded name lengths, this function has unbounded global sensitivity. We
therefore use sensitivity sampler for $(1,0.1)$-RDP, with an oracle that samples
representative U.S. names based on \code{randomNames}.

<<samplerExponential, include=TRUE, echo=TRUE, results='markup'>>=
library(randomNames) ## a package that generates representative random names
oracle <- function(n) randomNames(n)
D <- c("Michael Jordan", "Andrew Ng", "Andrew Zisserman","Christopher Manning",
       "Jitendra Malik", "Geoffrey Hinton", "Scott Shenker",
       "Bernhard Scholkopf", "Jon Kleinberg", "Judea Pearl")
n <- length(D)
f <- function(X) { function(r) sum(r == unlist(base::strsplit(X, ""))) }
rSet <- as.list(letters) ## the response set, letters a--z, must be a list
mechanism <- DPMechExponential(target = f, responseSet = rSet)
mechanism <- sensitivitySampler(mechanism, oracle = oracle, n = n, gamma = 0.1)
pparams <- DPParamsEps(epsilon = 1)
r <- releaseResponse(mechanism, privacyParams = pparams, X = D)
cat("Private response r$response: ", r$response,
  "\nNon-private f(D) maximizer:  ", letters[which.max(sapply(rSet, f(D)))])
@

\vspace{-2em}

%Since mechanisms, notably Laplace and exponential, tend to use the norm $\|\cdot\|_{\cB}$ only for computing sensitivity $\Delta(D,D')$, and tend to assume one particular norm, \pck represents a minimal interface to the norm via \code{DPMech} method \code{sensitivityNorm()}. In implemented generic mechanisms that require a fixed norm, this norm is built into the corresponding method. For (future) user-defined mechanisms, function-dependent norms can also be supplied at initialization by extended the class constructor.

%pressure phenotype, will consider just the \Sexpr{1+2} individuals with

%%<<summary_cross, fig.height=8>>=
%%hist(rnorm(100))
%%@

%\section{R and package versions used}
%
%<<sessionInfo, include=TRUE, echo=TRUE, results='markup'>>=
%sessionInfo()
%@

% Acknowledgements should go at the end, before appendices and references

\acks{B. Rubinstein and F. Ald\`a acknowledge the support
of the Australian Research Council (DE160100584) and the
DFG Research Training Group GRK 1817/1 respectively.}

\vskip 0.2in
\bibliography{diffpriv}

\end{document}