%\VignetteEngine{knitr::knitr} %\VignetteIndexEntry{diffpriv} \documentclass[twoside,11pt]{article} % Any additional packages needed should be included after jmlr2e. % Note that jmlr2e.sty includes epsfig, amssymb, natbib and graphicx, % and defines many common macros, such as 'proof' and 'example'. % % It also sets the bibliographystyle to plainnat; for more information on % natbib citation styles, see the natbib documentation, a copy of which % is archived at http://www.jmlr.org/format/natbib.pdf \usepackage{jmlr2e} \usepackage{xspace} % Definitions of handy macros can go here \newcommand{\pckPlain}{diffpriv\xspace} \newcommand{\pck}{\textsf{\pckPlain}\xspace} \newcommand{\R}{\textsf{R}\xspace} \newcommand{\term}[1]{\emph{#1}\xspace} \newcommand{\cf}{\emph{cf.}\xspace} \newcommand{\eg}{\emph{e.g.},\xspace} \newcommand{\ie}{\emph{i.e.},\xspace} \newcommand{\cB}{\ensuremath{\mathcal{B}}\xspace} \newcommand{\cD}{\ensuremath{\mathcal{D}}\xspace} \newcommand{\cR}{\ensuremath{\mathcal{R}}\xspace} \newcommand{\reals}{\ensuremath{\mathbb{R}}\xspace} \renewcommand{\v}[1]{\ensuremath{\mathbf{#1}}} \renewcommand{\Pr}[1]{\ensuremath{\mathrm{Pr}\left(#1\right)}} \newcommand{\Exp}[1]{\ensuremath{\mathbb{E}\left[#1\right]}} \newcommand{\indic}[1]{\ensuremath{\mathbf{1}\left[#1\right]}} \newcommand{\code}[1]{\texttt{#1}\xspace} % Heading arguments are {volume}{year}{pages}{submitted}{published}{author-full-names} \jmlrheading{18}{2017}{1--5}{7/17}{}{}{Benjamin I. P. Rubinstein and Francesco Ald\`{a}} % Short headings should be running head and authors last names \ShortHeadings{\pckPlain: An R Package for Practical Differential Privacy}{Rubinstein and Ald\`{a}} \firstpageno{1} \begin{document} %%\SweaveOpts{concordance=TRUE} \title{\pckPlain: An R Package for Easy Differential Privacy} \author{\name Benjamin I. P. Rubinstein \email brubinstein@unimelb.edu.au \\ \addr School of Computing and Information Systems\\ The University of Melbourne, Parkville, VIC 3010, Australia \AND \name Francesco Ald\`{a} \email francesco.alda@rub.de \\ \addr Horst G\"ortz Institute for IT Security and Faculty of Mathematics\\ Ruhr-Universit\"at Bochum, D-44801 Bochum, Germany} \editor{TBD} \maketitle \begin{abstract}% <- trailing '%' for backward compatibility of .sty file The \R package \pck provides tools for statistics and machine learning under differential privacy. Suitable for releasing analyses on privacy-sensitive data to untrusted third parties, differential privacy has become the framework of choice for privacy-preserving learning. \pck delivers: (a) implementations of generic mechanisms for privatizing non-private target functions, including the Laplace, Gaussian, exponential and Bernstein mechanisms; (b) a recent sensitivity sampler due to \citet{rubinstein2017sampler} that empirically estimates the sensitivity of non-private targets---obviating mathematical analysis for exact sensitivity bounds needed for most generic mechanisms; (c) an extensible framework for implementing differentially-private mechanisms. Together, the components of \pck permit easy high-utility privatization of complex analyses, learners and even black-box software programs. \end{abstract} \begin{keywords} differential privacy, empirical process theory, R, open-source software \end{keywords} \section{Introduction} <>= library(knitr) opts_chunk$set(fig.width=12, fig.height=4, fig.path='', comment="#>", warning=FALSE, message=FALSE, tidy=FALSE, size="small") options(width=60) set.seed(53079239) # install package if necessary: #if(!require("qtl")) install.packages("qtl", repos="http://cran.us.r-project.org") @ Differential privacy~\citep{dwork2006calibrating} has quickly become a key framework for semantic guarantees of data privacy when releasing analysis on privacy-sensitive data to untrusted third parties. A great deal of its popularity is owed to a suite of generic mechanisms for privatizing non-private target functions of data \ie statistics, estimation procedures, and learners. Common to these generic mechanisms is the requirement that the non-private target's sensitivity to dataset perturbation is known and bounded. In all except the most trivial analyses, bounding sensitivity is prohibitively involved. This paper describes the \pck \R package that implements generic mechanisms for differential privacy, along with our recent sensitivity sampler that replaces exact sensitivity bounds with empirical estimates~\citep{rubinstein2017sampler}. As a result, \pck privatizes a wide range of procedures under random differential privacy~\citep{hall2012random}, automatically without mathematical analysis and in many cases achieving high utility. \pck is available from \texttt{https://github.com/brubinstein/diffpriv} under an open-source license. \section{Generic Mechanisms for Differential Privacy} \label{sec:generic} Fundamental to differential privacy is a privacy-sensitive \term{dataset} (or \term{database}) $D\in\cD^n$ on \term{domain} \cD. In \pck a dataset can be a \texttt{list}, \texttt{matrix}, \texttt{data.frame}, \texttt{vector}. We say that a pair of databases $D,D'\in\cD^n$ is \term{neighboring} if they differ on exactly one record. While individual records should not be revealed, we aim to release aggregate information on $D$ with a mechanism. A \term{mechanism} $M: \cD^n \to \cR$ is a random-valued function of databases taking values in a response set \cR; implemented in \pck as virtual class \code{DPMech}. \begin{definition}[\citealp{dwork2006calibrating}] For $\epsilon>0$, mechanism $M: \cD^n \to \cR$ preserves \term{$\epsilon$-differential privacy} if for all neighboring pairs $D,D'\in\cD^n$, measurable response sets $R\subseteq\cR$, $\Pr{M(D)\in R}\leq \exp(\epsilon)\cdot\Pr{M(D')\in R}$. Alternatively for $\delta\in(0,1)$, relaxed \term{$(\epsilon,\delta)$-differential privacy} holds if $\Pr{M(D)\in R}\leq \exp(\epsilon)\cdot\Pr{M(D')\in R}+\delta$. \end{definition} Mechanism $M$ preserves differential privacy if its response distributions are close on neighboring pairs: an adversary cannot determine an unknown record from $M$ responses, even with knowledge of the other $n-1$ records. Privacy parameters $\epsilon$ ($\epsilon,\delta$) are encapsulated in class \code{DPParamsEps} (\code{DPParamsDel} respectively). Virtual \code{DPMech} generic method \texttt{releaseResponse(mechanism, privacyParams, X)} takes privacy parameters along with sensitive dataset. Most generic mechanisms in differential privacy share a number of properties leveraged by the \pck package as follows. \paragraph{Privatizing a target function.} Many mechanisms $M: \cD^n\to\cR$ seek to privatize a non-private \term{target} function $f: \cD^n\to\cB$, with range \cB often (but not always!) coinciding with \cR. Accordingly \code{DPMech} objects are initialized with \code{target} slot of type \code{function}. A mechanism's \code{releaseResponse()} method calls \code{target} to form privacy-preserving responses. \paragraph{Normed target range space.} Target $f$'s output space \cB is typically imbued with a norm, denoted $\|\cdot\|_{\cB}$, needed for measuring the sensitivity of $f$'s outputs to input perturbation. \pck flexibly represents this norm within \code{DPMech} objects as described next. \paragraph{Sensitivity-induced privacy.} Many mechanisms achieve differential privacy by calibrating randomization to the sensitivity of the target function $f$. Targets that are relatively insensitive (sensitive) to perturbing input $D$ to neighboring $D'$ need relatively little (large) response randomization. On a pair of neighboring databases $D,D'\in\cD^n$ the \term{sensitivity} of $f$ is measured as $\Delta(D,D')= \|f(D) - f(D')\|_{\cB}$. \term{Global sensitivity} is the largest such value $\overline{\Delta}=\sup_{D,D'}\|f(D)=f(D')\|_{\cB}$ over all possible neighboring pairs. As we discuss in \citep{rubinstein2017sampler}, a broad class of generic mechanisms, taking sensitivity $\Delta$ as a parameter, are \term{sensitivity-induced private}: for any neighboring pair $D,D'\in\cD^n$ if $\Delta(D,D')\leq\Delta$ then the mechanism $M_{\Delta}$ run with parameter $\Delta$ achieves $\Pr{M_{\Delta}(D)\in R}\leq\exp(\epsilon)\cdot\Pr{M_{\Delta}(D')\in R}$ for all $R\subseteq\cR$. When run with $\Delta=\overline{\Delta}$, this condition holds for all neighboring pairs: $M_{\overline{\Delta}}$ satisfies $\epsilon$-differential privacy. Similarly for $(\epsilon,\delta)$-differential privacy. This is essentially how differential privacy is proved for most generic mechanisms, included those in \pck. \code{DPMech} objects can be initialized with a \code{sensitivity} argument, stored in an S4 slot of the same name. If the user provides a manually-derived global sensitivity bound $\overline{\Delta}$, then \code{releaseResponse()} responses preserve $\epsilon$- or ($\epsilon,\delta$)-DP (depending on the specific mechanism). We now demonstrate this use case. \paragraph{Example: Laplace mechanism.} \pck implements Laplace~\citep{dwork2006calibrating}, Gaussian~\citep{dwork2014algorithmic}, exponential~\citep{mcsherry2007mechanism}, and Bernstein~\citep{alda2017bernstein} mechanisms as \code{DPMech} subclasses \code{DPMechLaplace}, \code{DPMechGaussian}, \code{DPMechExponential}, and \code{DPMechBernstein}. An exponential example is included below. \code{DPMechLaplace} releases numeric vectors, and requires that \cB be numeric $\reals^d$ for some $d$ and adopts the $L_1$ norm (sum of absolutes) for $\|\cdot\|_{\cB}$. The mechanism releases vectors in $\cR=\cB=\reals^d$ by adding an i.i.d. sample of $d$ Laplace-distributed random variables with means 0 and scales $\overline{\Delta}/\epsilon$ to $f(D)$ to achieve $\epsilon$-differential privacy. \code{DPMechGaussian} also privatizes numeric responses but under $L_2$-sensitivity and weaker $(\epsilon,\delta)$-DP; \code{DPMechBernstein} leverages the Laplace mechanism to release multivariate real-valued functions. We next demonstrate Laplace privatization of the sample mean on bounded data in $\cD^n=[0,1]^n$, for which \cB dimension \code{dims} $d$ is one. Global sensitivity is readily bounded as $1/n$: For any neighboring pair $D,D'\in[0,1]^n$, $\Delta(D,D')=n^{-1}\left|\sum_{i=1}^n D_i - \sum_{i=1}^n D'_i\right|$. Since $n-1$ records are identical, and all records are in $[0,1]$, this is $|D_n - D'_n|/n \leq 1/n$. <>= library(diffpriv) f <- function(X) mean(X) ## target function n <- 100 ## dataset size mechanism <- DPMechLaplace(target = f, sensitivity = 1/n, dims = 1) D <- runif(n, min = 0, max = 1) ## the sensitive database in [0,1]^n pparams <- DPParamsEps(epsilon = 1) ## desired privacy budget r <- releaseResponse(mechanism, privacyParams = pparams, X = D) cat("Private response r$response:", r$response, "\nNon-private response f(D): ", f(D)) @ \vspace{-1em} \section{Sensitivity Sampling for Random Differential Privacy} \label{sec:sampler} When \code{target} global sensitivity is supplied as \code{sensitivity} within \code{DPMech} construction, responses are differentially private. Global sensitivity is known for \emph{idealizations} of \eg coefficients for regularized logistic regression~\citep{chaudhuri2009logistic} and the SVM~\citep{rubinstein2012svm,chaudhuri2011erm}. In complex applications such as privatizing a software function, however, \code{target}'s global sensitivity may not be readily available. For such cases, \pck implements the sensitivity sampler of \citet{rubinstein2017sampler} which forms a high-probability estimate of \code{target} sensitivity by repeated probing of sensitivity on random neighboring database pairs, leveraging tools from empirical process theory. Like sensitivity estimates, resulting privacy holds with high probability. \begin{definition}[\citealp{hall2012random}] A mechanism $M$ preserves \term{$(\epsilon,\gamma)$-random differential privacy} (with a corresponding form for $\epsilon,\delta,\gamma$) if $\forall R\subseteq\cR, \Pr{M(D)\in R}\leq\exp(\epsilon)\cdot\Pr{M(D')\in R}$ holds with probability at least $1-\gamma$ over random database pairs $D,D'$. \end{definition} While weaker than $\epsilon$-DP, RDP is arguably more natural than $(\epsilon,\delta)$-DP: The later safeguards all databases but not unlikely responses, while RDP protects against all responses but not pathological databases (as defined by the database sampling distribution). The sampling distribution can be anything meaningful \eg uniform, a Bayesian prior, a density from data privately fit by the Bernstein mechanism~\citep{alda2017bernstein}, etc. The \code{DPMech} method \code{sensitivitySampler(object, oracle, n, m, gamma)} requires: a mechanism \code{object}; a function \code{oracle} which outputs i.i.d. random databases of given size \code{n} which should match the size of input data supplied later in calls to \code{releaseResponse()}; a sensitivity sample size \code{m}; and desired privacy confidence \code{gamma}. Either (but not both) of \code{m}, \code{gamma} can be omitted: the omitted resource will be optimized automatically. For example \code{m} taken small (few hundred) reflects limited sampling time; small \code{gamma} (\eg 0.05) prioritizes privacy. The sensitivity sampler calls appropriate \code{DPMech} \code{sensitivityNorm()} which implements $\Delta(D,D')$ for the mechanism's norm and stored \code{target}. New subclasses of \code{DPMech} need only implement this method in order to take advantage of the sensitivity sampler. After \code{sensitivitySampler()}, subsequent \code{releaseResponse()} results have a privacy parameter slot of type \code{DPParamsGam} indicating response RDP. \paragraph{Example: Exponential mechanism.} All \code{DPMech} subclasses are sensitivity-induced private and can be sensitivity sampled; we demonstrate the exponential mechanism here. Exponential privately optimizes an application-specific objective (or score, utility) $s(r)$ of candidate response $r\in\cR$, with response distribution proportional to $\exp(\epsilon \cdot s(r) / (2\Delta))$. Typically $s$ depends on input $D$, and so \code{DPMechExponential} is initialized with \code{target} that takes $D$ and outputs a score function. That is, $\cB=\reals^{\cR}$ is a real-valued function space on \cR and the class's \code{sensitivityNorm()} implements the sup-norm (\cf \citealp{rubinstein2017sampler}). In practice, users supply \code{target} as an \R closure as demonstrated below. Given global \code{sensitivity} of \code{target}, the mechanism preserves $\epsilon$-DP; if \code{sensitivitySampler()} estimates sensitivity with some \code{gamma}, then RDP is preserved at confidence $\gamma=$\code{gamma}. Applying these ideas to find the most frequent a--z character within a dataset of top-10 computer scientists from Semantic Scholar, subject to privacy of individuals. The exponential mechanism privately maximizes total frequency. But without bounded name lengths, this function has unbounded global sensitivity. We therefore use sensitivity sampler for $(1,0.1)$-RDP, with an oracle that samples representative U.S. names based on \code{randomNames}. <>= library(randomNames) ## a package that generates representative random names oracle <- function(n) randomNames(n) D <- c("Michael Jordan", "Andrew Ng", "Andrew Zisserman","Christopher Manning", "Jitendra Malik", "Geoffrey Hinton", "Scott Shenker", "Bernhard Scholkopf", "Jon Kleinberg", "Judea Pearl") n <- length(D) f <- function(X) { function(r) sum(r == unlist(base::strsplit(X, ""))) } rSet <- as.list(letters) ## the response set, letters a--z, must be a list mechanism <- DPMechExponential(target = f, responseSet = rSet) mechanism <- sensitivitySampler(mechanism, oracle = oracle, n = n, gamma = 0.1) pparams <- DPParamsEps(epsilon = 1) r <- releaseResponse(mechanism, privacyParams = pparams, X = D) cat("Private response r$response: ", r$response, "\nNon-private f(D) maximizer: ", letters[which.max(sapply(rSet, f(D)))]) @ \vspace{-2em} %Since mechanisms, notably Laplace and exponential, tend to use the norm $\|\cdot\|_{\cB}$ only for computing sensitivity $\Delta(D,D')$, and tend to assume one particular norm, \pck represents a minimal interface to the norm via \code{DPMech} method \code{sensitivityNorm()}. In implemented generic mechanisms that require a fixed norm, this norm is built into the corresponding method. For (future) user-defined mechanisms, function-dependent norms can also be supplied at initialization by extended the class constructor. %pressure phenotype, will consider just the \Sexpr{1+2} individuals with %%<>= %%hist(rnorm(100)) %%@ %\section{R and package versions used} % %<>= %sessionInfo() %@ % Acknowledgements should go at the end, before appendices and references \acks{B. Rubinstein and F. Ald\`a acknowledge the support of the Australian Research Council (DE160100584) and the DFG Research Training Group GRK 1817/1 respectively.} \vskip 0.2in \bibliography{diffpriv} \end{document}