Differential privacy methods. Let us start with a simple question: how does differential privacy work? The main idea is to use suitable algorithms that add a sufficient amount of noise to a dataset in order to guarantee that nothing specific is revealed about any individual in the data. The key idea is that, due to the added noise, the impact of adding or removing a single individual data point is small relative to the noise. As a result, the overall outcome of an analysis is no longer changed significantly by an individual data point. In other words, if the inclusion or removal of any individual is lost in the noise, then the individual can no longer be identified. The differential privacy approach can be applied both to an actual Big Data dataset and to statistical values calculated from such a dataset.

We will now consider three methods for differential privacy: the Laplace mechanism for query-response systems, the randomized response approach for collecting responses in surveys or for gathering user data, and the exponential mechanism for working with databases.

There are different types of methods for differential privacy, different mechanisms. One basic mechanism is the Laplace mechanism, which is applied to anonymize statistical aggregates such as an average. For example, suppose we want to know the average number of people having a certain disease. If we calculate such an average without any mechanism, the result can provide some kind of hint about the people in the group. For example, if we divide the population into subgroups according to age, social or physical aspects, and then provide the average number of people having this disease for each subgroup, then an attacker might learn a lot about the subgroups, or might even be able to guess something about individual people in a subgroup.
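As a minimal sketch of this idea in code, the following adds Laplace noise to a subgroup statistic. This is an illustration under common assumptions, not part of the lecture: the function and parameter names are my own, and the example assumes a counting query whose sensitivity is 1.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Add Laplace noise with scale sensitivity/epsilon to a statistic.

    Illustrative sketch: a smaller epsilon means more noise and stronger
    privacy; sensitivity is how much one individual can change the value.
    """
    if rng is None:
        rng = np.random.default_rng()
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# A counting query ("how many people in this subgroup have the disease?")
# has sensitivity 1: adding or removing one person changes it by at most 1.
noisy_count = laplace_mechanism(42, sensitivity=1.0, epsilon=0.5)
```

A published noisy count fluctuates around the true value, so the presence or absence of a single individual is hidden in the noise while the statistic remains approximately usable.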
When we calculate an average, the Laplace mechanism adds a random number to this average, a kind of noise drawn from the Laplace distribution, and we then have the guarantee that the average does not reveal a large amount of information about individuals.

Let's explore this further with the example of a dataset containing people with a particular illness. Without the Laplace mechanism, an attacker could send queries for different subgroups, for example people from Germany, people who are over 30, and so on. By combining the information from different queries, an attacker might over time be able to deduce information about an individual person. So by adding controlled noise to the function we want to compute, for example the average for a subgroup, we can ensure that it is very difficult to gain information about a person by combining results for different subgroups.

Another approach is randomized response. This ensures that the privacy of individuals is protected from the moment the data is generated. This method can, for example, be used if we ask a sensitive question in a survey with two possible answers, yes or no. A randomized response procedure that helps to protect the privacy of survey participants can look like this: for each survey response, two coins are flipped. In practice this is done in software using a random number generator. If the result of the flipped coins is two heads, we note down yes as the answer. If it is two tails, we note down no. In case of a mixed result, we note down the answer truthfully. We have now distorted the survey results in a statistically well-defined, known way. The effect of this approach is that it is impossible to derive information about the response of an individual participant, but with a few statistical calculations we can still determine the real survey results. It should be noted that this approach only works for somewhat larger groups of participants.
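The two-coin procedure above, and the statistical correction that recovers the true proportions, can be sketched as follows. This is an illustrative implementation, with function names of my own choosing; the recovery formula follows from the fact that each report is truthful with probability 1/2 and a fair coin flip otherwise.

```python
import random

def randomized_response(true_answer, rng=random):
    """Two-coin randomized response: two heads -> report yes (True),
    two tails -> report no (False), mixed result -> report the truth."""
    coin1 = rng.random() < 0.5
    coin2 = rng.random() < 0.5
    if coin1 and coin2:
        return True
    if not coin1 and not coin2:
        return False
    return true_answer

def estimate_true_fraction(reports):
    """Recover the real 'yes' fraction from the noisy reports.

    P(yes reported) = 1/4 (two heads) + 1/2 * p_true (mixed, truthful),
    so p_true = 2 * (P(yes reported) - 1/4).
    """
    p_reported = sum(reports) / len(reports)
    return 2 * (p_reported - 0.25)
```

Because any single "yes" may just be the result of two heads, an individual report reveals essentially nothing, yet over a large group the estimate converges to the true fraction; for small groups the statistical error dominates, which is why the approach needs somewhat larger numbers of participants.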
Another approach to achieve differential privacy is the exponential mechanism. A usage scenario for this method is when we want to publish an anonymized dataset. If we have a true sample of a population and want to publish it, we need some way to ensure that the published dataset does not infringe the privacy of its participants. The exponential mechanism provides a method for generating realistic but artificial samples from the dataset, so we can create a completely artificial dataset that has the same distribution patterns as the original. It contains only random entries that correspond to artificial or virtual people; it no longer corresponds to any real people.

The exponential mechanism is based on a quality score for how well the output, the artificially generated entries, represents the input data, the original true entries. Each possible database entry gets a score based on its likelihood as derived from the input data. The exponential mechanism chooses outputs with higher scores with higher probability. Based on this, a synthetic dataset is generated that allows for the same statistical conclusions as the original database but preserves privacy, as it does not contain real data and thus no specific information about individuals.

In conclusion, differential privacy methods aim to make big datasets available for statistical analyses while protecting the privacy of individuals. This is done by making it impossible to access the data of an individual, directly or indirectly. We have covered three methods for differential privacy. The Laplace mechanism, which adds noise to the function we want to compute, thus making it difficult to derive information about an individual person by combining results of subgroups. Randomized response, which ensures that privacy is protected already at the moment the data is generated.
And the exponential mechanism, which is used to publish anonymized datasets that have the same distribution patterns as the original dataset.
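The scoring-and-sampling idea behind the exponential mechanism can be sketched as below. This is a generic sketch of the standard mechanism, which weights each candidate output by exp(epsilon * score / (2 * sensitivity)); the parameter names and the age example are my own illustrations, not from the lecture.

```python
import math
import random

def exponential_mechanism(candidates, score, sensitivity, epsilon, rng=random):
    """Select one candidate output with probability proportional to
    exp(epsilon * score(candidate) / (2 * sensitivity))."""
    weights = [math.exp(epsilon * score(c) / (2.0 * sensitivity))
               for c in candidates]
    threshold = rng.random() * sum(weights)
    cumulative = 0.0
    for candidate, weight in zip(candidates, weights):
        cumulative += weight
        if threshold <= cumulative:
            return candidate
    return candidates[-1]

# Example: score each possible age value by how often it occurs in the
# (hypothetical) original data, then draw synthetic entries. Frequent
# values are chosen with higher probability, so the synthetic dataset
# mirrors the original distribution without copying real records.
original_ages = [20, 20, 20, 30, 30, 40]
synthetic = [exponential_mechanism(list(range(18, 60)),
                                   original_ages.count, 1.0, 1.0)
             for _ in range(100)]
```

Repeating the draw for each attribute of each synthetic record yields an entirely artificial dataset whose statistics resemble the original while no entry belongs to a real person.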