Topic: The Mystery of Data Sharing And Privacy Protection: What Differential Privacy Is
Good Data Foundation (OP)
September 17, 2021, 08:18:43 AM

Data, often described as the most important asset of the 21st century, has increasingly attracted the attention of society as a whole. Data sharing is an extremely important way to unlock the value of data. However, there seems to be a dilemma between data sharing and privacy protection, and balancing the two is a difficult problem. Fortunately, with continuous developments in computer science and cryptography, both goals can be achieved by combining a series of technologies. In this article, we will introduce one of the most common privacy computation technologies: differential privacy.

Concept of differential privacy

With regard to the concept of differential privacy, Wikipedia describes it as a means of sharing information about a dataset that reveals only aggregate statistical features, describing the database as a whole without disclosing personal information. The intuitive idea behind it is this: if adding, removing, or modifying a single record in the database has a small enough impact on the published results, then those statistics cannot be used to deduce the contents of any single record. This property can be used to protect privacy.

The core technique behind differential privacy is to turn each query result into a random variable by adding noise to it. Roughly speaking, the fewer records a query covers, the more the added noise matters relative to the true answer, so the same degree of privacy is preserved even for very narrow queries.
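As a minimal sketch of this idea (in Python with NumPy; the function name laplace_count and the counting query are illustrative choices, not part of any particular product), the Laplace mechanism for a counting query looks like this:

Code:
import numpy as np

def laplace_count(data, predicate, epsilon):
    """Differentially private count of records matching `predicate`.

    A counting query has sensitivity 1: adding or removing one record
    changes the exact count by at most 1. The Laplace mechanism hides
    that change by adding noise drawn from Laplace(0, 1/epsilon).
    """
    true_count = sum(1 for record in data if predicate(record))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

The privacy parameter epsilon controls the trade-off: a smaller epsilon means stronger privacy and a noisier answer.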

Application of differential privacy

Let’s say you have a database of academic qualifications. In this database, 10 people have a primary school education, 20 people have a middle school education, and 30 people have a university education, and the number of people at each level can be queried. Now a new record is registered in the database. Querying again, we find that 31 people have a university education, so we can conclude that the newly entered person holds a university degree. This example shows that even if we cannot query individual records, a statistical database may still leak information about specific records. Differential privacy is designed to prevent exactly this kind of leakage.
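To make the leak concrete, the scenario above can be written out in a few lines of Python (the categories and counts are just the example's):

Code:
# Exact counts leak information about a single new record.
degrees = ["primary"] * 10 + ["middle"] * 20 + ["university"] * 30

before = degrees.count("university")   # 30

degrees.append("university")           # a new person registers
after = degrees.count("university")    # 31

# The difference between two exact answers reveals the newcomer's degree.
print("newcomer has a university degree:", after - before == 1)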

Let’s apply differential privacy to the example above. With Laplace noise added to the query result, the reported number of university-educated people might be 29.5 instead of the exact 30. After a new university-educated record is added, a fresh query returns another similarly noisy value. Because the two answers are statistically very close, the newly entered record stays hidden.
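Continuing the example (and repeating the hypothetical laplace_count helper so the sketch is self-contained; epsilon = 1.0 is an arbitrary illustrative choice), the same pair of queries no longer reveals the newcomer:

Code:
import numpy as np

def laplace_count(data, predicate, epsilon):
    # Same hypothetical helper as in the earlier sketch.
    true_count = sum(1 for r in data if predicate(r))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

degrees_before = ["primary"] * 10 + ["middle"] * 20 + ["university"] * 30
degrees_after = degrees_before + ["university"]  # the newcomer's record

epsilon = 1.0  # illustrative privacy budget, not a recommendation
q1 = laplace_count(degrees_before, lambda d: d == "university", epsilon)
q2 = laplace_count(degrees_after, lambda d: d == "university", epsilon)

# Both answers are noisy (e.g. 29.5 and 31.2); the difference between
# them is dominated by noise, so the newcomer stays hidden.
print(round(q1, 1), round(q2, 1))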

In the GoodData blockchain, differential privacy is applied in the GoodData machine learning (ML) SDK to protect the privacy of data owners. The original data shared by a data owner is encrypted and protected with differential privacy, which ensures that the data owner is the only node holding the original data.

The above explains differential privacy from a non-technical perspective, with simple examples, to help ordinary users better understand the principle behind privacy computation. Privacy computation is a complex and rigorous field; realizing both data sharing and privacy protection also requires the support and cooperation of many other technologies, which will be introduced in the following articles. You can subscribe here.