おお。面白い。

Masahiro Sakai - 2012-07-02T13:01:59+0000 - 更新日時: 2012-07-02T13:01:59+0000

おお。面白い。

Terence Tao さんが最初に共有した投稿Anonymity on the internet is a very fragile thing; every anonymous online identity on this planet is only about 31 bits of information away from being completely exposed. This is because the total number of internet users on this planet is about 2 billion, or approximately 2^{31}. Initially, all one knows about an anonymous internet user is that he or she is a member of this large population, which has a Shannon entropy of about 31 bits. But each piece of new information about this identity will reduce this entropy. For instance, knowing the gender of the user will cut down the size of the population of possible candidates for the user's identity by a factor of approximately two, thus stripping away one bit of entropy. (Actually, one loses a little less than a whole bit here, because the gender distribution of internet users is not perfectly balanced.) Similarly, any tidbit of information about the nationality, profession, marital status, location, hobbies, age, ethnicity, education level, socio-economic status, languages known, birthplace, appearance, political leaning, etc. of the user will reduce the entropy further. (Note though that information loss is not always additive; if knowing X removes 2 bits of entropy and knowing Y removes 3 bits, then knowing both X and Y does not necessarily remove 5 bits of information, because X and Y may be correlated instead of independent, and so much of the information gained from Y may already have been present in X).

One can reveal quite a few bits of information about oneself without any serious loss to one's anonymity; for instance, if one has revealed a net of 20 independent bits of information over the lifetime of one's online identity, this still leaves one in a crowd of about 2^11 ~ 2000 other people, enough to still enjoy some reasonable level of anonymity. But as one approaches the threshold of 31 bits, the level of anonymity drops exponentially fast. Once one has revealed more than 31 bits, it becomes theoretically possible to deduce one's identity, given a sufficiently comprehensive set of databases about the population of internet users and their characteristics. Of course, such an ideal set of databases does not actually exist; but one can imagine that government intelligence agencies may have enough of these databases to deduce one's identity from, say, 50 or 60 bits of information, and even publicly available databases (such as what one can access from popular search engines) are probably enough to do the job given, say, 100 bits of information, assuming sufficient patience and determination. Thus, in today's online world, a crowd of billions of other people is considerably less protection for one's anonymity than one may initially think, and just because the first 20 or 30 bits of information you reveal about yourself leads to no apparent loss of anonymity, this does not mean that the next 20 or 30 bits revealed will do so also.

Restricting access to online databases may recover a handful of bits of anonymity, but one will not return to anything close to pre-internet levels of anonymity without extremely draconian information controls. Completely discarding a previous online identity and starting afresh can reset one's level of anonymity to near-maximum levels, but one has to be careful never to link the new identity to the old one, or else the protection gained by switching will be lost, and the information revealed by the two online identities, when combined together, may cumulatively be enough to destroy the anonymity of both.

But one additional way to gain more anonymity is through deliberate disinformation. For instance, suppose that one reveals 100 independent bits of information about oneself. Ordinarily, this would cost 100 bits of anonymity (assuming that each bit was a priori equally likely to be true or false), by cutting the number of possibilities down by a factor of 2^100; but if 5 of these 100 bits (chosen randomly and not revealed in advance) are deliberately falsified, then the number of possibilities increases again by a factor of (100 choose 5) ~ 2^26, recovering about 26 bits of anonymity. In practice one gains even more anonymity than this, because to dispel the disinformation one needs to solve a satisfiability problem, which can be notoriously intractible computationally, although this additional protection may dissipate with time as algorithms improve (e.g. by incorporating ideas from compressed sensing).