This repository has been archived by the owner on Mar 16, 2023. It is now read-only.

Prevent browsing history detection #40

Open
arturjanc opened this issue Feb 4, 2021 · 3 comments

@arturjanc

This problem is partly discussed by #36 and is related to #38 (comment) but I want to make the threat scenario more explicit.

The security section in the explainer mentions revealing people's interests to the web, but it's important to note that FLoC could also be reverse engineered to reveal the specific set of websites visited by the user.

For example, consider a user with a clean browsing profile who, in the first few days after installing the browser, visits their favorite news site, social network, and bank website. This will assign the user to a cohort with a random-seeming identifier; however, an attacker can guess the set of websites visited by the user, calculate the FLoC resulting from this history pattern offline, and compare the value to the user's actual FLoC. An attacker could compute a large set of likely FLoC values based on the popularity of websites, news articles published in a given period of time, content shared on social media, etc. Given that browsing patterns are not random, a motivated attacker can likely find matches for a large fraction of users. While the FLoC value doesn't give the attacker certainty that a user has visited a specific set of sites, it can give them high confidence, especially if the attacker is willing to make some assumptions about which sites the user is likely to visit.
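
To make the offline guessing step concrete, here is a minimal Python sketch. It is not Chrome's FLoC algorithm: `toy_simhash`, `candidate_histories`, the 16-bit output, and the `siteN.example` domains are all assumptions made up for illustration. The sketch hashes a set of domains into a short cohort-like value, then enumerates small combinations of "popular" sites and keeps the ones that hash to the observed value.

```python
import hashlib
from itertools import combinations


def toy_simhash(domains, bits=16):
    """Toy SimHash over a set of domains (illustrative only, not real FLoC)."""
    counts = [0] * bits
    for domain in domains:
        # Derive a stable 64-bit hash for each domain.
        h = int.from_bytes(hashlib.sha256(domain.encode()).digest()[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # Output bit i is set when more domains voted 1 than 0 at position i.
    return sum(1 << i for i, c in enumerate(counts) if c > 0)


def candidate_histories(observed, popular_sites, max_size=3, bits=16):
    """Enumerate guessed histories and keep those matching the observed value."""
    matches = []
    for k in range(1, max_size + 1):
        for combo in combinations(sorted(popular_sites), k):
            if toy_simhash(combo, bits) == observed:
                matches.append(frozenset(combo))
    return matches


# Hypothetical list of popular sites and a hypothetical victim history.
popular = [f"site{i}.example" for i in range(12)]
victim = {"site1.example", "site4.example", "site7.example"}

observed = toy_simhash(victim)
matches = candidate_histories(observed, popular)
print(f"{len(matches)} candidate histories match cohort {observed}")
```

The true history is always among the matches; how many other candidates remain depends on the output length and on how many plausible histories the attacker has to consider.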

As mentioned in #38 (comment), the potential risk here is affected by the granularity of data taken into account during FLoC calculation. Less granularity (e.g. taking into account only the site of a visited page) reveals less information, but makes it easier to calculate a collision with the user's FLoC. More granularity (e.g. taking into account the full URL, or page contents) makes the FLoC harder to precompute, but may reveal sensitive cross-origin information if the attacker manages to find the right match.

We also know from past research that browsing preferences are relatively stable over time. This suggests that it may be easy for attackers to precompute FLoCs to find matches, and also increases the risk of reidentification: If I keep using the same bank, webmail and news site, but visit a few new viral websites linked to by my social network each week, I may get a new FLoC, but an attacker who knows which content is popular in a given week can infer what my bank & webmail websites are, linking my past and current profile.

This seems like an important problem to address. The main thing I can think of is to reduce the length of the FLoC so collisions are frequent enough to make it difficult to make inferences about the actual set of visited sites. Randomizing the FLoC (e.g. using a random seed for each user) seems unlikely to meaningfully help here because it will only require the attacker to do more work to compute the value.
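
As a rough illustration of why a shorter identifier helps, the sketch below counts how many guessed histories collide with a victim's value at different output lengths. Everything here is an assumption for illustration: `toy_cohort` is a toy SimHash, not the real FLoC computation, and the site list and victim history are invented.

```python
import hashlib
from itertools import combinations


def toy_cohort(domains, bits):
    """Toy SimHash truncated to `bits` bits (illustrative only, not real FLoC)."""
    counts = [0] * bits
    for domain in domains:
        h = int.from_bytes(hashlib.sha256(domain.encode()).digest()[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, c in enumerate(counts) if c > 0)


def ambiguity(victim, popular_sites, bits, max_size=3):
    """Count guessed histories (up to max_size sites) sharing the victim's value."""
    target = toy_cohort(victim, bits)
    return sum(
        1
        for k in range(1, max_size + 1)
        for combo in combinations(popular_sites, k)
        if toy_cohort(combo, bits) == target
    )


# Hypothetical inputs.
popular = [f"site{i}.example" for i in range(12)]
victim = ("site1.example", "site4.example", "site7.example")

for bits in (4, 8, 16):
    print(bits, "bits ->", ambiguity(victim, popular, bits), "matching guesses")
```

In this toy model each output bit is computed independently, so any guess that matches at 16 bits also matches at 8 and at 4. Shortening the value can therefore only enlarge the set of histories the attacker cannot tell apart.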

@michaelkleber
Collaborator

michaelkleber commented Mar 11, 2021

Hi Artur,

> The main thing I can think of is to reduce the length of the FLoC so collisions are frequent enough to make it difficult to make inferences about the actual set of visited sites.

This is indeed exactly how FLoC is intended to work.

Based on your question, we ran some analysis on the data from the subset of Chrome users who sync their histories and flocks. Each cohort includes people with a lot of different browsing habits: the smallest number of distinct browsing histories in a cohort is 286.

Note that this means at least 286 different sets of web sites visited by actual people in the cohort, not a theoretical list that could be winnowed down by looking at which collections of sites were more likely to be visited together.
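
The metric described above (the smallest number of distinct browsing histories observed in any cohort) could be computed along these lines. This is only a sketch under assumptions: the `(cohort_id, domains)` record shape and the function name `min_distinct_histories` are hypothetical, and the sample data is invented rather than taken from the analysis.

```python
from collections import defaultdict


def min_distinct_histories(records):
    """Return the smallest count of distinct domain sets seen in any cohort.

    `records` is an iterable of (cohort_id, iterable_of_domains) pairs.
    """
    per_cohort = defaultdict(set)
    for cohort_id, domains in records:
        # A history is treated as an unordered set of registrable domains.
        per_cohort[cohort_id].add(frozenset(domains))
    return min(len(histories) for histories in per_cohort.values())


# Invented sample data: cohort 1 has two distinct histories, cohort 2 has three.
records = [
    (1, ["a.example", "b.example"]),
    (1, ["b.example", "a.example"]),  # same set as above, counted once
    (1, ["c.example"]),
    (2, ["a.example"]),
    (2, ["b.example"]),
    (2, ["c.example", "d.example"]),
]
print(min_distinct_histories(records))
```

Counting observed histories gives a lower bound on what an attacker must disambiguate, since it ignores plausible histories that no sampled user happened to have.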

Of course this is an artifact of the particular choice of clustering algorithm we're using (which looks only at registerable domains). As we experiment with other clustering techniques, we do need to remain aware of this risk.

Edit: Apologies, the above number 286 was based on an earlier version of clustering, not the one used in the Origin Trial. The correct data is posted here; in fact each cohort contains users in that subset with at least 735 different browsing histories.

@npdoty

npdoty commented Apr 14, 2021

In considering this attack, we should also note that some attackers may be able to combine data they already have about sites the user has visited with assumptions about popularity. If the attacker has an embedded script and the user's cookie on several sites, they can calculate a much narrower delta between the user's observed cohort and their anticipated cohort (based on known traffic and popularity data).
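
A small sketch of that narrowing effect, again using an invented toy hash rather than the real FLoC computation: `toy_cohort`, `matching_histories`, the 8-bit output, and all site names are assumptions for illustration. When the attacker already knows some visited sites, only guesses that contain those sites need to be considered.

```python
import hashlib
from itertools import combinations


def toy_cohort(domains, bits=8):
    """Toy SimHash over a set of domains (illustrative only, not real FLoC)."""
    counts = [0] * bits
    for domain in domains:
        h = int.from_bytes(hashlib.sha256(domain.encode()).digest()[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, c in enumerate(counts) if c > 0)


def matching_histories(observed, popular_sites, known=frozenset(), max_size=3, bits=8):
    """List guesses that contain every known site and hash to the observed value."""
    known = frozenset(known)
    rest = sorted(set(popular_sites) - known)
    out = []
    for k in range(0, max_size - len(known) + 1):
        for extra in combinations(rest, k):
            guess = known | frozenset(extra)
            if guess and toy_cohort(guess, bits) == observed:
                out.append(guess)
    return out


# Hypothetical inputs: the attacker has seen the user on site1 and site4.
popular = [f"site{i}.example" for i in range(12)]
victim = {"site1.example", "site4.example", "site7.example"}
observed = toy_cohort(victim)

broad = matching_histories(observed, popular)
narrowed = matching_histories(observed, popular, known={"site1.example", "site4.example"})
print(len(broad), "candidates without prior knowledge,", len(narrowed), "with it")
```

The narrowed list is always a subset of the broad one and still contains the true history, so each known site removes ambiguity the cohort would otherwise provide.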

@michaelkleber
Collaborator

Quite right, though note that this will be partly mitigated by the removal of 3rd-party cookies, of course!
