by Brooke Stephenson
Imagine you’re a journalist and you receive a collection of tens of millions of posts from more than 100,000 Facebook groups. You think there’s got to be a story — maybe several — in that cache. But how do you find it?
A team of reporters from ProPublica and the Washington Post was faced with just such a problem in June, when the newsrooms obtained a unique dataset on Facebook groups compiled by CounterAction, a firm that studies online disinformation.
Computational journalist Jeff Kao and reporter Craig Silverman from ProPublica, along with Jeremy B. Merrill and Craig Timberg from The Washington Post, found that between Election Day 2020 and the Jan. 6 siege of the Capitol, Facebook groups exploded with at least 650,000 posts attacking the legitimacy of Joe Biden’s victory.
The four journalists’ reporting provides some of the clearest evidence yet that Facebook was an important source of misinformation that led to the Jan. 6 attack. Here’s a look at how they did it.
Reporters started with a collection of data on public Facebook groups CounterAction had been monitoring because members had posted links to websites with a strong connection to U.S. politics, or because the groups had members in common with other groups CounterAction was already monitoring. (The dataset did not include private Facebook groups, which are closed to everyone except their members.) This included up to 18 months’ worth of posts from each group. The dataset was so massive that a personal computer simply couldn’t process it; the reporting team had to run its analyses in the cloud.
Reporters initially just wanted to know what the data could tell them about how Facebook was treating political groups on its platform. They consulted with experts and sources, then analyzed the dataset in various ways, looking for patterns. As the summer of 2021 came to a close, Jeff and Craig found something promising.
They saw a spike in the rate at which Facebook removed groups from the platform right before the election, a dramatic drop-off after the election and then another spike around Jan. 6. The first spike in group removals seemed to show Facebook was capable of efficiently removing misinformation when it was determined to do so.
For example, reporters identified more than 300 QAnon groups in the CounterAction dataset. All had been removed by October 2020, when Facebook announced a total ban of QAnon.
But the speedy removals can be largely credited to a group task force within Facebook’s Civic Integrity Unit, which, according to former members, was disbanded soon after the election.
Facebook banned Stop the Steal groups on Nov. 5, 2020, but the rate of group removals appeared to drag after the election.
As rioters stormed the Capitol, Facebook was again taking down groups at a rate not seen since before the election.
With this in mind, the reporters narrowed their focus to the period between Election Day and Jan. 6, and made a point to look at Stop the Steal groups specifically, since they had a clear connection to the Jan. 6 attack.
Slowly but surely, the investigation was beginning to take shape.
Why Facebook Groups?
Misinformation on Facebook isn’t limited to groups. But the reporters had reason to believe they were a promising starting point.
For one thing, groups have been important to Facebook’s growth for a long time. The company has heavily promoted groups ever since Mark Zuckerberg made them his strategic priority in 2017. Jeff even remembers seeing ads for them on the New York City subway.
But they’re also one of the platform’s most toxic products.
Because of Facebook’s quest for engagement, Jeff explained, “many of the most popular groups are either clickbait content groups or political groups. And what’s politically engaging? It’s borderline bannable behavior.” In a March internal Facebook report, first published by Politico, Facebook identified “harmful” and “violating” narratives — content worth banning — as the “Worst of the Worst Hate,” “Violence & Incitement,” and “Vaccine Misinformation.”
“Facebook itself did a study saying when you set limits on what users should be allowed to see on the platform, the dynamics of the platform are such that people will walk right up to that line and try not to cross it,” Jeff said. The closer people can walk that line of prohibited content, like stoking political anger with misinformation, the farther their posts travel.
Many of the 650,000 group posts reporters identified as challenging the legitimacy of Biden’s election fell into this category. Facebook’s own report warned that these harmful narratives may have had “substantial negative impacts including contributing materially to the Capitol riot and potentially reducing collective civic engagement and social cohesion in the years to come.”
Drew Pusateri, a spokesperson for Meta, Facebook’s newly renamed parent company, declined to comment on specific posts for our January story, but said the company does not have a policy forbidding posts or comments that attack the legitimacy of the election. He said Meta has a dedicated groups integrity team and an ongoing initiative to protect people who use groups from harm.
Whittling It Down
To zoom in on the political groups with harmful content, reporters had to get a more accurate number of groups within the CounterAction dataset that were actually political. Floating around amid, for example, QAnon and Biden voter groups, were groups of knitters or runners or other groups unrelated to politics that might have ended up being monitored by CounterAction because one member shared a political link during election season.
Tools like keyword searches, which freed reporters from reviewing each group by hand, still cast too wide a net. For example, a search for all the groups that included posts with the word “Trump” would still catch a lot of online chatter in groups that aren’t political. They needed a way to calculate the scale of the problem that didn’t require them to read through all 100,000 groups.
Enter machine learning.
Jeff said when he is trying to figure out if a problem is suited for machine learning, he uses something he calls the “intern test.” “If a reasonably competent intern that you hired off the street who had no prior knowledge could be trained to do it fairly quickly — you show them a few examples like: this is inciting violence, this is not — then the machine learning algorithms could probably do it to a specific accuracy,” Jeff explained, as part of my ex-intern soul died a little out of sheer irrelevance.
Sorting groups into political and non-political buckets passed the intern test.
But in order to teach the machine how to do it, the team first had to do their own grunt work to provide the examples.
Many of the groups in the dataset disappeared from public view during this project. Based on the groups’ content and the timing of their disappearance, reporters believed many were taken down by Facebook. Of the groups the reporters analyzed, more than 5,000 that contained meaningful activity (meaning more than 10 of their posts had been flagged by CounterAction) were no longer online as of Aug. 30, 2021. The reporters hand-labeled the political ones: If a group’s name and description showed that it had been created to represent or discuss U.S. politics, or a social movement with strong ties to politics, it went in the “political” bucket. Ultimately, the reporters found about 2,500 of these groups, including those for the QAnon conspiracy theory, militia groups and the Stop the Steal movement.
These groups essentially became the machine’s study guide.
The machine (a text classification model in this case) combed through the remaining groups and compared their posts to those from the 2,500 political groups Jeff had already shown it. Then it predicted how likely each new group was to be political. If the model said there was more than a 50% chance a group was political, the group went into the “political group” bucket. The model identified over 27,000 likely political groups from posts between Election Day and Jan. 6.
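This is not the reporters’ actual code, but a minimal sketch of how a text classifier like the one described might look, assuming the scikit-learn library and invented example text (the training snippets and group names below are hypothetical stand-ins for the roughly 2,500 hand-labeled groups):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical hand-labeled examples: text representing each group.
train_texts = [
    "stop the steal recount the ballots now",
    "patriots rally against election fraud",
    "best sourdough starter recipes this week",
    "marathon training plan for beginners",
]
train_labels = [1, 1, 0, 0]  # 1 = political, 0 = not political

# Turn raw text into word-frequency features, then fit a classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# For each unseen group, predict the probability that it is political;
# groups scoring above 50% go into the "political" bucket.
new_texts = ["audit every legal vote today", "knitting pattern swap group"]
probs = model.predict_proba(new_texts)[:, 1]
political = [text for text, p in zip(new_texts, probs) if p > 0.5]
```

A real pipeline would train on far more examples and tune the model, but the core loop — label by hand, fit, score probabilities, threshold at 50% — is the same shape.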
Then, like any good teacher, the reporters checked the machine’s work.
They went through a sample of the groups themselves, and determined the model had a precision rate of about 79%, meaning a little over 1 in 5 of the groups that the machine identified turned out to be false positives.
A C+: Under different circumstances, that wouldn’t be anything to display on the refrigerator door.
But the reporters were perfectly happy with it. They didn’t need A+ work, just an estimate accurate enough to cut down the size of the dataset they were working with.
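The spot check boils down to a precision calculation over a hand-reviewed sample. The numbers below are illustrative, chosen only to match the reported 79%:

```python
# Hypothetical spot check: reviewers hand-label a random sample of
# the groups the model flagged as political.
sample_size = 200
true_positives = 158  # flagged groups that really were political

precision = true_positives / sample_size      # 0.79
false_positive_share = 1 - precision          # 0.21, a little over 1 in 5
```

Precision here answers one narrow question: of everything the model flagged, how much was right? It says nothing about political groups the model missed.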
Finding Insurrection Posts in a Haystack
The reporters then enlisted some more help, this time a text-analysis technique called TF-IDF.
Basically, if you give it a bunch of text, TF-IDF will pull out all the words that are used the most often, with more weight given to the most unusual ones.
“Words like ‘the,’ or ‘a’ or ‘and’ — they don’t really tell you anything about the content of a post, so those would get massively downweighted,” Jeff said.
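The downweighting Jeff describes can be shown with a bare-bones TF-IDF calculation. This is a toy illustration with made-up posts, not the reporters’ code: a word that appears in every document gets an inverse-document-frequency of zero, while rarer words keep their weight.

```python
import math

# Toy corpus: each string stands in for one post.
docs = [
    "the ballots were stolen",
    "the race was called",
    "the treason must stop",
]

def tfidf(term, doc_tokens, corpus):
    """Term frequency in one document times inverse document frequency."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(term in d.split() for d in corpus)   # documents containing term
    idf = math.log(len(corpus) / df)
    return tf * idf

tokens = docs[0].split()
# "the" appears in all 3 documents, so idf = log(3/3) = 0 and its
# score collapses; "stolen" appears in only 1, so it keeps weight.
score_the = tfidf("the", tokens, docs)
score_stolen = tfidf("stolen", tokens, docs)
```

Real implementations add smoothing and normalization, but the principle is exactly this: ubiquitous words score near zero, distinctive words float to the top.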
Reporters sorted for all the Facebook groups with “Stop the Steal” in their names, since these were groups they could be sure sought to delegitimize Biden’s victory.
Then they fed this collection of Stop the Steal groups into TF-IDF, at which point it scratched its head and asked, “Which words make a group a Stop the Steal group?”
It pulled a list of important identifying words or phrases, like “mail-in ballot fraud,” “stop the steal,” “every legal vote” and “treason.” Reporters picked through everything TF-IDF brought up and pulled out the terms that were meaningfully linked to election delegitimization theories; they found about 60.
They also found 86 terms that were linked to delegitimization theories, but only when they appeared alongside other terms. For example, “absentee ballots” on its own doesn’t suggest a post is toxic, but “absentee ballots” plus “fraud” does.
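A keyword search built on those two kinds of terms might look like the sketch below. The term lists are hypothetical stand-ins for the roughly 60 standalone terms and 86 combination terms the reporters identified:

```python
# Hypothetical stand-ins for the reporters' term lists.
STANDALONE_TERMS = {"stop the steal", "every legal vote"}
COMBO_TERMS = [("absentee ballots", "fraud")]

def flags_delegitimization(post: str) -> bool:
    text = post.lower()
    # Some terms are suspect on their own...
    if any(term in text for term in STANDALONE_TERMS):
        return True
    # ...others only when they appear together.
    return any(a in text and b in text for a, b in COMBO_TERMS)

posts = [
    "How do I request absentee ballots?",   # benign on its own
    "Absentee ballots are pure FRAUD",      # flagged: combination
    "Rally tomorrow: stop the steal!",      # flagged: standalone
]
flagged = [p for p in posts if flags_delegitimization(p)]
```

The same logic, run over millions of posts, produces the pool of “likely” delegitimizing posts that the reporters then sampled and verified by hand.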
From there, the team did a keyword search based on the terms TF-IDF had identified, and came back with around 1.03 million posts that likely referenced delegitimization.
They again checked the program’s work by hand, going through a sample of the posts the keyword tool had surfaced and checking if they were actually related to election misinformation. The model had a precision rate of about 64%. (False positives included mainstream news articles about extremism, debunkings of fraud claims and references to other countries’ elections.)
To determine a rough number of delegitimizing posts, reporters multiplied the number of “likely” harmful posts (1.03 million) by the estimated proportion of actually toxic posts (64%).
Which brought them to their final estimate: more than 650,000 posts attacking the legitimacy of the 2020 election.
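The back-of-the-envelope arithmetic behind that figure, using the numbers from the analysis:

```python
# Final estimate: likely posts, scaled by the sampled precision rate.
likely_posts = 1_030_000   # posts surfaced by the keyword search
precision = 0.64           # share verified as true positives in the sample

estimate = likely_posts * precision   # roughly 659,200
```

Rounding down gives the published figure of more than 650,000 posts.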
That number is almost certainly an undercount. As mentioned, this analysis only examined posts in a portion of all public groups, and did not include comments, posts in private groups or posts on individuals’ profiles.
Only Facebook has access to all the data to calculate the true total — and it hasn’t done so publicly.