
Graph technology helps battle election misinformation

Backed by a grant from Neo4j, including free use of its platform, Syracuse University is exposing coordinated networks that spread election misinformation on social media.

Misinformation has clouded recent U.S. presidential elections and threatens to again in November.

Syracuse University's Institute for Democracy, Journalism and Citizenship (IDJC) is trying to fight that misinformation, backed by a $250,000 grant from graph database specialist Neo4j that includes free use of the vendor's technology.

On Wednesday, the IDJC's ElectionGraph Project released a report titled "Inauthentic Influencers: A Deep-Dive on Outside Groups Buying Social Media Ads That Mention Presidential Candidates" that used graph technology to examine election misinformation in ads on Facebook and Instagram.

Generative AI and other advanced AI technologies are one means of spreading misinformation. In January, some New Hampshire voters received AI-generated robocalls from a voice that sounded like President Joe Biden that instructed them not to vote in the state's primary. Social media sites such as Facebook, Instagram, TikTok and X, formerly Twitter, are other vehicles for spreading misinformation.

X and TikTok do not make advertising data available to researchers. However, Meta, the parent company of Facebook and Instagram, does. To examine the pervasiveness of election misinformation, the IDJC used Neo4j's graph database technology to research Facebook and Instagram ads from September 2023 to April 2024.

Graph databases specialize in discovering relationships between data points in ways traditional relational databases cannot. A graph database stores connections as first-class relationships, so a single record can link directly to many others at once, whereas a relational database must reconstruct such many-to-many connections through join tables at query time. That direct connectivity helps users efficiently surface relationships that would be difficult, or prohibitively slow, to find in a relational database.
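
As an illustration (with made-up page names, not the ElectionGraph dataset), the kind of multi-hop question a graph database answers naturally -- "which pages are reachable from this one through any chain of shared attributes?" -- amounts to a simple traversal over edges, whereas SQL would need an additional self-join for every hop:

```python
from collections import deque

# Hypothetical edges: each pair means two Facebook pages share some
# attribute (an admin email, a phone number, a postal address).
edges = [
    ("Liberty Now", "Liberty First"),
    ("Liberty First", "Freedom PAC"),
    ("Civic Voice", "Town Forum"),
]

# Build an adjacency list -- the in-memory analog of a graph database's
# direct node-to-node links.
adjacency = {}
for a, b in edges:
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)

def reachable(start):
    """Breadth-first traversal: every page connected to `start`
    through any chain of shared attributes."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for neighbor in adjacency.get(page, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

print(sorted(reachable("Liberty Now")))
# ['Freedom PAC', 'Liberty First', 'Liberty Now']
```

A graph database performs this traversal natively over stored relationships; the sketch above only mimics that behavior in memory to show why chained joins are the harder path.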

Among the IDJC's findings was that while most ads mentioning Biden and former President Donald Trump were legitimate, coordinated networks of seemingly disconnected pages had combined to spend millions of dollars to spread misinformation about the upcoming election. For example, a network composed of seemingly separate groups -- identified by shared phone numbers and other common attributes, such as variations on the word liberty in their titles -- spent $1.5 million on ads.


Jennifer Stromer-Galley, a senior associate dean and professor at Syracuse's School of Information Studies, led the research in conjunction with IDJC Director Margaret Talev, a Syracuse professor specializing in American politics, and Research Director Johanna Dunaway, a Syracuse political science professor.

In advance of the release of the IDJC's report, Stromer-Galley discussed the project, including the vital role Neo4j's graph technology played in enabling researchers to discover misinformation spread by coordinated networks among seemingly disparate groups.

Editor's note: This Q&A has been edited for clarity and conciseness.

What was the impetus for using graph technology to research potential election misinformation on Facebook and Instagram?

Jennifer Stromer-Galley: I've been studying political campaigns forever. In the last two U.S. presidential election cycles, I focused on the [messaging by] presidential candidates and some of the down-ballot races, and we built an interactive dashboard for journalists and the public to analyze. I have computational classifiers that measure the content of messages in different dimensions -- is it an attack, is it on a policy matter and things like that.

In 2020, I really wanted to also look at other actors, other organizations, that are engaged in electioneering and messaging. I've done research on misinformation and conspiracy theories. But analytically, it is quite challenging to deal with the large volume of data that comes when you start expanding from candidate messaging to all these other actors. It was a limitation in 2020, and we didn't tackle it.

What enabled you to tackle looking at messaging from other actors in 2024?

Stromer-Galley: Margaret Talev found out that Neo4j was launching a competitive grant for studying misinformation in the 2024 presidential election. When she and I met, she mentioned that there was this potential grant opportunity, and I immediately jumped on it. I had written a grant proposal that would have leveraged Neo4j for a very different project, so I had some sense of the power of knowledge graphs to study networks -- interconnected elements of data -- that are hard to surface in traditional relational databases.

We wrote a proposal, and we were selected.


What is the intent of your research and the publication of the report and dashboard?

Stromer-Galley: The intent of the grant was to leverage Neo4j to identify misinformation on social media and the actors behind those efforts.

One of the challenges we have as researchers is there's a bit of a needle-in-a-haystack problem. There are numerous different places one could potentially look to see if one can find misinformation, but it's not like there's a misinformation classifier out there. There's not something that you can use and it finds all the misinformation that's floating around. It does require a fair amount of manual labor. It also requires access to data, and that's a huge challenge. As the [social media] platforms clamp down on access to their data, it is much harder this election season than it was even as recently as 2020 to collect relevant data that might help surface misinformation.

So, how did you collect that data?

Stromer-Galley: There are known websites that share misinformation, so we considered querying these platforms for posts that were outlinking to known problematic websites. But there are limitations to that, including that what constitutes a problematic website is in the eye of the beholder. For example, conservative audiences will say that Breitbart is not misinformation, that it's the true news that doesn't have left-wing bias. But Breitbart shows up on a lot of lists because some of the information it posts is conspiracy theory ideation. One could say the same thing about Huffington Post [on the left]. My aim as a researcher is to be a neutral observer in this process, so to count Breitbart as problematic but not Huffington Post would itself be problematic.

What we decided to do instead was leverage access we have with the Meta ad library API. Meta makes available an API to approved organizations -- Syracuse University has a licensing agreement to access the API. So, we've been doing research [through the API] since 2018. We have access to these ads, and I have a lot of knowledge about how the ads are structured and the technical aspects of studying that data, so we decided to start there.

We collected the ad data through the API, and then we also collected the available information about who is running these ads, such as the telephone number, email address and physical address.

What did that data enable you to discover?

Stromer-Galley: We leveraged that data to discover linkages and to identify that there are Facebook pages that look to be independent -- they have a name and list who is responsible for the page -- but when you look further, the page will share an administrator's email address with another page that also looks independent. That's what we uncovered.

There have been outlets that have discovered pieces of these networks, but we were able to really see that these networks are interconnected.

How is graph technology better for exposing election misinformation than other technologies?

Stromer-Galley: Graph technology is highly efficient at helping to identify or see relationships in your data that otherwise would be quite hard to discover.

We are novices to Neo4j in terms of using [its programming language] Cypher and understanding the setup of the data. That took us some time because we really had to think about the data schema. We ourselves were still thinking about the data we would need to do this project and discovered that while we were getting the ad data from the ad library API, we had to augment that with page data. Otherwise, we couldn't quite make the connections across the pages. As a result, it took us some time to iterate on the schema. But that's where the work was.
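
As an illustration of the kind of schema and query involved (the node labels, relationship types, and property names here are hypothetical, not the project's actual schema), a Cypher pattern that surfaces seemingly independent pages sharing an administrator email might look like:

```cypher
// Hypothetical schema: (:Page) nodes connected to (:Email) nodes
// by HAS_ADMIN_EMAIL relationships built from ad library page data.
MATCH (p1:Page)-[:HAS_ADMIN_EMAIL]->(e:Email)<-[:HAS_ADMIN_EMAIL]-(p2:Page)
WHERE p1 <> p2
RETURN p1.name, p2.name, e.address
```

The same pattern extends naturally to phone numbers and postal addresses by adding further relationship types, which is why getting the schema right mattered more than the queries themselves.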

Once we had the schema, it was like snapping a finger and suddenly we had this network. It was magic.

What's an example of that graph technology 'magic'?

Stromer-Galley: While we were trying to figure out the schema, we did a little manual analysis using traditional methods to look at all the shared email addresses, all the shared phone numbers. In a spreadsheet, you can start to organize the data based on all the shared email addresses, but you can't then also see which ones of those share phone numbers, and then of those, which ones share postal addresses. That's just not easy in a relational database structure.

That's the power of graph approaches. They're really efficient.
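
A minimal sketch of that linkage step (illustrative records only; the project itself queried Neo4j with Cypher rather than using Python like this): group ad pages into networks whenever they share any contact attribute, using a union-find over emails and phone numbers:

```python
# Hypothetical page records: name plus contact attributes from an ad library.
pages = [
    {"name": "Liberty Now",   "email": "a@x.com", "phone": "555-0101"},
    {"name": "Liberty First", "email": "a@x.com", "phone": "555-0102"},
    {"name": "Freedom PAC",   "email": "b@x.com", "phone": "555-0102"},
    {"name": "Civic Voice",   "email": "c@x.com", "phone": "555-0199"},
]

parent = {}

def find(x):
    # Path-compressing find for the union-find structure.
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# Link each page to every attribute value it carries; pages sharing any
# value end up in the same component, even transitively.
for page in pages:
    for key in ("email", "phone"):
        union(page["name"], (key, page[key]))

# Collect the resulting networks of pages.
networks = {}
for page in pages:
    networks.setdefault(find(page["name"]), set()).add(page["name"])

print([sorted(group) for group in networks.values()])
```

Note the transitivity: Liberty First shares an email with Liberty Now and a phone with Freedom PAC, so all three land in one network -- exactly the chained linkage that is awkward to express as spreadsheet sorts or repeated SQL self-joins.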

Could you have done this research with a relational database -- perhaps over more time -- or was graph technology the only way?

Stromer-Galley: Yes, I think we could have. But I am not the most technical of our team members. My husband, Jon, is a software engineer, and he worked at Oracle for a number of years. He lives and breathes databases and was the first to say, 'Wow, this is amazingly fast.' I am sure that Jon could have massaged the data to get us to the same point, but being able to visualize it and see it in a network graph as opposed to rows and columns does further make concrete what you're looking at. That could have been done in R or with some other visualization approach, but that would have taken a lot more time.

If we hadn't had Neo4j, it would have been much, much harder.

Having not previously worked with Neo4j's graph technology, how long did it take to learn the platform before the 'magic' appeared?

Stromer-Galley: We had a Ph.D. student on the project who is modestly technical, but Jon has 30 years in the field. It was just the two of them at the beginning -- that's two part-time folks. We got started in earnest in February, and we had our 'Aha!' moment at the end of April.

Neo4j, as part of the grant, provides technical support. We have two Neo4j staffers that we can go to with technical questions, and they've been offering office hours to the students [who are helping us on the project] to give them tips on how to structure the data and how to query the data, which sped us up a lot.

If we had to do it all on our own, it would have probably taken us twice as long to get to the 'Aha!' moment.

Were there any hitches along the way or between April and now using Neo4j to conduct your research?

Stromer-Galley: We did have some hitches. I mentioned we had a Ph.D. student on the project, and then in May I onboarded some master's students. They've been getting up to speed on Cypher. But one of the things they say is that conceptually, it is a very different way of thinking about data than the way they have been trained in their data science programs. That's been the hard part. As they're writing code, Cypher is conceptually very different from Python, R or Java. That's been a bit of a challenge.

And I could fuss about Neo4j's data visualizations -- there are much better visualization tools out there -- so it would be nice if there were easy interactive dashboards. We built that on our own. But that's not Neo4j's fault. That's just my expectations as a user.

It's now July -- will you produce more reports between now and the election in November?

Stromer-Galley: We're working on a smaller report. The report we're talking about is 40 pages and could have been 70, so we decided to carve out a smaller report on Facebook pages that either self-identify as news media, or that we have identified as news media, and are running ads. We're wondering if they're legitimate. Are they running ads as advertising to draw people back to their Facebook pages and websites, or are they doing something else?

We've identified a set of ads on these sites that are attacks, so what's going on there? We're going to dive more into that.

And will you use Neo4j's graph technology to help research that report?

Stromer-Galley: Absolutely.

We're also planning a report for early October. I have a classifier that I built in 2022 that identifies ads or messages on social media that talk about election integrity. They're ads that are either defending or raising concerns about subjects like ballot boxes, foreign operatives, failures in election machines. We want to dive into that because early voting will have started by then, and we want to see to what extent there are coordinated efforts around advertisers running ads about election integrity and who's behind it. We might also do a report in October that is an update to the current report.

The last report will be after the election and will be a roundup of sorts.

With the report now out, how can it educate voters about election misinformation?

Stromer-Galley: That's part of the puzzle. We are offering to work with reporters and newsrooms. The dashboard will stay live and free forever. I think Facebook is struggling to stay on top of these proliferating networks -- and that's not an attack on Facebook, because this is hard. We've alerted them to the report, and they are very interested in what we found, but we just started the conversation with them.

Eric Avidon is a senior news writer for TechTarget Editorial and a journalist with more than 25 years of experience. He covers analytics and data management.
