Fairness and Information


Southern information theorists after the Civil War realized that although they could no longer exclude former slaves from the polls, they could exclude people based on other criteria, say education, and that these criteria happened to be highly correlated with having been a slave who was denied an education. Republican information theorists continue to exploit this observation in ugly ways to exclude voters, for example by disenfranchising voters without street addresses (which affects many Native Americans).

In the era of big data, these deplorable forms of discrimination have become more insidious. Algorithms determine things like what you see in your Facebook feed or whether you are approved for a home loan. Although a loan approval can’t be explicitly based on race, it might depend on your zip code, which may be highly correlated with race for historical reasons.

This leads to an interesting mathematical question: can we design algorithms that are good at predicting things like whether you are a promising applicant for a home loan without being discriminatory? This type of question is the heart of the emerging field of “fair representation learning”.

This is effectively an information theory question. We want to know whether our data contains information about the thing we would like to predict that is not also informative about some protected variable. The contribution of a great PhD student in my group, Daniel Moyer, to the growing field of fair representation learning was to come up with an explicit and direct information-theoretic characterization of this problem. His results will appear as a paper at NIPS this year.

He showed that an information-theoretic approach could be more effective, with less effort, than previous approaches, which rely on an adversary to test whether any protected information has leaked through. He also showed that you can use this approach in other fun ways. For instance, you can generate counterfactual examples: what would the data look like if we changed just the protected variable and nothing else? As a concrete visual example, imagine that our “protected variable” is the digit in a handwritten image. Our neural net learns to represent the image without knowing which specific digit was written. Then we can run the neural net in reverse to reconstruct an image that looks stylistically similar to the original but with any value of the digit that we choose.


Fig. 3 from our paper.

Going back to the original motivation about fairness: even though we can define fairness in this information-theoretic way, it’s not clear that it fits a human conception of fairness in all scenarios. Formulating fairness in a way that meets societal goals for different situations and is quantifiable is an ongoing problem in this interesting new field.

Source: Apparent Horizons

No blog-iversary update

It’s officially been a year since my last blog post. There have been so many exciting new things going on that it’s been hard to take time out for some nice big-picture blog posts. Here are a few areas that I have the best of intentions of getting to.

Source: Apparent Horizons

Macro-causality and social science

Consider a little science experiment we’ve all done, to find out if a switch controls a light. How many data points does it usually take to convince you? Not many! Even if you didn’t do a randomized trial yourself, and observed somebody else manipulating the switch you’d figure it out pretty quickly. This type of science is easy!

One thing that makes this easy is that you already know the right level of abstraction for the problem: what a switch is, that it has two states, and that it often controls things like lights. What if the data you had was actually a million variables, representing the state of every atom in the switch, or in the room?

Even though, technically, this data includes everything about the state of the switch, it’s overkill and not directly useful. For it to be useful, it would be better to boil it back down to a “macro” description that is just a switch with two states. Unfortunately, it’s not easy to go from the micro description to the macro one. One reason is the “curse of dimensionality”: a few samples from a million-dimensional space leave it severely under-sampled, and directly applying machine learning methods to this type of data often leads to unreliable results.

As an example of another thing that could go wrong, imagine that we detect, with p < 0.000001, that atom 173 is a perfect predictor of the light being on or off. Headlines immediately proclaim the important role of atom 173 in the production of light. A complicated apparatus to manipulate atom 173 is devised, only to reveal… nothing. The role of this atom is meaningless in isolation from the rest of the switch. And this hints at the meaning of “macro-causality”: to identify (simple) causal effects, we first have to describe our system at the right level of abstraction. Then we can say that flipping the switch causes the light to go on. There is a causal story involving all the atoms in the switch, electrons, etc., but it is not very useful.
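The “atom 173” trap is easy to reproduce numerically. In this toy sketch (all names and sizes are illustrative, scaled down from a million atoms), we compare a handful of observations of a random on/off label against tens of thousands of irrelevant random binary “atoms”:

```python
import numpy as np

# Ten observations of the light, and tens of thousands of irrelevant
# random binary "atoms" (sizes are illustrative).
rng = np.random.default_rng(0)
n_samples, n_atoms = 10, 50_000
light = rng.integers(0, 2, size=n_samples)
atoms = rng.integers(0, 2, size=(n_samples, n_atoms))

# An "atom 173": a column that matches the light on every observation.
perfect = (atoms == light[:, None]).all(axis=0)
print(int(perfect.sum()), "of", n_atoms, "irrelevant atoms predict the light perfectly")
```

Each random column matches all 10 observations with probability 2⁻¹⁰, so we expect roughly 50,000/1,024 ≈ 49 spurious “perfect predictors”, and an experiment manipulating any one of them would reveal nothing.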

Social science’s micro-macro problem

Social science has a similar micro-macro problem. If we get “micro” data about every decision an individual makes, is it possible to recover the macro state of the individual? You could ask the same question when the micro-variables are individuals and you want to know the state of an organization like a company.

Currently, we use expert intuition to come up with macro-states. For individuals, this might be a theory of personality or mood and include states like extroversion, or a test of depression, etc. After dreaming up a good idea for a macro-state, the expert makes up some questions that they think reflect that factor. Finally, they ask an individual to answer these questions. There are many places where things can go wrong in this process. Do experts really know all the macro states for individuals, or organizations? Do the questions they come up with accurately gauge these states? Are the answers that individuals provide a reliable measure?

Most of social science is about answering the last two questions. We assume we know what the right macro-states are (mood, personality, etc.) and we just need better ways to measure them. What if we are wrong? There may be hidden states underlying human behavior that remain unknown. This brings us back to the light switch example. If we can identify the right description of our system (a switch with two states), experimenting with the effects of the switch is easy.

The mapping from micro to macro is sometimes called “coarse-graining” by physicists. Unfortunately, coarse-graining in physics usually relies on reasoning from the physical laws of the universe, which allows us, for instance, to go from describing a box of atoms with many degrees of freedom to a simple description involving just three macro-variables: volume, pressure, and temperature.

Finding ways to automate coarse-graining for complex systems, as they arise in social science, is one of the main goals of my research. One simple idea, the CorEx principle, has motivated a lot of this work. The principle says that “a good macro-variable description should explain most of the relationships among the micro-variables.” We have gotten some mileage from this idea, finding useful structure in gene expression data, social science data, and (in ongoing work) brain imaging, but I suspect it’s far from enough to completely solve this problem. Coarse-graining that allows us to simplify the causal description of our system (as in the light switch example) seems like a fruitful angle for pushing this research further, and I hope to make or see more progress on this question in the future. (A few ideas about this that I’m aware of are here: 1 2 3 4 5. I would love to know of things I’ve missed!)
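The CorEx principle can be stated quantitatively: the dependence among micro-variables is measured by total correlation, TC(X) = Σᵢ H(Xᵢ) − H(X), and a good macro-variable Y is one that leaves little residual dependence TC(X|Y). Here is a minimal sketch for the two-lights-one-switch case (the toy setup and variable names are mine, not from any CorEx implementation):

```python
import numpy as np
from collections import Counter

def entropy(samples):
    """Empirical Shannon entropy in bits of a sequence of hashable samples."""
    n = len(samples)
    return -sum(c / n * np.log2(c / n) for c in Counter(samples).values())

def total_correlation(cols):
    """TC(X1..Xk) = sum_i H(Xi) - H(X1,...,Xk): total dependence in bits."""
    joint = list(zip(*cols))
    return sum(entropy(list(c)) for c in cols) - entropy(joint)

# Toy micro/macro system: a hidden binary "switch" y drives two observed
# micro-variables (think: two sensors both reading the light's state).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=4000)
x1, x2 = y.copy(), y.copy()

tc = total_correlation([x1, x2])  # ~1 bit of dependence among micro-variables
# A good macro-variable should explain (remove) that dependence:
tc_given_y = sum(np.mean(y == v) * total_correlation([x1[y == v], x2[y == v]])
                 for v in (0, 1))
print(round(tc, 2), round(tc_given_y, 2))  # roughly 1.0 and 0.0
```

Conditioning on the macro-variable removes essentially all of the dependence between the micro-variables, which is exactly what the principle asks a good coarse-graining to do.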

Source: Apparent Horizons

The “Grue” problem (and deep learning)

The Grue language doesn’t have words for “blue” or “green”. Instead Grue speakers have the following concepts:

grue: green during the day and blue at night

bleen: blue during the day and green at night

(This example is adapted from the original grue thought experiment.) To us, these concepts seem needlessly complicated. However, to a Grue speaker, it is our language that is unnecessarily complicated. For him, green has the cumbersome definition of “grue during the day and bleen at night”.

How can we wipe the smug smile off this Grue speaker’s face, and convince him of the obvious superiority of our own concepts of blue and green? What we do is sneak into his house at night and blindfold and drug the Grue speaker. We take him to a cave deep underground and leave him there for a few days. When he wakes up, he has no idea whether it is day or night. We remove his blindfold and present him with a simple choice: press the grue button and we let him go, but press the bleen button… Now he’s forced to admit the shortcomings of “grue” as a concept. By withholding irrelevant extra information (the time of day), grue does not provide any information about visual appearance. Obviously, if we told him to press the green button, he’d be much better off.

We say that grue-ness and time of day exhibit “informational synergy” with respect to predicting the visual appearance of an object. Synergy means the “whole is more than the sum of the parts” and in this case, knowing either the time of day or the grue-ness of an object does not help you predict its appearance, but knowing both together gives you perfect information.
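A quick way to see the synergy is to tabulate the four equally likely (grue-ness, time-of-day) worlds, in which appearance is the XOR of the two, and compute the mutual informations directly (a toy sketch; the 0/1 encodings are mine):

```python
import numpy as np

def mutual_info(joint):
    """Mutual information (in bits) between the row and column variables
    of a 2-D joint probability table."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal of the row variable
    py = joint.sum(axis=0, keepdims=True)   # marginal of the column variable
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px * py)[nz])).sum())

# Four equally likely worlds; appearance = grue-ness XOR time of day
# (grue: green by day, blue by night; bleen: the reverse).
# Encodings: appearance 0=green, 1=blue; time 0=day, 1=night;
# color word 0=grue, 1=bleen.
p_ag = np.zeros((2, 2))   # appearance vs. grue-ness
p_at = np.zeros((2, 2))   # appearance vs. time of day
p_agt = np.zeros((2, 4))  # appearance vs. the (grue-ness, time) pair
for g in (0, 1):
    for t in (0, 1):
        a = g ^ t
        p_ag[a, g] += 0.25
        p_at[a, t] += 0.25
        p_agt[a, 2 * g + t] += 0.25

print(mutual_info(p_ag))   # 0.0 bits: grue-ness alone says nothing
print(mutual_info(p_at))   # 0.0 bits: time of day alone says nothing
print(mutual_info(p_agt))  # 1.0 bit: together they determine the color
```

Each variable alone carries zero bits about appearance, yet the pair carries one full bit: the whole is literally more than the sum of the parts.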

Grues in deep learning

This whimsical story is a very close analogy for what happens in the field of “representation learning”. Neural nets and the like learn representations of some data consisting of “neurons” that we can think of as concepts or words in a language, like “grue”. There’s no reason for generic deep learners to prefer a representation involving grue/bleen to one with blue/green, because either will have the same ability to make good predictions. And so most learned representations are synergistic, and when we look at individual neurons in these representations, they have no apparent meaning.

The importance of interpretable models is becoming acutely apparent in biomedical fields, where black-box predictions can be actively dangerous. We would like to quantify and minimize synergies in representation learning to encourage more interpretable and robust representations. Early attempts to do this are described in this paper about synergy, and another paper demonstrates some benefits of a less synergistic factor model.

Revenge of the Grue

Now, after making this case, I want to expose our linguo-centrism and provide the Grue apologist’s argument, adapted from a conversation with Jimmy Foulds. It turns out the Grue speakers live on an island that has two species of jellyfish: a bleen-colored one that is deadly poisonous and a grue-colored one that is delicious. Since the Grue people encounter these jellyfish on a daily basis and their very lives are at stake, they find it very convenient to speak of “grue” jellyfish, since in the time it takes them to warn about a “blue during the day but green at night” jellyfish, someone could already be dead. This story doesn’t contradict the previous one, but it highlights an important point: synergy only makes sense with respect to a certain set of predicted variables. If we minimize synergies in our mental model of the world, then our most common observations and tasks will determine what constitutes a parsimonious representation of our reality.


I want to thank some of the PhD students who have been integral to this work. Rob Brekelmans did many nice experiments for the synergy paper and has provided code for the character disentangling benchmark task in the paper. Dave Kale suggested key aspects of this setup. Finally, Hrayr Harutyunyan has been doing some amazing work in understanding and improving on different aspects of these models. The code for the disentangled linear factor models is here; I hope to do some in-depth posts about different aspects of that model (like blessings of dimensionality!).

Source: Apparent Horizons

Twitter bots for good, and information contagion!

Our latest work, titled “Evidence of complex contagion of information in social media: An experiment using Twitter bots”, was published in PLOS ONE on September 22, 2017!

In this study, in collaboration with Bjarke Mønsted, Piotr Sapieżyński, and Sune Lehmann from the Technical University of Denmark (DTU), we studied the effects of deploying positive interventions on Twitter using social bots.

The DTU team developed and deployed 39 Twitter bots, which embedded themselves in the community of Twitter users in San Francisco during the second half of 2014. Starting in early October 2014 and throughout the rest of the year, the bots, some of which accrued thousands of followers, began to introduce positive memes (listed in the table) to foster public health, fitness behaviors, and doing social good.

By combining mathematical modeling with statistical techniques, we used the data we collected to study how information spreads on Twitter. In particular, we sought to understand whether information passes from person to person like an epidemic (simple contagion), where each exposure to a meme, like each exposure to a virus, yields an independent probability of adoption, or whether being exposed to the meme multiple times, from multiple sources, greatly enhances the probability of that meme being adopted/retweeted by a user (complex contagion).

Our analysis shows that the complex contagion hypothesis best captures information diffusion dynamics on Twitter. In our experiment, Twitter users naturally partitioned themselves into groups following one bot, two bots, three bots, and so on. This allowed us to record the number and sources of meme exposures for each user in our pool, and therefore to estimate, for the first time in something close to a semi-controlled experiment, which factors play a role in information diffusion online. It appears that, for the type of positive memes we introduced, seeing a meme from multiple sources greatly enhanced the probability of retweeting it.
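The two hypotheses make different predictions about how adoption probability grows with the number of exposures. A minimal sketch (the functional forms are standard textbook variants and the parameter values are illustrative, not estimates from the paper):

```python
# Simple contagion: each of k exposures independently "infects" with
# probability p, so repeated exposures give diminishing returns.
def p_adopt_simple(k, p=0.05):
    return 1 - (1 - p) ** k

# Complex contagion (one common threshold variant, assumed here for
# illustration): adoption is essentially impossible until a user has been
# exposed by several distinct sources, and only then becomes likely.
def p_adopt_complex(k, p=0.15, threshold=3):
    if k < threshold:
        return 0.0
    return 1 - (1 - p) ** (k - threshold + 1)

for k in range(1, 7):
    print(f"{k} exposures: simple={p_adopt_simple(k):.3f}  complex={p_adopt_complex(k):.3f}")
```

Under complex contagion the adoption curve stays flat for the first exposures and then jumps, which is the kind of signature the experiment looked for in the retweet data.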

We hope to use what we learned from this study to improve our ability to deliver online interventions in the future!

You can read the rest of the study in PLOS ONE!

Cite as:

Mønsted B, Sapieżyński P, Ferrara E, Lehmann S (2017) Evidence of complex contagion of information in social media: An experiment using Twitter bots. PLOS ONE 12(9): e0184148. https://doi.org/10.1371/journal.pone.0184148

Press coverage:

  1. Researchers find that Twitter bots can be used for good – Tech Crunch
  2. Twitter Bots Can Encourage Decent Conduct, Not Just Fake News – News18
  3. Twitter bots for good: USC ISI study reveals how information spreads on social media – EurekAlert!


Source: Emilio

Diffusion of ISIS propaganda on Twitter

My latest work, titled “Contagion dynamics of extremist propaganda in social networks”, has been published in Information Sciences. The study aims at modeling and understanding the diffusion of extremist propaganda, in particular content in support of ISIS, on social media platforms like Twitter.

Starting from a list of twenty-five thousand annotated accounts that have been associated with ISIS and suspended by Twitter, we obtained a large Twitter dataset of over one million posts generated by these users. We studied network and temporal activity patterns, and investigated the dynamics of social influence among ISIS supporters.

To quantify the effectiveness of ISIS propaganda and determine the adoption of extremist content in the general population, we drew a parallel between radical propaganda and epidemics spreading. We identified information broadcasters and influential ISIS supporters and showed that they generate highly-infectious cascades of information contagion.

To read further, please refer to the published journal version. The paper is also available on arXiv.

Cite as:

Emilio Ferrara. Contagion dynamics of extremist propaganda in social networks. Information Sciences (2017) doi:10.1016/j.ins.2017.07.030

Source: Emilio

#MacronLeaks, bots, and the 2017 French election

My latest work investigates the #MacronLeaks disinformation campaign that occurred in the run-up to the 2017 French presidential election.

Using a large dataset containing nearly 17 million tweets posted by users between the end of April and May 7, 2017 (Election Day), I first isolated the campaign carried out to allegedly reveal fraud and other illicit activities related to moderate candidate Emmanuel Macron, in support of far-right candidate Marine Le Pen.

Simple new machine learning techniques, devised specifically to analyze the millions of users appearing in this dataset, revealed a large social bot operation and pointed to nearly 18 thousand bots deployed to push #MacronLeaks and related topics. The campaign attracted significant attention on the eve of Election Day, engaging nearly 100 thousand users over the span of a few days.

The analysis revealed important new insights about social bot operations and disinformation campaigns on online social media:

  1. Many bot accounts that supported alt-right narratives in the context of #MacronLeaks were originally created shortly before the 2016 U.S. presidential election and used to support the same views in the context of American politics. The accounts went dark after November 8, 2016, only to re-emerge at the beginning of May 2017 to push #MacronLeaks, attack Macron, and support the far-right candidate Marine Le Pen. This corroborates a recent hypothesis about the existence of black markets for reusable political botnets.
  2. The audience engaged with #MacronLeaks was mainly an English-speaking American userbase rather than French users. Their prior interests prominently featured support for Trump and Republican views, as well as more extreme, alt-right narratives. This suggests a possible explanation for the limited success of the disinformation campaign: French users, those more likely to vote for Macron, were neither mobilized nor significantly engaged in discussing these document leaks.

The paper, titled “Disinformation and Social Bot Operations in the Run Up to the 2017 French Presidential Election”, is set for publication on August 7, 2017 in the peer-reviewed journal First Monday. To learn more about this work, read the preprint paper available on SSRN.

Cite as:

Emilio Ferrara. Disinformation and Social Bot Operations in the Run Up to the 2017 French Presidential Election. First Monday, 22(8), 2017


  1. Fake news bots are so economical, you can use them over and over – Harvard NiemanLab
  2. Pro-Trump Twitter bots were also used to target Macron, research shows – The Verge
  3. There’s a Bit of Overlap Between Bots Trying to Manipulate American and French Elections – New York Magazine
  4. Research links pro-Trump, anti-Macron Twitter bots – The Hill
  5. The Same Twitter Bots That Helped Trump Tried to Sink Macron, Researcher Says – VICE

Press in non-English media

  1. Macron Leaks : Les bots pro-Trump utilisés dans la campagne de désinformation – Le Monde (in French)

Source: Emilio

Gene expression updates

The work with Shirley Pepke on using CorEx to find patterns in gene expression data has finally been published in BMC Medical Genomics.

Shirley wrote a blog post about it as well. She will present this work at the Harvard Precision Medicine conference and we’ll both present at Berkeley’s Data Edge conference.

The code we used for the paper is online. I’m excited to see what people discover with these techniques, but I can also see we have more to do. If speed is an issue (it took us two days to run on a dataset with 6000 genes… many datasets can have an order of magnitude more genes), please get in touch, as we have some experimental versions that are faster. We are also working on making the entire analysis pipeline more automated (i.e., connecting discovered factors with known biology and visualizing predictive factors). To that end, I want to thank the Kestons for supporting future developments under the Michael and Linda Keston Executive Directorship Endowment.


Source: Apparent Horizons

Millions of social bots invaded Twitter!

Our work titled Online Human-Bot Interactions: Detection, Estimation, and Characterization has been accepted for publication at the prestigious International AAAI Conference on Web and Social Media (ICWSM 2017) to be held in Montreal, Canada in May 2017!

The goal of this study was twofold: first, we aimed to understand how difficult it is to detect social bots on Twitter, both for machine learning models and for humans. Second, we wanted to perform a census of the Twitter population to estimate how many accounts are controlled not by humans but by computer software (bots).

To address the first question, we developed a family of machine learning models that leverage over one thousand features characterizing the online behavior of Twitter accounts. We then trained these models with manually annotated collections of examples of human- and bot-controlled accounts across the spectrum of complexity, ranging from simple bots to very sophisticated ones fueled by advanced AI. We discovered that, while human accounts and simple bots are very easy to identify, both by other humans and by our models, there exists a family of sophisticated social AIs that systematically escapes identification by our models and by human snap judgment.

Our second finding reveals that a significant fraction of Twitter accounts, between 9% and 15%, are likely social bots. This translates into nearly 50 million accounts, according to recent estimates that put the Twitter userbase at above 320 million. Although not all bots are dangerous, many are used for malicious purposes: in the past, for example, Twitter bots have been used to manipulate public opinion during elections, to manipulate the stock market, and by extremist groups for radical propaganda.

To learn more, read our paper: Online Human-Bot Interactions: Detection, Estimation, and Characterization.

Cite as:

Onur Varol, Emilio Ferrara, Clayton Davis, Filippo Menczer, Alessandro Flammini. Online Human-Bot Interactions: Detection, Estimation, and Characterization. ICWSM 2017


Press Coverage

  1. CMO Today: Marketers and Political Wonks Gather for SXSW – The Wall Street Journal
  2. Huge number of Twitter accounts are not operated by humans – ABC News
  3. Up to 48 million Twitter accounts are bots, study says – CNET
  4. R u bot or not? – VICE
  5. New Machine Learning Framework Uncovers Twitter’s Vast Bot Population – VICE/Motherboard
  6. A Whopping 48 Million Twitter Accounts Are Actually Just Bots, Study Says – Tech Times
  7. Study reveals whopping 48M Twitter accounts are actually bots – CBS News
  8. Twitter is home to nearly 48 million bots, according to report – The Daily Dot
  9. As many as 48 million Twitter accounts aren’t people, says study – CNBC
  10. New Study Says 48 Million Accounts On Twitter Are Bots – We are social media
  11. Almost 48 million Twitter accounts are bots – Axios
  12. Twitter user accounts: around 15% or 48 million are bots [study] – The Vanguard
  13. Rise of the TWITTERBOTS – Daily Mail
  14. 15 per cent of Twitter is bots, but not the Kardashian kind – The Inquirer
  15. 48 mn Twitter accounts are bots, says study – The Economic Times
  16. 9-15 per cent of Twitter accounts are bots, reveals study – Financial Express
  17. Nearly 48 million Twitter accounts are bots: study – Deccan herald
  18. Study: Nearly 48 Million Twitter Accounts Are Fake; Many Push Political Agendas – The Libertarian Republic
  19. As many as 48 million accounts on Twitter are actually bots, study finds – Sacramento Bee
  20. Study Reveals Roughly 48M Twitter Accounts Are Actually Bots – CBS DFW
  21. Up to 48 million Twitter accounts may be Bots – Financial Buzz
  22. Up to 15% of Twitter accounts are not real people – Blasting News
  23. Tech Bytes: Twitter is Being Invaded by Bots – WDIO Eyewitness News
  24. About 9-15% of Twitter accounts are bots: Study – The Indian Express
  25. Twitter Has Nearly 48 Million Bot Accounts, So Don’t Get Hurt By All Those Online Trolls – India Times
  26. Twitter May Have 45 Million Bots on Its Hands – Investopedia
  27. Bots run amok on Twitter – My Broadband
  28. 9-15% of Twitter accounts are bots: Study – MENA FN
  29. Up To 15 Percent Of Twitter Users Are Bots, Study Says – Vocativ
  30. 48 million active Twitter accounts could be bots – Gearbrain
  31. Study: 15% of Twitter accounts could be bots – Marketing Dive
  32. 15% of Twitter users are actually bots, study claims – MemeBurn
  33. Almost 48 million Twitter accounts are bots – Click Lancashire

Press in non-English media

  1. Bad Bot oder Mensch – das ist hier die Frage – Medien Milch (in German)
  2. Studie: Bis zu 48 Millionen Twitter-Nutzer sind in Wirklichkeit Bots – T3N (in German)
  3. Der Aufstieg der Twitter-Bots: 48 Millionen Nutzer sind nicht menschlich – Studie – Sputnik News (in German)
  4. Studie: Bis zu 48 Millionen Nutzer auf Twitter sind Bots – der Standard (in German)
  5. “Blade Runner”-Test für Twitter-Accounts: Bot oder Mensch? – der Standard (in German)
  6. Bot-Paradies Twitter – Sachsische Zeitung (in German)
  7. 15 Prozent Social Bots? – DLF24 (in German)
  8. TWITTER: IST JEDER SIEBTE USER EIN BOT? – UberGizmo (in German)
  9. Twitter: Bis zu 48 Millionen Bot-Profile – Heise (in German)
  10. Studie: Bis zu 15 Prozent aller aktiven, englischsprachigen Twitter-Konten sind Bots – Netzpolitik (in German)
  11. Automatische Erregung – Wiener Zeitung (in German)
  12. 15 por ciento de las cuentas de Twitter son ‘bots’: estudio – CNET (in Spanish)
  13. 48 de los 319 millones de usuarios activos de Twitter son bots – TIC Beat (in Spanish)
  14. 15% de las cuentas de Twitter son ‘bots’ – Merca 2.0 (in Spanish)
  15. 48 de los 319 de usuarios activos en Twitter son bots – MDZ (in Spanish)
  16. Twitter, paradis des «bots»? – Slate (in French)
  17. Twitter compterait 48 millions de comptes gérés par des robots – MeltyStyle (in French)
  18. Twitter : 48 millions de comptes sont des bots – blog du moderateur (in French)
  19. ’30 tot 50 miljoen actieve Twitter-accounts zijn bots’ – NOS (in Dutch)
  20. 48 εκατομμύρια χρήστες στο Twitter δεν είναι άνθρωποι, σύμφωνα με έρευνα Πηγή – LiFo (in Greek)
  21. 48 triệu người dùng Twitter là bot và mối nguy hại – Khoa Hoc Phattrien (in Vietnamese)

Source: Emilio

Complex System Society 2016 Junior Scientific Award!

I was selected as recipient of the 2016 Junior Scientific Award by the Complex System Society!

The award reads: “Emilio Ferrara is one of the most active and successful young researchers in the field of computational social sciences. His works include the design and application of novel network-science models, algorithms, and tools to study phenomena occurring in large, dynamical techno-social systems. They improved our understanding of the structure of large online social networks and the dynamics of information diffusion. He has explored online social phenomena (protests, rumours, etc.), with applications to model and forecast individual behaviour, and characterise information diffusion and cyber-crime.”


Source: Emilio