By Jason Weiner
#DeleteFacebook might have trended in response to news coverage of how consultants for the Trump campaign employed data mining to target voters on digital media. The techniques and tools now made infamous by Cambridge Analytica are not, however, trade secrets or proprietary technology. In fact, just about anyone with a couple of data science courses could learn to employ them given a similar data set; the theory and practice are laid out in a handful of academic journal articles.
The seminal paper in this field of research was published in the Proceedings of the National Academy of Sciences in February 2013: “Private traits and attributes are predictable from digital records of human behavior.” The core finding of the paper, authored by Michal Kosinski, David Stillwell and Thore Graepel, was that Facebook Likes can be used to predict a wide array of personal and personality characteristics.
Stillwell’s contribution to the project was the development of the myPersonality Project Facebook app. The app started as a personal project of Stillwell’s and became a focus of research possibilities after it grew popular. Users voluntarily took a personality test that measures the Big 5 dimensions of personality and then, optionally, provided access to their Facebook profiles. Currently, the resulting database contains over 6 million test results and 4 million Facebook profiles – all provided with the consent of the users and restricted to academic research. (The data notoriously used by Cambridge Analytica followed a similar model of data collection – albeit paying subjects through Amazon’s Mechanical Turk platform – but harvested information not just on survey takers but their friends as well.)
The outcomes of personality profile tests, as well as the demographic and behavioral information provided in survey responses, formed the dependent variables (the things an algorithm attempts to predict) for a series of analyses establishing the predictive power of Facebook Likes. Kosinski and Graepel shaped data on the Facebook Likes of 58,466 participants into independent variables that could be used to predict dichotomous variables such as gender, race, sexual orientation and political party identification. The same independent variables were used to predict continuous variables: age, density and size of Facebook network, and Big 5 personality scores for openness, conscientiousness, extraversion, agreeableness and emotional stability, along with intelligence and satisfaction-with-life scores, also based on established psychological assays.
The results are illustrated in Figure 1 below, which is drawn from the publication by Kosinski and his co-authors. On the left, the effectiveness of predicting dichotomous variables is illustrated as the Area Under the Curve (AUC) statistic of ROC analysis, which plots true positive rate as a function of false positive rate; the closer AUC is to one, the more effective the test, while a value of 0.5 is no better than chance. For continuous variables, illustrated on the right, the models’ effectiveness was measured by the correlation coefficient of the predicted value with the observed value; for personality characteristics, the model predictions based on Facebook data were compared with the test-retest reliability of the personality profiling tool.
Figure 1. The charts illustrate validation measures of models predicting demographic and personality data based on Facebook Likes. The values for dichotomous variables, in the left panel, report the area under the curve that describes the true positive rate as a function of the false positive rate; a value of one is perfect prediction. Continuous variables are illustrated on the right. The bottom three bars show the correlation of the predicted value with the actual value reported or measured. The darker portion of the top bars illustrates the correlation of the model’s prediction with the psychological test for assessing the trait; the lighter bars report the psychological test’s test-retest reliability, that is, how closely a second application of the test agrees with the first.
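To make the two validation measures concrete, here is a minimal Python sketch. It is not code from the paper, and the labels and scores are made-up toy values. It computes AUC as the share of (positive, negative) pairs that the model ranks in the right order, which is equivalent to the area under the ROC curve, and it computes the Pearson correlation used for continuous outcomes.

```python
from math import sqrt

def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def pearson(x, y):
    """Pearson correlation between predicted and observed values."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical toy data: 1 = has the trait, 0 = does not.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]

print(round(auc(labels, scores), 3))  # 8 of 9 pairs are ordered correctly
```

Only one pair is out of order here (the positive scored 0.4 sits below the negative scored 0.5), which is why the result falls just short of a perfect one.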
In other words, at the time of this early research, a Facebook user’s Likes alone could predict gender and ethnic origin (Caucasian vs. African-American) with near total reliability while the same data could predict openness with close to the reliability of the standard psychological assay. The research methods in this study, which was completed in 2012, can only have been refined since.
While these findings were striking at the time and have been profoundly influential among the research community, the methods used are well-known and widely accessible. In fact, they’ve since been spelled out in a 2016 paper, “Mining Big Data to Extract Patterns and Predict Real-Life Outcomes,” published in Psychological Methods, of which Michal Kosinski is a primary author. Using the open source statistical program R and anonymized data from myPersonality, the article walks readers through the steps required to render thousands of digital data points into independent variables that can be used to make predictions about real-life characteristics. Those methods, dimensionality reduction and regression, should be in the toolbox of any aspiring data scientist.
Three data files form the core of the tutorial’s analysis: 110,728 anonymized users with three demographic traits and five personality scores; 1,580,284 Facebook Likes; and 10,612,236 pairs of user ID and Facebook Like ID. To process this data, the article advises the reader to construct a matrix of users and Likes, essentially a table in which each row is a user, each column is a Like and the intersection of each row and column is a zero or one indicating whether that user profile features that Like. The result is a matrix with many zero-value user-Like intersections, poor material for inferring outcomes based on relationships. (For an idea of how sparse the matrix was, consider that the median number of users per Like in the original matrix is one, i.e. at least half of the Likes were liked by only one user.) Normally, some experimentation with thresholds for users per Like and Likes per user is needed, but the authors suggest trimming the matrix to only those rows in which a user is paired with at least 50 Likes and to columns where each Like has been selected by at least 150 users. The trimmed matrix still contains 3,817,840 user-Like pairs, so more data reduction is needed.
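The construction and trimming steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tutorial’s R code: the function names, the toy pairs and the tiny thresholds are all hypothetical, and a dense NumPy array stands in for the sparse matrix format that data at the scale described above would require. The trimming loop repeats because dropping a column can push a user back below the row threshold, and vice versa.

```python
import numpy as np

def build_matrix(pairs, n_users, n_likes):
    """Build a 0/1 user-by-Like matrix from (user_id, like_id) pairs."""
    M = np.zeros((n_users, n_likes), dtype=np.int8)
    for user, like in pairs:
        M[user, like] = 1
    return M

def trim(M, min_likes_per_user, min_users_per_like):
    """Drop sparse rows and columns until both thresholds hold at once."""
    while True:
        keep_rows = M.sum(axis=1) >= min_likes_per_user
        keep_cols = M.sum(axis=0) >= min_users_per_like
        if keep_rows.all() and keep_cols.all():
            return M
        M = M[keep_rows][:, keep_cols]

# Hypothetical toy data: 5 users, 4 Likes.
pairs = [(0, 0), (0, 1), (0, 2),
         (1, 0), (1, 1), (1, 2),
         (2, 0), (2, 1),
         (3, 3),
         (4, 0), (4, 3)]
M = build_matrix(pairs, n_users=5, n_likes=4)

# Toy thresholds (the article cites 50 Likes per user and 150 users per Like).
trimmed = trim(M, min_likes_per_user=2, min_users_per_like=3)
print(M.shape, "->", trimmed.shape)
```

In this toy case the first pass removes two Likes and one user, and that in turn leaves a second user with too few Likes, so a second pass is needed before the matrix settles.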
To facilitate further reduction, the authors choose one technique from each of two families of dimensionality reduction methods, singular value decomposition (SVD) and Latent Dirichlet Allocation (LDA) cluster analysis, both of which are available through functions in widely used statistical analysis programs. The outcome of each technique is a set of groups of Likes that, taken together, explain a large amount of the variation in the underlying data and can therefore more efficiently facilitate analysis of relationships to a dependent variable like user age or intelligence score. The article devotes significant discussion to the correct number of dimensions or clusters to aim for; the authors argue for experimentation while advising that more dimensions are better for prediction and fewer are advisable for explanation. (Generally speaking, this adheres to the advice of statisticians that predictive models are judged only by their performance, so more opacity, which a larger number of predictor variables would inevitably add, is permissible.) By far the most mathematically complicated methods are required in this step, but that should be relatively unsurprising, as the largest practical issue in mining big data is parsing signal from noise, which is essentially the purpose of SVD and cluster analysis techniques.
Figure 2. The correlation of LDA clusters with dependent variables from the prediction models is illustrated in this correlation matrix. The y-axis shows dependent variables (Neu: Emotional Stability, Agr: Agreeableness, Ext: Extraversion, Con: Conscientiousness, Ope: Openness, Political: Political Party) while the color and intensity of each row’s intersection with a cluster indicates the sign (red: negative, blue: positive) and magnitude of the correlation (darker is a higher correlation coefficient).
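For a feel of what the SVD half of this step does, the following Python sketch reduces a toy user-Like matrix to two dimensions. It is a simplified illustration: the function name and the toy matrix are hypothetical, it uses NumPy’s full SVD rather than whatever routine the tutorial’s R code relies on, and it does not cover LDA at all.

```python
import numpy as np

def svd_reduce(M, k):
    """Project users onto the top-k SVD dimensions of a user-Like matrix.

    Returns the user scores (one row per user, k columns), the Like
    loadings (k rows, one column per Like) and the share of the total
    squared singular values captured by the k dimensions kept.
    """
    U, s, Vt = np.linalg.svd(M.astype(float), full_matrices=False)
    user_scores = U[:, :k] * s[:k]
    like_loadings = Vt[:k]
    explained = (s[:k] ** 2).sum() / (s ** 2).sum()
    return user_scores, like_loadings, explained

# Hypothetical toy matrix: users 0-2 share Likes 0 and 1,
# users 3-5 share Likes 2 and 3.
M = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])

user_scores, like_loadings, explained = svd_reduce(M, k=2)
print(user_scores.shape)    # six users, two dimensions each
print(round(explained, 3))  # share of variation kept by two dimensions
```

Because this toy matrix contains only two distinct patterns of Likes, two dimensions reproduce it exactly. Real data is noisier, which is why the choice of the number of dimensions takes experimentation.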
Cluster 5 shows a high positive correlation with age, suggesting this dimension gathers older users as well as users with lower openness and higher conscientiousness scores. Pulling the top ten Likes in a cluster allows one to infer commonalities and this exercise results in the list in Figure 3 for LDA5, whose contents highlight the difficulties with making an inference based on the contents of a cluster: After all, how many older, female, conservative Usher/Adam Sandler/Taylor Swift fans do you know? (Maybe more than you think you do, if we correctly understand the story the data tells.)
Figure 3. LDA5 comprises an array of clustered Likes, including these ten most central to the cluster.
Prediction is a different animal than inference, however, requiring no effort on the data scientist’s part to understand why a relationship is predicted, only that the predictions work. For this, Kosinski and company recommend logistic regression to predict dichotomous variables and linear regression to predict continuous variables. Importantly, they divide their data into sections for cross-validation, training the model on a sizable fraction of the data but holding back a portion to measure the accuracy of predicted versus observed outcomes. More flexibility is afforded to the practitioner in choosing the number of independent variables when modeling for prediction rather than inference. Clusters and decomposed-value vectors both remain viable data reduction techniques for prediction. In fact, the authors suggest iterating through numbers of clusters to find the best fit, judging model performance computationally rather than by hypothesizing a connection among the clustered elements. There’s not much else, essentially, to the technique needed to convert digital fingerprints such as Facebook Likes (or, for instance, groceries purchased) into predictions about demographics and personality.
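The cross-validation idea is simple enough to sketch in Python. What follows is an assumption-laden illustration rather than the authors’ code: the data is synthetic, the function name and the fold count are hypothetical choices, and only the linear regression case is shown. Each observation is predicted by a model that never saw it during training, and accuracy is the correlation between those held-out predictions and the observed values.

```python
import numpy as np

def cross_val_predict_linear(X, y, n_folds=10, seed=0):
    """Out-of-fold predictions from ordinary least squares regression."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    preds = np.empty(len(y))
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        # Fit intercept plus slopes on the training folds only.
        A = np.column_stack([np.ones(train.sum()), X[train]])
        coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        # Predict the held-out fold with the fitted coefficients.
        B = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        preds[test_idx] = B @ coef
    return preds

# Synthetic stand-in for reduced dimensions (X) and a continuous trait (y).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
true_weights = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
y = X @ true_weights + rng.normal(scale=0.1, size=200)

preds = cross_val_predict_linear(X, y)
accuracy = np.corrcoef(preds, y)[0, 1]
print(round(accuracy, 3))
```

A dichotomous outcome would follow the same pattern with logistic regression in place of least squares and AUC in place of the correlation.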
“So what?” you may ask, though maybe you already suspect the answer if you read this far. What does it matter if a researcher can predict someone’s personality from digital fingerprints? Kosinski and Stillwell again formulate part of the answer, teaming in 2017 with Sandra Matz and Gidi Nave, in the Proceedings of the National Academy of Sciences to deliver evidence of “Psychological targeting as an effective approach to digital mass persuasion.” Their finding was that, using nothing but the techniques and data outlined above, “persuasive appeals matched to people’s extraversion or openness-to-experience level resulted in up to 40% more clicks and up to 50% more purchases than their mismatching or unpersonalized counterparts.”
That’s a remarkable increase in persuasive power from employing a collection of information no more ordered or comprehensive than what exists on virtually everyone with a digital life. Cambridge Analytica’s edge (and the censure their actions have received) stemmed from the means by which they accumulated their data, violating Facebook Terms of Service by employing the data for commercial purposes and collecting it from people who had not given consent. But the techniques they used are uncomplicated and widely documented, and there is no reason to suspect they aren’t as readily applicable to the items in your grocery basket, the articles you read on Google News or any other coherent collection of digital fingerprints, all of which are being compiled and linked with behavioral and personality indicators at the speed of computation. The fuckery has just begun.