Pete Cuppernull (pcuppernull@uchicago.edu)
On March 17, 2018, The Guardian and The New York Times published articles about the acquisition of Facebook user data by political consulting firm Cambridge Analytica. The firm collected the data of up to 87 million Facebook users without their knowledge, which it then used while providing various services to political campaigns. Shortly thereafter, Facebook CEO Mark Zuckerberg testified before Congress regarding the company’s policies on user data. After the hearing, only 28% of Facebook users believed the company was committed to user privacy, down from 79% the year prior.[1] These events sparked wider conversations on data privacy and the connection between social media and politics.
These conversations continue today. Significant concern surrounded the integrity of the 2020 U.S. presidential election, with many questioning the political neutrality of internet giants like Facebook, Google, and Twitter. Documentaries such as The Social Dilemma prompted countless viewers to delete social media applications from their phones. And many conservatives in the United States have recently flocked to alternative social media sites in search of uncensored and more equitable platforms.
This project seeks to trace changes in views on data privacy and in perceptions of the societal impact of social media platforms, specifically Facebook, using text data from Reddit. I find evidence that Reddit users’ views of Facebook and social media changed in 2018, and that these changes were largely concentrated among liberal users. This evidence relies heavily on text generation models powered by deep neural networks.
Data and Corpus
I collect a corpus of user content posted to three ideologically distinct subreddits: r/Conservative, the largest community for political conservatives on Reddit; r/Democrats, the primary subreddit for discussions of the U.S. Democratic party; and r/Socialism, which is dedicated to the discussion of socialism as an ideology and is not specific to the United States. I collect all text-based user content posted between January 2017 and September 2019. In this project, I consider all individual comments on a post as separate observations. I used Google BigQuery to collect all posts to each subreddit and removed comments that had either been taken down by the original author (and appeared as “[deleted]” in the data set) or were made by a user whose account had since been removed. After removing these observations, the final corpus consisted of 1.73 million total comments posted across the three subreddits.
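The filtering step described above can be sketched as follows. This is an illustrative example, not the original collection script; the field names ("body", "author") follow the Reddit comment schema exposed in the BigQuery public datasets, and the sample records are hypothetical.

```python
def filter_comments(comments):
    """Drop comments removed by the original author or posted from
    accounts that have since been deleted, as described above."""
    return [
        c for c in comments
        if c.get("body") != "[deleted]" and c.get("author") != "[deleted]"
    ]

# Hypothetical sample records mimicking the BigQuery comment schema
raw = [
    {"body": "I think Facebook went too far.", "author": "user_a"},
    {"body": "[deleted]", "author": "user_b"},                 # taken down by author
    {"body": "Data privacy matters.", "author": "[deleted]"},  # account since removed
]
kept = filter_comments(raw)  # only the first comment survives
```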
Text Generation
There are a variety of challenges to inference when it comes to measuring the views of users towards social media companies. While sentiment analysis of relevant posts might provide some sort of measure of user opinions on various topics, it would be challenging to differentiate between textually similar but semantically distinct topics – for example, it may be difficult to disentangle whether users like or dislike the user experience of the platform from whether they like or dislike the platform’s policy on user data.
I therefore propose an alternative method of examining Reddit users’ views of data privacy and social media: text generation models. Using text generation, we can ask a model to produce a response to any input prompt. In this project, I use GPT-2 within the text generation framework provided by Hugging Face. GPT-2 is a transformer model with a causal language modeling objective – in effect, the model learns to predict the next word in a sequence. GPT-2 was trained on a large corpus of English language text, and I fine-tune the model using my corpus from Reddit. For each subreddit and each month between January 2017 and September 2019, I fine-tune a separate GPT-2 model such that the model learns to generate text in line with the text posted by Reddit users in that subreddit-month. In total, I create 52 separately trained GPT-2 models. Such an approach allows me to pose the same prompts to the succession of models and observe how the distribution of generated text from each prompt varies between subreddits and over time. This strategy serves as a “survey” of user responses on my topics of interest.
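The per-subreddit-month training scheme can be sketched as below. The bucketing logic reflects the design described above; `fine_tune_stub` is a hypothetical placeholder for the actual Hugging Face fine-tuning run, and the sample comments are illustrative.

```python
from collections import defaultdict

def bucket_comments(comments):
    """Group comments by (subreddit, year-month) so that each bucket
    can fine-tune its own GPT-2 model."""
    buckets = defaultdict(list)
    for c in comments:
        key = (c["subreddit"], c["created"][:7])  # "YYYY-MM"
        buckets[key].append(c["text"])
    return buckets

def fine_tune_stub(texts):
    # Placeholder for the real fine-tuning step (e.g., causal language
    # modeling with the Hugging Face transformers library).
    return f"gpt2-model-trained-on-{len(texts)}-comments"

comments = [
    {"subreddit": "Conservative", "created": "2018-03-17", "text": "..."},
    {"subreddit": "Conservative", "created": "2018-03-20", "text": "..."},
    {"subreddit": "democrats",    "created": "2018-03-18", "text": "..."},
]
models = {key: fine_tune_stub(texts)
          for key, texts in bucket_comments(comments).items()}
```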
A primary shortcoming of this approach is in how I have structured my corpus. Because each comment is viewed separately, I do not take into account the context of the thread to which a given comment was posted. In future iterations of this project, I plan to rework the training data to better account for the context of user comments.
Prompts Posed to Text-Generation Models
After creating the models, I posed eight prompts from which the models generated text:
- “My view on Facebook is”
- “I think Facebook”
- “When it comes to data privacy,”
- “When it comes to data privacy, Facebook”
- “Social media’s impact on society is”
- “The impact of social media on society is”
- “Facebook’s impact on society is”
- “Facebook’s impact on democracy is”
Text generation models typically perform best at self-contained tasks, such as generating a single word or a short phrase following an input string. It is often challenging to elicit significant variation when producing multiple open-ended responses to the same input string. Often, longer strings will remain almost entirely the same, with a single word changing between them.
Since my strategy was to collect multiple responses from each prompt as a sort of “survey” of users, the lack of substantial variation across responses posed a challenge – in effect, I was only producing restatements of the “top response”. To address this challenge, I employed a two-stage text generation approach. In the first stage, I produced 10 separate short response strings, only 7 words longer than the prompt. Given that this task is more “self-contained”, there tended to be greater variation in the short responses. I then took the 10 short responses as inputs for a second stage of text generation. The variance in these input strings produced a level of variance in the longer output strings which could not be achieved in a single stage of text generation. Figure 1 further illustrates this process.
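The two-stage procedure can be sketched as below. The `generate` argument stands in for a call to the Hugging Face text-generation pipeline; here a toy random sampler (`toy_generate`, a hypothetical stand-in) substitutes for the fine-tuned GPT-2 models so the control flow is visible.

```python
import random

def two_stage_generate(prompt, generate, n_short=10, short_words=7, long_words=40):
    """Stage 1: produce n_short short continuations (prompt + ~short_words words).
    Stage 2: extend each short string into a longer response. Variance in the
    short strings induces variance in the long strings."""
    shorts = [generate(prompt, short_words) for _ in range(n_short)]
    longs = [generate(s, long_words - short_words) for s in shorts]
    return shorts, longs

def toy_generate(text, max_new_words):
    # Toy stand-in: appends random filler words. A real model would
    # continue the text with sampled tokens.
    filler = random.choices(["data", "privacy", "platform", "users", "policy"],
                            k=max_new_words)
    return text + " " + " ".join(filler)

shorts, longs = two_stage_generate("When it comes to data privacy,", toy_generate)
```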
Table 1 contains a sample of responses to the prompt “When it comes to data privacy” using this two-stage procedure. These results come from the model trained on data from r/Conservative in January 2019.
| Prompt | Stage 1: Short String | Stage 2: Long String |
| --- | --- | --- |
| When it comes to data privacy | When it comes to data privacy, it’s not a good idea to have | When it comes to data privacy, it’s not a good idea to have a government that is so beholden to the private sector. I’m not saying that the government should be able to do anything about it, but |
| When it comes to data privacy | When it comes to data privacy, there are a lot of people who | When it comes to data privacy, there are a lot of people who are willing to compromise on the issue. I’m not saying that the government should be able to compel companies to provide data to the FBI, but I’m saying it should |
| When it comes to data privacy | When it comes to data privacy, there are a lot of things that | When it comes to data privacy, there are a lot of things that can be done to protect your data. The most obvious is to make sure that your personal information is kept private. If you have a personal data breach, you can’t |
Table 1: Sample Text Generation Results
Embeddings
Once I obtain the data set of generated text, I explore variation in the responses by embedding the responses in high-dimensional space, which allows us to observe the “movement” of the responses through this space over time. I employ two main embedding strategies. Both strategies use Doc2Vec, which represents each response as a fixed-length feature vector and overcomes many of the shortcomings of bag-of-words approaches.[2] My first strategy is to use Doc2Vec to embed the responses in two-dimensional space, which allows for easy visualization thereafter. With my second strategy, I use Doc2Vec to project the responses into 50-dimensional space. I then use Uniform Manifold Approximation and Projection (UMAP) to reduce the 50-dimensional embedding to a two-dimensional vector space. Finally, I plot the UMAP results.
Two-Dimensional Doc2Vec Embeddings
Below I present a series of visualizations of the two-dimensional Doc2Vec embeddings. I begin by focusing on the results from r/Conservative. Each figure selects a different prompt and displays the results of the embedding locations across each time period.
Figure 2 shows the two-dimensional Doc2Vec embeddings of the responses generated by the prompt “The impact of social media on society is…”. At first glance, there appears to be minimal correlation between the location of the embedded response and the time window of Reddit data on which the model was trained. Figures 3 and 4 display the responses for two additional prompts: “Facebook’s impact on society is…” and “When it comes to data privacy, Facebook…”, respectively.
Figures 3 and 4 exhibit similar results. There is no obvious clustering of embedding locations across time.
Next, I hypothesized that there may be too much variation in the individual responses to visually observe movement over time. I therefore take the average embedding location of each response per month and plot the results. Figures 5, 6, and 7 display the average monthly embedding locations for the same three prompts within the r/Conservative models:
Again, there does not appear to be any apparent clustering in the embedding locations as a function of time. These results suggest one of two things. First, it may be the case that there is too much noise in the responses such that any “signal” is not readily apparent. This is a plausible outcome given the shortcomings of text generation models in longer-form response generation. Second, these results could suggest that the views of Reddit users on these topics simply did not change over time. It is necessary to further examine the results before making either claim.
Finally, given our theoretical priors on the impact of the Cambridge Analytica scandal on user opinions regarding data privacy, social media, and governance, I specifically compare the embedding results between February 2018 and April 2018, the months before and after the original story broke. Figure 8 displays the results for the prompt “Facebook’s impact on democracy is…”:
In this plot, we can see a relatively clear distinction between the results of the two months – the results of February 2018 appear on the right side of the plot, while those of April 2018 appear on the left. While it remains possible that this distinction between the two months is a result of random chance, this finding would suggest that the responses generated by the models trained on these two months of data are systematically different.
Next, I wish to further explore the extent to which there may be correlation between the embedding locations and time – it is possible that there is a time trend that is not readily apparent from the plots. I turn to a simple OLS regression framework where I regress each embedding dimension (i.e., considering the X and Y axes as dependent variables in separate regressions) on time in two separate specifications:

Specification 1: Embedding = β₀ + β₁ · Time + ε
Specification 2: Embedding = β₀ + β₁ · Time + β₂ · Time² + ε

Where Embedding is the embedding location along the X or Y dimension, Time is the month of data on which the model that produced the embedding was trained (where January 2017 = 1 and September 2019 = 33), and ε is the error term. Specification 1 would capture any linear relationship between time and either of the embedding dimensions, whereas Specification 2 would additionally capture any quadratic relationship.
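The two specifications can be sketched with a simple least-squares fit in numpy. The embedding values below are synthetic, generated with a known linear trend purely to illustrate the regression setup.

```python
import numpy as np

def ols(y, X):
    """Ordinary least squares; returns the coefficient vector."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

rng = np.random.default_rng(0)
time = np.arange(1, 34, dtype=float)  # months indexed from January 2017 = 1
# Synthetic stand-in for one embedding dimension, with a known slope of 0.2
embedding = 0.5 + 0.2 * time + rng.normal(0, 0.01, size=time.size)

# Specification 1: Embedding = b0 + b1*Time + e
X1 = np.column_stack([np.ones_like(time), time])
b1 = ols(embedding, X1)

# Specification 2: Embedding = b0 + b1*Time + b2*Time^2 + e
X2 = np.column_stack([np.ones_like(time), time, time ** 2])
b2 = ols(embedding, X2)
```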
Table 2 shows the results of 32 separate regressions. For each of the eight prompts, I show Specifications 1 and 2 with both the X and Y dimensions as dependent variables. I highlight the results with estimates that are statistically significant at the 95% confidence level:
The results of Table 2 largely confirm the initial findings of the plotted results, that there is minimal correlation between time and the generated text from the models. While there are a handful of statistically significant results across the specifications, we should expect to see a small number of results reach conventional levels of significance even in a dataset of purely random noise. As it stands, there is not enough evidence to make a claim that there was a change in the views of users within r/Conservative on our topics of interest.
Comparison to Other Subreddits
In the rest of the analysis, I compare the results for r/Conservative to two ideologically different subreddits: r/Democrats and r/Socialism. As a reminder, r/Democrats serves as a forum focused on the Democratic Party in the United States, while r/Socialism is a “community for socialists to discuss current events in our world from anti-capitalist perspectives”, not specific to the United States.[3] For these subreddits, I followed the same process for collecting data, fine-tuning GPT-2 models, and generating text responses. r/Democrats and r/Socialism are substantially smaller than r/Conservative, with 269,461 and 349,111 members, respectively, compared to r/Conservative’s 815,374.[4] Upon creating the monthly models, I determined that there was too little training data for these subreddits to generate semantically interesting responses. As such, I opted to create quarterly models, each trained on a three-month period of data from its subreddit.[5]
Here, I seek to examine two questions. First, do we see within-subreddit variation over time? Second, do we see a distinction in the responses between subreddits at a given time?
Beginning with Figure 9, I dive further into the prompt “Facebook’s impact on society is”. This figure displays the two-dimensional Doc2Vec embeddings for the responses of each of the three subreddits, broken out by quarter.
At first glance, it does not appear as if there is a substantial distinction between the results of each of the three subreddits. Considering the panels for Q1 2018, Q2 2018, and Q3 2018, around the time of the report of the Cambridge Analytica scandal (March 17, 2018) and the subsequent congressional hearing of Mark Zuckerberg (April 10, 2018), there is no clear separation of the locations of the embedded responses across subreddit lines. However, we can observe the embeddings generally moving to the right side of the Q2 2018 panel, further than in any of the previous panels. This would suggest that the generated responses exhibit similar semantic changes across all the subreddits during this time period. One interpretation of this result is that the opinion of Reddit users regarding the societal impact of Facebook changed during the spring of 2018 in a way that did not correlate with their political ideology. While speculative, it appears as if this result persists through the Q3 2018 and Q4 2018 panels, suggesting a “sticky” change in opinion that did not fade immediately following the public revelations of the scandal.
In Figure 10, I present the same results while removing the data for r/Conservative. This allows us to better examine any semantic differences between the r/Democrats and r/Socialism subreddits:
Similar to Figure 9, we can observe a movement to the right side of the panel for the embedding positions of the responses for Q2 2018. What is also notable is that there is an observable distinction between responses of r/Democrats and r/Socialism in this same panel, where the r/Socialism responses are positioned above and to the left of the r/Democrats responses. We can also observe a distinction across subreddit lines in Q4 2017 and Q1 2019 – the distinction in 2017 (before the Cambridge Analytica scandal) would suggest that there were causes other than the scandal for the separation in responses between subreddits, although Q2 2018 does represent the cleanest break in the embedding locations of any of the time periods.
Considering that there is more cohesion in the embedding locations within the r/Democrats and r/Socialism subreddits compared to the r/Conservative subreddit, we may hypothesize that there is less variation in opinion of the impact of Facebook on society among individuals who have liberal political ideologies. However, there are several possible alternative explanations for this empirical result. First, r/Conservative is a larger subreddit that captures a range of right-leaning political views, whereas r/Democrats and r/Socialism speak to more specific political preferences. Thus, we should expect greater variation in opinion among posters to r/Conservative on most topics. Second, because r/Conservative is a more popular subreddit, a post from the subreddit is more likely to reach the “front page” of Reddit, where the post becomes easily visible to all Reddit users. Such posts would solicit comments from a wider variety of users who do not necessarily espouse a conservative political ideology, which would produce greater variation among the generated text from models trained on r/Conservative comments. I will not address these alternative explanations here, but hope to in future iterations of this project.
To assess whether this distinction between r/Democrats and r/Socialism is present in Q2 2018 for the other prompts in the data set, Figure 11 presents the Q2 2018 results for all eight prompts.
In viewing the results for Q2 2018 for the other prompts, we observe that our original prompt, “Facebook’s impact on society is”, likely offers the clearest distinction between the r/Democrats and r/Socialism subreddits. But the top left and top right panels – “Facebook’s impact on democracy is” and “My view on Facebook is”, respectively – also offer somewhat clear distinctions between the subreddits for this time period. This suggests that when it comes to users’ views of the high-level impact of Facebook on society and government, there is variation in opinion across ideological lines within left-leaning users.
Furthermore, we do not see a clear distinction between subreddits for the prompts “Social media’s impact on society is” and “The impact of social media on society is”, which effectively substitute out “Facebook” for “social media”. This result suggests that the distinction we observe across the three Facebook-specific panels is a function of users’ views of Facebook, not social media more broadly. Finally, the last two Facebook-specific panels at the bottom of Figure 11 – “When it comes to data privacy, Facebook” and “When it comes to Facebook, the government should” – do not exhibit as clear distinctions as the top three panels. This suggests that while users may have particular views on the societal impact of Facebook, they do not have solidified policy views with respect to Facebook. This finding sits in line with recent research on the economics of privacy and social data, which largely implies that while users may have a preference for greater user privacy, it is challenging to develop viable policy suggestions to address their concerns.[6]
The top left panel of Figure 11, which represents the prompt “Facebook’s impact on democracy is”, appears to offer some clustering of responses between subreddits. Figure 12 displays the quarterly results for this prompt for r/Conservative, r/Democrats, and r/Socialism:
Similar to Figure 9, we do not see any apparent clustering of the results between the three subreddits. The results from r/Conservative again appear to exhibit more variation than the other two subreddits. Accordingly, I remove the results from r/Conservative in Figure 13.
Here, there appears to be less distinct clustering overall, though the results of Q2 2017, Q3 2018, and Q1 2019 all appear to show a degree of noisy clustering. I do not propose any explanation for why we should observe clustering in those particular time periods; I believe it is likely that these results are due to random chance. If so, this would suggest that ideologically liberal Reddit users tend to hold more grounded opinions with respect to Facebook’s impact on society (Figure 10) than with respect to Facebook’s impact on democracy (Figure 13).
Alternative Embedding and Dimension Reduction Strategy
The above represents a single strategy of visualizing the results of the text generation procedure. Below, I employ an alternative approach. Rather than using Doc2Vec to produce two-dimensional embeddings directly, I instead use it to produce a 50-dimensional embedding of each generated response. After creating the new embedding, I use Uniform Manifold Approximation and Projection (UMAP) to reduce the 50-dimensional embeddings to a two-dimensional vector space.
Figure 14 shows the results of UMAP dimension reduction on the responses produced by the prompt “Facebook’s impact on democracy is”.
We can see that in each quarter, the projected answers appear roughly in an inverted “U” shape. For each panel to exhibit the same shape would suggest that there was little semantic change in the responses across the quarterly models, or that there was little change in the opinion of Reddit users on the topic. That is largely what we see here – however, there are minor changes in the results from Q1 2017, Q2 2018, and Q4 2018. While I do not have a sound theoretical explanation for why the responses in Q1 2017 would be systematically different from the rest of the quarters, it is possible that the somewhat unique results in 2018 are due to changing public opinion following the Cambridge Analytica scandal.
With the two-dimensional Doc2Vec embeddings, we found suggestive evidence of a change in the responses to the prompt “Facebook’s impact on society is”. I show the quarterly results of the UMAP representations of this prompt in Figure 15 below.
In Figure 15, the final three quarters of 2018 appear to be systematically different than the rest of the quarters in the data set. Especially considering the distinction between Q1 2018 and Q2 2018 (either side of the Cambridge Analytica scandal), there seems to be a substantial shift in the positions of the UMAP results. The fact that we see a significant change for this prompt, which considers Facebook’s impact on society, and we see at most a marginal change within Figure 14, which considers Facebook’s impact on democracy, suggests that the revelations regarding user data prompted individuals to form new opinions of Facebook’s impact on society more so than on democracy.
Figure 16 considers the results in Q1 2018 and Q2 2018 in more detail. Below, I display the UMAP projections for each subreddit across the two quarters.
Figure 16 provides further evidence for one of our earlier findings – that during 2018, left-leaning Reddit users exhibited greater changes in opinion with respect to Facebook’s impact on society than right-leaning users. In the top two panels, we see a significant shift in the locations of the responses generated by the models trained on data from r/Democrats. There is a similar result in the bottom two panels for the r/Socialism models. The r/Conservative results do seem to exhibit a small shift as well, albeit to a lesser extent. In sum, it appears as if the opinions of all users moved in a similar direction, but the movement was greater among those posting to left-leaning subreddits.
To further assess this finding, I conduct a series of t-tests on the results to assess the statistical significance of the changes in UMAP locations between periods. I conduct a t-test for each subreddit’s X and Y embedding locations, as well as a final round of tests that aggregate all subreddits. Each t-test evaluates the null hypothesis that the means of the populations that produced the two samples – in this case, the results from Q1 2018 and Q2 2018 – are equal. A significant result indicates that we can reject this null hypothesis; in other words, we would have evidence of semantic change in Reddit users’ responses regarding Facebook’s impact on society.
| Subreddit | Dimension | T Statistic | P Value |
| --- | --- | --- | --- |
| r/Democrats | X | -5.987 | <0.001*** |
| r/Democrats | Y | 3.773 | 0.004** |
| r/Conservative | X | -1.674 | 0.100 |
| r/Conservative | Y | 0.818 | 0.417 |
| r/Socialism | X | -3.702 | 0.003** |
| r/Socialism | Y | 3.529 | 0.002** |
| All Subreddits | X | -4.820 | <0.001*** |
| All Subreddits | Y | 3.525 | <0.001*** |
Table 3: T-Tests of Generated Texts from Q1 and Q2 2018[7]
The results of the t-test provide support for our earlier intuitions. We see that the results for the r/Democrats and r/Socialism subreddits exhibit positional changes along both the X and Y axes that are statistically significant at least at the 99% confidence level. The results from r/Conservative do not reach conventional levels of significance. Again, these findings suggest that there was a change in the view of Facebook’s impact on society in the first half of 2018, around the time of the Cambridge Analytica scandal and the subsequent congressional hearing, but that this change was largely concentrated among those with left-leaning political ideologies.
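A per-dimension t-test of this kind can be sketched with scipy. The samples below are synthetic stand-ins for one subreddit's X-dimension UMAP locations in the two quarters; the real values come from the embedding step.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic Q1 2018 vs Q2 2018 X-dimension locations with shifted means
q1_x = rng.normal(loc=0.0, scale=1.0, size=30)
q2_x = rng.normal(loc=1.5, scale=1.0, size=30)

# Welch's t-test of the null hypothesis that the population means are equal
t_stat, p_value = stats.ttest_ind(q1_x, q2_x, equal_var=False)
```

A small p-value here would, as in Table 3, lead us to reject the null hypothesis of equal means between the two quarters.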
I also conduct Kolmogorov-Smirnov tests on the samples, which test the null hypothesis that two samples are drawn from the same underlying distribution. In effect, this moves one step beyond the t-test by considering the shapes of the distributions of responses in addition to their means. Table 4 displays the results for each dimension of each subreddit.
| Subreddit | Dimension | D | P Value |
| --- | --- | --- | --- |
| r/Democrats | X | 0.9 | <0.001*** |
| r/Democrats | Y | 0.9 | <0.001*** |
| r/Conservative | X | 0.2 | 0.594 |
| r/Conservative | Y | 0.233 | 0.393 |
| r/Socialism | X | 0.6 | 0.052 |
| r/Socialism | Y | 0.6 | 0.052 |
| All Subreddits | X | 0.42 | <0.001*** |
| All Subreddits | Y | 0.4 | <0.001*** |
Table 4: Kolmogorov-Smirnov Test of Generated Texts from Q1 and Q2 2018[8]
The results in Table 4 largely fall in line with those of Table 3, with one notable difference – we do not retain as much confidence that the positional changes between quarters for the r/Socialism results are due to a change in the underlying distribution. Of course, “standard” significance levels are arbitrary, but given my initial suspicions of the possibility that these results are driven by random noise, I am hesitant to state with much confidence that the Q1 2018 and Q2 2018 results for r/Socialism were drawn from different underlying distributions.
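The two-sample K-S test can likewise be sketched with scipy. The synthetic samples below share a mean but differ in spread, the kind of distributional difference a t-test can miss but the K-S test detects.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Same-mean samples whose shapes differ (standard deviations 0.5 vs 2.0)
q1 = rng.normal(loc=0.0, scale=0.5, size=200)
q2 = rng.normal(loc=0.0, scale=2.0, size=200)

# Two-sample K-S test of the null hypothesis that both samples
# come from the same underlying distribution
d_stat, p_value = stats.ks_2samp(q1, q2)
```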
Qualitative Findings
I will now briefly examine the text generated by the models from each subreddit in Q1 2018 and Q2 2018 for the prompt “The impact of Facebook on society is”. For the models trained on r/Democrats, which exhibited the greatest movement in the visualizations above, the number of responses that characterized the impact of Facebook as significant increased from 3/10 to 9/10 from Q1 2018 to Q2 2018. Some responses in Q2 described this impact with adjectives like “undeniable” and “huge”. For the models trained on r/Socialism, 7/10 of the results from Q1 2018 characterized Facebook’s impact on society as insignificant, using words such as “negligible”, “very small”, and “limited”. However, for Q2 2018, 7/10 results characterized Facebook’s impact as significant, describing it as “staggering”, “huge”, and “enormous”. These results help contextualize the movement we observed in the embedding locations above.
From the visual representations, we also observed that there was movement within the r/Conservative subreddit similar to that of the left-leaning subreddits, but to a lesser extent. The qualitative results confirm these findings. Within the 30 results generated from r/Conservative in each quarter, 14 viewed Facebook’s impact as substantial in Q1 compared to 19 in Q2. Again, this is a small shift in a similar direction to the left-leaning subreddits, representing a 36% increase in the number of results characterizing the impact as significant (compared to a 200% increase in r/Democrats and 133% increase in r/Socialism). In sum, it is possible that the movement in the visual representations can generally be attributed to the growth of the view that Facebook has a sizable impact on society.
Discussion
The results of this study remain largely speculative and it will require further investigation to make more concrete claims. However, there are a handful of notable findings that I will highlight below.
First, it does appear that there were changes in Reddit users’ views on the societal impact of Facebook, particularly during the first half of 2018. These shifts are concentrated largely among the models trained by posts from liberal subreddits. It also appears that these changes are unique to users’ views on Facebook specifically, as there were no trends observed among the generated texts from prompts regarding “social media”.
Second, the embedding locations of the responses generated by the models trained on liberal subreddits generally exhibited less variance than those trained on conservative subreddits. This would suggest that liberal users had more uniform opinions on topics including data privacy and the role of social media in society than conservative users. Of course, this claim remains speculative and would require further validation.
Third, the changes in the responses of the text generation models appear to be sticky in the medium term. As we observed in Figures 12, 13, and 15, the changes in the embedding locations that started in Q2 2018 persisted until the end of the year, although by the time a full year had passed, the embedding locations had largely returned to pre-Q2 2018 levels.
There are a variety of procedural improvements I intend to make in future iterations of this project. First, I hope to make the project truly multimodal – I plan to create joint embeddings of text and user networks on Reddit to give the models more inferential power. I also intend to use the newly released GPT-3 model in place of the current GPT-2 model. With regard to the generated text, I would seek to generate more responses to more prompts so that we can build greater confidence in our findings with a larger sample size. I could do this by simply generating more short strings for each prompt, combining the results of semantically similar prompts (for instance, the results of “Social media’s impact on society” and “The impact of social media on society” could certainly be evaluated together), or considering a wider range of prompts. I could also look beyond Reddit data to validate my results by conducting a similar study using Twitter posts or articles from ideologically distinct news outlets. While all of these steps could further improve this study, I believe the power of this framework is in its potential to be extended to a wider range of challenging social science questions.
[1] https://www.nbcnews.com/business/consumer/trust-facebook-has-dropped-51-percent-cambridge-analytica-scandal-n867011
[2] https://arxiv.org/abs/1405.4053
[3] https://www.reddit.com/r/socialism/
[4] Membership numbers as of June 7, 2021.
[5] For each year, Q1 includes January – March, Q2 includes April – June, Q3 includes July – September, and Q4 includes October – December.
[6] See, for example, Acemoglu et al. (2019), “Too Much Data”; Bergemann, Bonatti, and Gan (2020), “The Economics of Social Data”
[7] *** p < 0.001, ** p < 0.01, * p < 0.05
[8] *** p < 0.001, ** p < 0.01, * p < 0.05