Aabir Abubaker Kar (aabir@uchicago.edu) & Nak Won Rim (nwrim@uchicago.edu)
0. Introduction
The machine learning and deep learning communities have rapidly embraced the idea of representation learning, as much for its utility as for its elegance. Learned representations are simply arrays of numbers, also called vectors or embeddings, of pretty much anything that can be stored as data, such as words, audio files, or images. Mathematically, it makes sense that a data point can be better represented in the context of other data points by ascribing a vector to it. This allows different axes or dimensions of the vector to take on semantic meanings (albeit in ways that are difficult to interpret) – allowing for conceptions such as the ‘gender axis’ for word embeddings (e.g., Kozlowski, Taddy & Evans, 2019).
It is important to note that these semantic meanings may be inferred implicitly by co-occurrence relationships. In the case of Word2Vec (Mikolov et al., 2013), word vectors get pulled closer together when they are used in similar contexts (in the neighborhoods of similar words in the text) and get pushed apart when they are not. The emergent ‘gender axis’ is a happy coincidence.
From this, questions naturally arise – Can we embed explicit “knowledge” or relations, rather than acquiring them implicitly? Can these embeddings tell “fake news” apart from true facts? Can these embeddings provide us with insights, new ways of looking at old problems, or even new solutions?
Furthermore, another thing to consider is that general “knowledge” in the world right now is corrupted. Centuries-old implicit biases are hidden inside the data we have, and many people are actively making and consuming “alternative facts” or “fake news,” both intentionally and unintentionally. Therefore, however hard we try to curate our data to contain only true facts, there will always be some hidden fake facts within the data. Indeed, many works in machine learning and deep learning show that bias in the data can be incorporated into the model (e.g., Rekabsaz, Kopeinik & Schedl, 2021). Because of this, we need models to be “skeptical” of fake facts – the model should be able to learn the true embedding of the facts even though it is learning from data that contains a mix of true facts and false facts labeled as true.
Building on these backgrounds, this project uses a family of algorithms called Knowledge Graph Embeddings to explore algorithmic analogs to information (knowledge), disinformation, skepticism, and credibility prediction. Specifically, we ask the following questions:
- How does an algorithmic model of true facts perform on the binary classification of an unseen mix of true and false facts?
- What if the model is trained on a mix of true and false facts?
- What types of false facts are most likely to be ‘believed’ (i.e., classified as true)?
- Can we devise better ways of inferring what true facts support or negate other facts?
- Can these insights lead us towards a framework for algorithmic skepticism?
Although it will be impossible to answer all these questions in a single class project, we will touch upon many of them. To do this, we first need to codify the representations of facts or knowledge. Thus, we will begin with a whirlwind tour of Knowledge Graphs, which are a convenient way of representing facts and knowledge.
1. What Are Knowledge Graphs?
How can we represent a fact or a piece of knowledge? There could be many ways, but one simple solution is to write out the relationship between entities. For example, let’s consider the factual statement “Quentin Tarantino directed Pulp Fiction” (a statement chosen solely because of one of the authors’ passion for the movie). There are two entities in this statement, “Quentin Tarantino” and “Pulp Fiction,” which are connected by the relationship “directed.” Thus, this factual statement could be represented as a triple: (Quentin Tarantino, Film Director, Pulp Fiction). Note that names like “Directed by” could be more intuitive for the relation, but we are using a name that is closer to the relation name in the FB15K-237 dataset, which is “/film/director/film.” We describe the dataset in more detail in subsequent sections.
The triple mentioned above is an example of a (head entity, relation, tail entity) triple. The name makes intuitive sense – these triples store the relation associating two entities (a head entity and a tail entity). Note that this is not the only name for these triples – they are also called (subject, predicate, object) triples or (subject, verb, object) triples. As (head entity, relation, tail entity) is the most common terminology, we will use this term and its abbreviation HRT triple throughout this post.
We can often easily find many of these triples, even for a single head entity or tail entity. Continuing with the example of Quentin Tarantino as a head entity, there could be many other triples like (Quentin Tarantino, Nationality, United States of America) or (Quentin Tarantino, Place of Birth, Knoxville). The beautiful thing about these triples is that they can easily be converted into a graph data structure. The entities can be seen as nodes in the graph, and the relations between them can be the edges. We call these structures Knowledge Graphs. Now let’s see an example of a Knowledge Graph based on the triples we mentioned before (and some more arbitrary triples):
From the above example, we can see that Knowledge Graphs are multidigraphs. In other words, a Knowledge Graph is a graph structure that has directed edges (e.g., (Quentin Tarantino, Film Director, Pulp Fiction) is not the same as (Pulp Fiction, Film Director, Quentin Tarantino)) and that allows parallel edges (e.g., (Quentin Tarantino, Film Director, Pulp Fiction) and (Quentin Tarantino, Performance, Pulp Fiction) can co-exist within the graph). With this structure, we can see that a Knowledge Graph can visualize the relationships between entities in a compact, human-interpretable way. Furthermore, it can represent diverse types of relations and diverse types of entities within the same structure. These characteristics have made Artificial Intelligence and Deep Learning researchers greatly interested in utilizing this structure to embed knowledge and apply it to tasks such as web search.
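As a concrete illustration, here is a minimal sketch of how the toy triples above could be stored as a multidigraph in Python with networkx (illustrative only; this is not part of our pipeline):

```python
import networkx as nx

# Toy HRT triples from the running example
triples = [
    ("Quentin Tarantino", "Film Director", "Pulp Fiction"),
    ("Quentin Tarantino", "Performance", "Pulp Fiction"),
    ("Quentin Tarantino", "Nationality", "United States of America"),
    ("Quentin Tarantino", "Place of Birth", "Knoxville"),
]

# A Knowledge Graph is a multidigraph: directed edges, parallel edges allowed
kg = nx.MultiDiGraph()
for head, relation, tail in triples:
    kg.add_edge(head, tail, relation=relation)

# Parallel edges between the same pair of nodes are preserved
print(kg.number_of_edges("Quentin Tarantino", "Pulp Fiction"))  # 2
```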
2. From Knowledge Graphs to Embeddings
As mentioned in the introduction, converting objects into a numeric vector is a common practice in machine learning and deep learning. Thus, the natural next step after constructing the Knowledge Graph is to convert the graph into a numeric vector space. Many Knowledge Graph Embedding algorithms were developed to find lower-dimensional vector representations for each entity and each relation in the Knowledge Graph. These include TransE (Bordes et al., 2013), DistMult (Yang et al., 2015), HolE (Nickel, Rosasco, Poggio, 2015), ConvE (Dettmers, Minervini, Stenetorp & Riedel, 2018), SimplE (Kazemi and Poole, 2018), KG2Vec (Wang et al., 2021), and many others. These algorithms have different structures and goals, but they are essentially designed to capture the characteristics of Knowledge Graphs using vector representations.
Once we have the Knowledge Graph Embeddings, it becomes easy to adapt them to downstream tasks like link prediction (what is the most likely relationship between a head and a tail entity?), node classification (can the relationships between entities help us classify them?), and question answering (what is the most likely tail entity for a given relationship with a given head entity?). The most powerful feature of these embeddings is that they can potentially learn ‘generalized’ structures for inferring facts that they were never even trained on. By contrast, a regular Knowledge Graph lookup can only match nodes and edges that are in the graph.
In general, a knowledge graph embedding model has five parts: a look-up layer, a negative sampler module, a scoring function f, a loss function L, and an optimizer. We input a batch of HRT triples in the form of their indices. The look-up layer looks up their embeddings by index, and the scoring function computes a score for these true HRT triples t+. The negative sampler module generates synthetic false HRT triples t- for each of the input true triples, and we compute scores for them as well. We then use both the scores of the true triples t+ and the scores of the synthetic false triples t- to compute the loss and optimize it, such that the embeddings maximize the score for true triples and minimize the score for synthetic false triples. A commonly used loss is the pairwise margin-based hinge loss introduced by Bordes et al. (2013):
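Written in score form, with $f$ the scoring function, $\gamma$ a margin hyperparameter, and $[x]_+ = \max(0, x)$, the loss is roughly

$$\mathcal{L} = \sum_{t^+} \sum_{t^-} \left[\, \gamma - f(t^+) + f(t^-) \,\right]_+$$

so that true triples are pushed to score at least $\gamma$ higher than their synthetic negatives.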
Different Knowledge Graph Embedding models differ along various axes, such as the scoring function, neural architecture, tensor representation, loss function, and negative sampling strategy, with the main difference being the scoring function. Many of these methods have been implemented in open-source Python packages like AmpliGraph, TorchKGE, scikit-kge, pykg2vec, and OpenKE. We will introduce the three Knowledge Graph Embedding models we used for this project: a translational embedding model named TransE and two bilinear models named DistMult and HolE.
TransE
TransE (Bordes et al., 2013) is a translational model that attempts to learn entity and relationship embeddings (of the same dimension) such that the embeddings of true HRT triples satisfy
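$$\mathbf{e}_{\text{head}} + \mathbf{e}_{\text{relation}} \approx \mathbf{e}_{\text{tail}}$$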
For example, if we train a TransE model on a knowledge graph that contains the (Quentin Tarantino, Film Director, Pulp Fiction) triple, we want the sum of the resulting vector embeddings for “Quentin Tarantino” and “Film Director” to be close to the vector embedding for “Pulp Fiction”.
We therefore attempt to minimize the following quantity (with appropriate modifications for using margin loss):
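$$\left\lVert \mathbf{e}_{\text{head}} + \mathbf{e}_{\text{relation}} - \mathbf{e}_{\text{tail}} \right\rVert$$

where the norm can be either the L1 or the L2 norm.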
DistMult
DistMult (Yang et al., 2015) is a bilinear algorithm. Here, learning the relationship is framed as a tensor factorization problem
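$$f(\text{head}, \text{relation}, \text{tail}) = \mathbf{e}_{\text{head}}^{\top}\, M_{\text{relation}}\, \mathbf{e}_{\text{tail}}$$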
where $M_{\text{relation}}$ is a diagonal matrix. The scoring function is simply the dot product of the triple embeddings, as follows:
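$$f(\text{head}, \text{relation}, \text{tail}) = \sum_{i} e_{\text{head},\,i}\; e_{\text{relation},\,i}\; e_{\text{tail},\,i}$$

where $\mathbf{e}_{\text{relation}}$ is the diagonal of $M_{\text{relation}}$.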
For example, let’s consider the (Quentin Tarantino, Film Director, Pulp Fiction) triple again. The entities “Quentin Tarantino” and “Pulp Fiction” are still represented as vectors, but the relation “Film Director” is now represented as a diagonal matrix (a matrix in which the values outside the main diagonal are all zero). The algorithm then tries to find vectors and matrices that maximize the value of “Quentin Tarantino” ⋅ “Film Director” ⋅ “Pulp Fiction” (along with the scores of the other true HRT triples in the knowledge graph).
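To make the two scoring functions concrete, here is a minimal numpy sketch with made-up toy vectors (the numbers are purely illustrative; the experiments below use 250-dimensional embeddings):

```python
import numpy as np

# Made-up 4-dimensional embeddings for illustration only
tarantino = np.array([0.1, 0.9, -0.3, 0.4])
film_director = np.array([0.5, -0.2, 0.6, 0.1])
pulp_fiction = np.array([0.6, 0.7, 0.3, 0.5])

# TransE: for a true triple, head + relation should be close to tail,
# so the distance below should be small
transe_distance = np.linalg.norm(tarantino + film_director - pulp_fiction)

# DistMult: the relation acts as a diagonal matrix, so the score reduces to
# an element-wise product summed over dimensions; higher means "more true"
distmult_score = np.sum(tarantino * film_director * pulp_fiction)

print(transe_distance, distmult_score)
```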
HolE
HolE (Nickel, Rosasco & Poggio, 2015) is a holographic method that shows promising results for more complex ‘compositional’ representations. It uses a tensor product to obtain all pairwise multiplicative interactions as useful features for training the embeddings, and it introduces the idea of circular correlation, which is beyond the scope of this post. Note that HolE is a computationally expensive method due to the computation of all the circular correlations.
3. Experiment 1: Effect of Corrupted Data on Knowledge Graph Embeddings
Now that we have some intuition about Knowledge Graphs and Knowledge Graph Embeddings, let’s move on to the first experiment. For the first experiment, we aim to examine two things: (1) How well do Knowledge Graph Embeddings trained purely on true facts perform on the binary classification of an unseen mix of true and false facts? and (2) What happens to the performance when we train the Knowledge Graph Embeddings on a mix of true and false facts?
Dataset: Structured SVO corpus
For the first experiment, we used the Structured SVO corpus (Jenatton, Roux, Bordes & Obozinski, 2012). This dataset consists of (subject, verb, direct object) triples extracted from Wikipedia. It contains 1.3M triples, with 30K entities and 4.5K relation types. Table 1 shows a very small sample of the structured SVO corpus.
| head entity | relation | tail entity |
|---|---|---|
| institute | redesign | curriculum |
| whaler | pull | whale |
| human | have | allergy |
| constitution | make | decree |
| film | use | cinematography |
Experiment Design
Step 1: Training Knowledge Graph Embedding with Corrupted Data
First, we split the Knowledge Graph into three separate sets: a set to be used in training the Knowledge Graph Embeddings, a set to be used in training the logistic regression model, and a set to be used for testing the logistic regression model. Then, for the set used in training the Knowledge Graph Embeddings, we created three different versions: (1) a version with no corruption, (2) a version with 80% true facts and 20% false facts, and (3) a version with 50% true facts and 50% false facts. The Knowledge Graph Embedding models (TransE, DistMult, and HolE; all models used 200 epochs, 250 dimensions, and a 0.0005 learning rate) were trained separately on these versions. An immediate question that arises here is: how do we corrupt the triples to obtain the false facts? For the first experiment, we used Bernoulli negative sampling, a conventional negative sampling method for corrupting triples. We will explain this strategy in more detail after explaining the second step of the experiment.
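As a rough sketch of this setup (the `corrupt_fn` argument below is a placeholder standing in for the Bernoulli sampling routine described later, not our exact code):

```python
import random

def make_training_kg(true_triples, corruption_rate, corrupt_fn):
    """Replace a fraction of the true triples with corrupted ones, keeping the size fixed.

    `corrupt_fn` stands in for the Bernoulli negative sampling routine
    described in the next sections.
    """
    triples = list(true_triples)
    random.shuffle(triples)
    n_corrupt = int(len(triples) * corruption_rate)
    corrupted = [corrupt_fn(t) for t in triples[:n_corrupt]]
    return corrupted + triples[n_corrupt:]

# Three training versions of the embedding training set:
# kge_train_100 = make_training_kg(kge_train_triples, 0.0, bernoulli_corrupt)  # no corruption
# kge_train_80  = make_training_kg(kge_train_triples, 0.2, bernoulli_corrupt)  # 80% true / 20% false
# kge_train_50  = make_training_kg(kge_train_triples, 0.5, bernoulli_corrupt)  # 50% true / 50% false
```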
Step 2: Knowledge Graph Embedding Evaluation using Logistic Regression
After the Knowledge Graph Embedding is trained, we use two sets of true HRT triples that the model has not seen to evaluate the embeddings. First, 50% of the triples in the unseen sets are corrupted using Bernoulli negative sampling (again, refer to the next section for the specifics of this strategy) to become false. Then, for each triple in both sets, the vectors for the entities and the relation are looked up in the trained knowledge graph embedding model, and the entity vectors and relation vector are concatenated into a single vector. These sets of concatenated vectors and the truth labels are used to train and test a Logistic Regression model, yielding a training accuracy and a test accuracy. The test accuracy is the main metric for how well the trained Knowledge Graph Embedding estimates the credibility of new, unseen fact claims.
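Here is a minimal sketch of this evaluation step using scikit-learn, where the entity and relation lookup tables are assumed to come from the trained embedding model (all names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def triple_features(triples, entity_emb, relation_emb):
    """Concatenate head, relation, and tail vectors into one feature vector per triple."""
    return np.array([
        np.concatenate([entity_emb[h], relation_emb[r], entity_emb[t]])
        for h, r, t in triples
    ])

def evaluate_embeddings(train_triples, y_train, test_triples, y_test,
                        entity_emb, relation_emb):
    """Train/test accuracy of a logistic regression on concatenated embeddings.

    y_train / y_test are 1 for true triples and 0 for corrupted ones.
    """
    clf = LogisticRegression(max_iter=1000)
    clf.fit(triple_features(train_triples, entity_emb, relation_emb), y_train)
    train_acc = clf.score(triple_features(train_triples, entity_emb, relation_emb), y_train)
    test_acc = clf.score(triple_features(test_triples, entity_emb, relation_emb), y_test)
    return train_acc, test_acc
```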
In addition to the evaluation task mentioned above, we also tried using pre-trained GloVe (Pennington, Socher & Manning, 2014) embeddings (with two dimensionalities: 50 and 100) as the lookup model, to serve as a baseline to compare against. In other words, for each word in the entities and relations, we look up the vector for that word in the GloVe model, concatenate the vectors, and use them as inputs. GloVe is an unsupervised method for obtaining word vector embeddings based on word co-occurrence in a corpus. Thus, GloVe retains the contextual similarities of the words, but not necessarily the explicit relations between them. Therefore, the GloVe results serve as an interesting comparison (and a baseline) to a model trained explicitly on relational knowledge.
Negative Sampling Strategy: Bernoulli Negative Sampling
Before jumping into the Bernoulli negative sampling we mentioned in the experiment design section, let’s talk about negative sampling in general. Negative sampling refers to the process of creating false HRT triples. Although we only mentioned negative sampling in the context of corrupting the data for the training and evaluation task, negative sampling is also needed in training the Knowledge Graph Embedding (recall that a negative sampler module was one of the components of a Knowledge Graph Embedding model). This is because Knowledge Graphs are constructed only of true HRT triples, but the model needs to contrast the true HRT triples with false HRT triples to learn a proper embedding (by analogy, the model needs to learn what is far away as well as what is close in order to know the positions of objects). Therefore, in the current pipeline, negative sampling is used in three parts: (1) finding negative examples for the optimizer to treat as false during the training of the Knowledge Graph Embedding, (2) corrupting the training set that is input into the model, and (3) corrupting the training set and test set for the logistic regression.
The easiest and most conventional way to do negative sampling is to corrupt the true triples. This is based on the local closed-world assumption, which states that the Knowledge Graph is locally complete and that any absent fact is necessarily false. Thus, by replacing the head or the tail entity in the original (head entity, relation, tail entity) triple with some other entity from the Knowledge Graph, we can obtain a corrupted, false triple (head entity′, relation, tail entity) or (head entity, relation, tail entity′). Although it is technically possible to corrupt the relation as well, it is conventional to corrupt the head or the tail entity. One thing to note is that even after we replace the head or the tail entity with another entity, the HRT triple could still be true. Thus, we have to check that the corrupted triple does not occur elsewhere in the Knowledge Graph.
To think about corrupting HRT triples more systematically, we can see that this is a two-step process: (1) choosing whether to corrupt the head or the tail entity and (2) choosing what entity should be substituted in to create the false triple. Let’s first explore some methods used for the first step. Note that in this project we focus less on the first step, and we use Bernoulli negative sampling for it in all cases.
Uniform Negative Sampling for Head or Tail Decisions
As with all binary decision problems, the simplest approach is a fair coin toss. Uniform negative sampling does exactly that (for those who are not too familiar with probability distributions, “uniform” is just a fancy way to say equal probability for all possible values). You toss a coin; if it lands heads, you corrupt the head entity, and if it lands tails, you corrupt the tail entity. In other words, there is always a 50% chance of corrupting the head entity and a 50% chance of corrupting the tail entity, independent of any other factors within the Knowledge Graph.
Bernoulli Negative Sampling
Although uniform sampling is a simple and reasonable way of choosing between corrupting the head or the tail entity, there can certainly be a more nuanced way to make the decision. Bernoulli negative sampling, suggested by Wang, Zhang, Feng, and Chen (2014), uses a strategy that utilizes the mapping property of the relation to make this decision. The main intuition is that if the relationship is mostly one-to-many, we will get fewer false-negative samples (false samples that are actually true) if we corrupt the head entity more often, and if the relationship is mostly many-to-one, we will get fewer false-negative samples if we corrupt the tail entity more often. Let’s consider the (Quentin Tarantino, Film Director, Pulp Fiction) example again to get a more concrete sense of this intuition.
As with many other creative works, it is quite common for a director to direct multiple movies (over time), but it is much rarer for a movie to be directed by multiple directors. This makes this relationship a one-to-many relationship. Connecting to the running example, Quentin Tarantino directed multiple movies other than Pulp Fiction (Reservoir Dogs, Inglourious Basterds, etc.), but Pulp Fiction was only directed by Quentin Tarantino. Thus, if we search the Knowledge Graph, there will be multiple HRT triples that have Quentin Tarantino as the head entity and Film Director as the relation, but there will be only one HRT triple that has Film Director as the relation and Pulp Fiction as the tail entity. Because of this, we have a higher chance of getting a false-negative triple when we corrupt the tail entity than when we corrupt the head entity. This difference gets even bigger if we consider that the local closed-world assumption is quite unlikely to hold. It is quite unlikely that Pulp Fiction has another director who was omitted from the Knowledge Graph, but it is more plausible that Quentin Tarantino directed some less popular movie that is not included in the Knowledge Graph. Therefore, corrupting the tail entity has a higher chance of producing a false negative by omission from the Knowledge Graph.
To implement this intuition in the sampling method, we need to calculate two values for each relation: the average number of tail entities per head entity (tph) and the average number of head entities per tail entity (hpt). As the names suggest, tph is the average number of tail entities connected to a head entity by a relation (e.g., the number of movies connected to Quentin Tarantino through the relation Film Director) and hpt is the average number of head entities connected to a tail entity by a relation (e.g., the number of directors connected to Pulp Fiction through the relation Film Director). When tph is much higher than hpt, the relationship is mostly one-to-many, so the head entity should be corrupted more often. Vice versa holds when hpt is much higher than tph. Also, when tph and hpt are about the same, we should corrupt the head and tail entity with similar probability. This can be achieved by corrupting the head entity with probability tph / (tph + hpt) and corrupting the tail entity with probability hpt / (tph + hpt). This is exactly what Bernoulli negative sampling does.
Uniform Negative Sampling for Entity Selections
So far, we have seen how Bernoulli negative sampling chooses between corrupting the head entity or the tail entity. But how does it choose the entity to swap in for the head or the tail entity (the second step depicted in Figure 7)? It takes a very simple approach: choose any entity in the Knowledge Graph, regardless of where that entity appears, with equal probability across all entities. In other words, it is again doing uniform negative sampling, but the choice is no longer binary.
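Putting the two steps together, here is a minimal sketch of the Bernoulli corruption procedure as we understand it (triples are assumed to be stored as hashable tuples so that the false-negative check is a simple set lookup):

```python
import random
from collections import defaultdict

def bernoulli_probabilities(triples):
    """For each relation, compute P(corrupt head) = tph / (tph + hpt)."""
    tails_per_head = defaultdict(set)   # (relation, head) -> set of tails
    heads_per_tail = defaultdict(set)   # (relation, tail) -> set of heads
    for h, r, t in triples:
        tails_per_head[(r, h)].add(t)
        heads_per_tail[(r, t)].add(h)

    prob_corrupt_head = {}
    for r in {rel for _, rel, _ in triples}:
        tph_counts = [len(v) for (rel, _), v in tails_per_head.items() if rel == r]
        hpt_counts = [len(v) for (rel, _), v in heads_per_tail.items() if rel == r]
        tph = sum(tph_counts) / len(tph_counts)
        hpt = sum(hpt_counts) / len(hpt_counts)
        prob_corrupt_head[r] = tph / (tph + hpt)
    return prob_corrupt_head

def bernoulli_corrupt(triple, entities, known_triples, prob_corrupt_head):
    """Corrupt head or tail (Bernoulli choice), replacing it with a uniformly sampled entity."""
    h, r, t = triple
    while True:
        if random.random() < prob_corrupt_head[r]:
            candidate = (random.choice(entities), r, t)   # corrupt the head
        else:
            candidate = (h, r, random.choice(entities))   # corrupt the tail
        if candidate not in known_triples:  # avoid false negatives already in the KG
            return candidate
```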
Results
The results from Experiment 1 are presented in Table 2. Note that results from HolE with different truth shares are not presented here, as HolE did not converge in 200 epochs. As mentioned before, the main evaluation metric is the test accuracy, which captures how well embeddings trained on the Knowledge Graph serve as features for predicting the truth of unseen (test) facts.
One thing that immediately catches the eye is that the GloVe embeddings are doing terribly, showing accuracy near 50%, which is the chance accuracy expected from random choice or singleton predictions. This suggests that explicit relational knowledge between the entities is needed, at least to some degree, for the features to be useful for truth detection. On the contrary, all three KGE models show promising performance (around 77% test accuracy) when trained on 100% true data. This suggests that entity and relation embeddings trained on a Knowledge Graph could potentially be used as some sort of “fake news checker”.
Finally, the models do surprisingly well even when trained on corrupted data. There is only a marginal drop in accuracy when 20% of the training data is corrupted, for both TransE and DistMult. Even more surprisingly, DistMult shows strong performance even at 50% corruption, implying that the DistMult model is more robust (or more “skeptical”) to corruption in the training data. Although we do not know exactly why, we guess that the difference may lie in the different natures of translational and bilinear models. We intend to investigate this result further in future research.
Discussion: Need For a More Difficult Dataset and Evaluation Task
The results – that the models do quite well at distinguishing true and false facts among unseen “facts”, and that the models seem robust to corruption to some degree (up to 20% corruption for TransE and up to 50% corruption for DistMult) – are very promising. However, we identified two factors that potentially confound this result. Specifically, we found that the dataset we used may not have been corrected for inverse relations and that the evaluation task might be too easy. We discuss these two possibilities in more detail in this section.
Inverse Relations
An inverse relation is a potential property of a Knowledge Graph in which semantically identical or very similar relations are represented as more than one edge. If a relation r and a relation r′ are exact opposites, then a relationship (h, r, t) can also be represented as (t, r′, h). As a more concrete example, the (Quentin Tarantino, Directed, Pulp Fiction) triple can also be represented as the (Pulp Fiction, Directed by, Quentin Tarantino) triple (note that we changed the relation notation slightly here for clarity, but this is the same as the running example we have been using so far). Therefore, if we split the data into training and test sets without considering these relations, the model could be tested on an “unseen” fact set consisting of facts that the model actually saw indirectly. Indeed, it has been shown that there is such test set leakage in WN18 and FB15K, two of the most standard benchmark datasets used in Knowledge Graph Embedding before this discovery (Toutanova & Chen, 2015), and that a simple rule-based model that just exploits this leakage can reach state-of-the-art results on these datasets (Dettmers, Minervini, Stenetorp & Riedel, 2018). As the structured SVO corpus did not control for this problem, it is quite likely that there are some inverse relations that the algorithm can exploit (e.g., through a hyponym and hypernym relationship). Thus, we should train the models on a more controlled dataset to make sure that this is not the case.
Need for More “Intelligent” Negative Sampling
Another issue is that the negative sampling used to corrupt the evaluation training/test set might be too simple. Let’s look back at the examples in Figure 10. Most of the examples listed there are ridiculous – a county, a movie, or a city obviously cannot direct a movie. As this example shows, using uniform sampling for entity selection will most likely create ridiculous, obviously false triples, because the sampling does not take into account the plausibility of the resulting triple. For example, when corrupting the head entity in the (Quentin Tarantino, Film Director, Pulp Fiction) triple, switching “Quentin Tarantino” with “David Fincher” and switching “Quentin Tarantino” with “Los Angeles” are equally likely, even though (David Fincher, Film Director, Pulp Fiction) is a much more plausible negative than (Los Angeles, Film Director, Pulp Fiction). What makes this problem even worse is that as a Knowledge Graph gets bigger, the entities and relations stored within it get more diverse, so there will be more entities in the graph that are unrelated to the HRT triple we want to corrupt. Thus, uniform negative sampling will generate mostly highly implausible false triples in larger Knowledge Graphs. This means that when uniform negative sampling is used to make a test set for an evaluation task, the task could be too easy to really discriminate the performance differences between models. In other words, a model that can distinguish sense from nonsense, rather than truth from falsehood, can perform well in the current evaluation task. Because of this, we need to come up with more “intelligent” ways to sample the negatives. Ideally, the probability of sampling should increase as the plausibility of the resulting triple increases (Figure 12).
4. Experiment 2: Results from a More “Intelligent” Evaluation Task with a More Controlled Knowledge Graph
For the second experiment, we try to address the two problems discussed above. Specifically, we follow the same experimental design with a different dataset and use more “intelligent” negative sampling methods in creating the evaluation task training/test set. Note that we only had time to run three models (training a single model takes approximately 10–12 hours) and two of the models did not converge (TransE with corrupted data), so only results from one model (TransE trained on 100% true data with 200 epochs, 250 dimensions, and a 0.0005 learning rate) are presented here.
Dataset: FB15K-237
FB15K-237 (Toutanova & Chen, 2015) is a variant of the FB15K dataset (Bordes et al., 2013), which contains textual mentions of Freebase entity pairs and knowledge base relation triples. Freebase is a large knowledge base that has since been incorporated into Google. The triples contained in FB15K are mostly extracted from Wikipedia and DBpedia. The difference between FB15K-237 and FB15K is that inverse relations are removed in FB15K-237. Because of the vast amount of knowledge represented, and because it is already corrected for inverse relations, FB15K-237 is one of the most standard benchmark datasets for Knowledge Graphs. The dataset contains 310K triples, with 14.5K entities and 237 relation types.
One thing to note about the FB15K-237 dataset is that human-interpretable entity names are not directly included in the dataset. Thus, we queried the entity labels in Wikipedia and DBpedia to obtain the entity names, using code from https://github.com/villmow/datasets_knowledge_embedding. Note that we were still not able to obtain the names for 26 entities; these entities were skipped when sampling in BERT-based negative sampling (see below; this is the only negative sampling method that uses the entity name data).
| head entity | relation | tail entity |
|---|---|---|
| Cornell University | /organization/headquarters./location/mailing_address/citytown | Ithaca |
| Hong Kong Film Award for Best Supporting Actor | /award/award_category/winners./award/award_honor/award_winner | Andy Lau |
| Dilwale Dulhania Le Jayenge | /film/film/production_companies | Yash Raj Films |
| Rensselaer Polytechnic Institute | /education/educational_institution/major_field_of_study | electrical engineering |
| David Healy | /soccer/football_player/current_team./sports/sports_team_roster/team | Bury F.C. |
More “Intelligent” Negative Sampling Methods
To test more “intelligent” negative sampling, we use one method suggested by other researchers that could potentially achieve this goal and test two new methods of our own. As mentioned before, the focus of the change is on the second step (choosing what entity should be substituted in to create the false triple), so we continue to use Bernoulli negative sampling for the first step (choosing whether to corrupt the head or the tail entity). We also test an evaluation task training/test set created using plain Bernoulli negative sampling as a baseline.
Positional Negative Sampling
The first “intelligent” negative sampling we use is the positional negative sampling suggested by Socher, Chen, Manning, and Ng (2013). This negative sampling technique still uses uniform probability when sampling from a set of entities, but the candidate set consists only of entities that appeared in the same position and with the same relation as the entity to be corrupted. For example, when corrupting “Quentin Tarantino” in the (Quentin Tarantino, Film Director, Pulp Fiction) triple, we sample an entity that appeared in the head entity position with the “Film Director” relation, such as “David Fincher” from the (David Fincher, Film Director, Seven) triple. This naturally lets us sample only plausible alternatives, since the sampled entity was connected to another entity by the same relation in the same position. Note that in the original paper, Socher et al. (2013) only conditioned on the position of the entity, not also on the relation. However, this was because the Knowledge Graph dataset they were using had a fixed type of entity for each position, so conditioning on the position alone already created plausible negatives. Here, we follow TorchKGE’s approach, where we also condition on the relation.
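Here is a minimal sketch of this idea in the relation-conditioned variant described above (our own reading of the approach, not library code):

```python
import random
from collections import defaultdict

def build_positional_pools(triples):
    """Collect the entities seen in each (relation, position) slot."""
    heads_for = defaultdict(set)
    tails_for = defaultdict(set)
    for h, r, t in triples:
        heads_for[r].add(h)
        tails_for[r].add(t)
    return heads_for, tails_for

def positional_corrupt(triple, corrupt_head, heads_for, tails_for):
    """Replace the head or tail with an entity seen in the same slot for the same relation.

    (A real implementation would need a fallback for relations whose pool is empty
    after removing the original entity.)
    """
    h, r, t = triple
    if corrupt_head:
        pool = list(heads_for[r] - {h})
        return (random.choice(pool), r, t)
    pool = list(tails_for[r] - {t})
    return (h, r, random.choice(pool))
```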
BERT-based negative sampling
The second negative sampling we use is BERT-based sampling, which we propose ourselves. The idea behind this sampling is that we want to sample entities that are semantically similar to the entity that will be replaced. To do this, we tokenize the entity names with the Bidirectional Encoder Representations from Transformers (BERT; Devlin, Chang, Lee & Toutanova, 2018) tokenizer, feed the tokens to the BERT model (the last hidden layer of BERT-large without any fine-tuning), and average the token vectors to get a BERT embedding for the entity. Then, we calculate the cosine similarity between every entity and the entity to be replaced. The sampling probability of each entity is proportional to its cosine similarity with the entity to be replaced. In this way, we increase the probability of sampling entities that are closer in the BERT space.
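Here is a minimal sketch of this procedure using the transformers package (clipping negative similarities to zero before normalizing is an assumption we make for the sketch):

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
bert = AutoModel.from_pretrained("bert-large-uncased")

def bert_embedding(name):
    """Average the last-hidden-layer token vectors of an entity name (no fine-tuning)."""
    inputs = tokenizer(name, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state  # shape (1, n_tokens, 1024)
    return hidden.mean(dim=1).squeeze(0).numpy()

def bert_negative_sample(entity, all_entities, embeddings):
    """Sample a replacement entity with probability proportional to cosine similarity."""
    target = embeddings[entity]
    candidates = [e for e in all_entities if e != entity]
    sims = np.array([
        np.dot(target, embeddings[e])
        / (np.linalg.norm(target) * np.linalg.norm(embeddings[e]))
        for e in candidates
    ])
    sims = np.clip(sims, 0, None)   # keep the sampling weights non-negative
    probs = sims / sims.sum()
    return np.random.choice(candidates, p=probs)
```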
Random Walk-Based Negative Sampling
The final negative sampling we use is random walk-based negative sampling. This strategy simply walks the Knowledge Graph for a pre-determined number of steps (we use 2 here), starting from the entity to be replaced, and uses the node that it arrives at as the entity to swap in. One thing to note is that we ignore the directionality of the edges when doing the random walk (i.e., we treat the Knowledge Graph as an undirected graph with parallel edges). This sampling strategy does not guarantee that the sampled entity is semantically similar to the entity to be substituted out, but it does ensure that the sampled entity is close to that entity in the Knowledge Graph (if the designated walk length is small).
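Here is a minimal sketch of the two-step walk, treating the graph as undirected as described (names are illustrative):

```python
import random
from collections import defaultdict

def build_undirected_adjacency(triples):
    """Neighbors of each entity, ignoring edge direction (parallel edges kept)."""
    neighbors = defaultdict(list)
    for h, _, t in triples:
        neighbors[h].append(t)
        neighbors[t].append(h)
    return neighbors

def random_walk_sample(entity, neighbors, walk_length=2):
    """Walk `walk_length` steps from the entity and return the node we end up at."""
    current = entity
    for _ in range(walk_length):
        current = random.choice(neighbors[current])
    return current
```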
Results
The results from Experiment 2 are presented in Table 2. First, the performance of the logistic regression decreases with the new dataset, implying that this dataset is indeed more difficult. It is unlikely that the removal of inverse relations is the sole factor in this decreased performance, as the two Knowledge Graphs are qualitatively different, but the removal most likely accounts for some portion of it. Secondly, the more “intelligent” evaluation tasks all show decreased training and test accuracy compared to naive Bernoulli negative sampling, as expected. Very interestingly, when tested on the evaluation task using positional negative sampling, the model shows only marginally lower performance than on the baseline evaluation task using Bernoulli negative sampling. On the contrary, the two other negative sampling tasks we tested show a severe decrease in performance, with the evaluation task using BERT-based negative sampling showing almost chance-level accuracy and the evaluation task using the random walk showing only 54.82% training accuracy. This was counterintuitive for us, since we both judged positional negative sampling to produce the most authentic corrupted triples. This implies that there is some complex nuance in how models distinguish true and false, perhaps difficult for humans to interpret. This could be a very interesting direction to look into further.
Setting aside why some intelligent sampling methods have a larger effect than others, the fact that evaluation tasks using more “intelligent” negative sampling show a reduction in training and test accuracy implies that the naive evaluation task might make the model just learn the decision boundary between sense and nonsense, while the smarter tasks force the model to learn a boundary between truth and more plausible falsehood. We would like to build further in this direction in future studies.
5. Summary and Takeaways of the Experiment Results
Before we conclude with how we can build upon these results to explore this space more deeply, and how these results can be insightful beyond the algorithmic world, let’s summarize what we have learned so far and some takeaways from the results.
First, we showed that Knowledge Graph Embeddings are able to learn useful features for truth classification of unseen facts, at least in naive evaluation tasks (even for a more difficult dataset). This is a natural extension of their established utility in link prediction/question answering. Furthermore, it was shown that this does not change much even if we corrupt the data to “fool” the model.
Second, compared to TransE, DistMult seems more robust to corrupted data on the naive evaluation task. Even when trained with 50% corrupted data, the model shows 74% test accuracy on the evaluation task. It will be interesting to see what drives this difference and to investigate whether this generalizes to other translational and bilinear models.
Third, we show that model performance drops significantly for harder evaluation tasks that use more “intelligent” negative sampling. This implies the model is able to distinguish sense from nonsense but might not be fully learning the boundary between truth and falsehood. It will be interesting to see if we can build models that perform well even on these harder tasks. Also, it will be interesting to see where the differences in performance between the harder evaluation tasks arise.
6. Future Directions
Needless to say, there are a variety of directions in which this project can be expanded. We list a couple of them in this section.
Experimenting with More Models and Datasets
In the experiments, the Knowledge Graph Embedding algorithms we used were limited to TransE, DistMult, and HolE (only TransE for the second experiment). An immediate next step would be to run DistMult and HolE on the more difficult evaluation task with varying degrees of data corruption. Other algorithms, especially ones that are neither translational nor bilinear, like ConvE (a convolutional embedding method; Dettmers, Minervini, Stenetorp & Riedel, 2018), would also be promising. Furthermore, we can see whether these results generalize to other benchmark datasets. WN18RR, which is one of the most conventional benchmark datasets along with FB15K-237, is the obvious next step, but it would be interesting to test other well-known datasets too.
Tuning Hyperparameters to Improve Performance
As with all other deep learning models, there are a lot of hyperparameters to tune to improve performance. It would be interesting to vary hyperparameters like the learning rate, choice of the optimizer, as well as the loss function (perhaps a probabilistic loss like KL divergence) to see if performance improves. By identifying what parameters work best for each model and evaluation task, we could perhaps get a more insightful understanding of how these procedures affect the models.
Improving the Design of the Evaluation Task
While these experiments are able to validate our intuitions regarding the evaluation task, they also point to the fact that the task itself can be improved. Exploring other sampling methods for generating the evaluation set (for the logistic regression) is one way of doing this. Another way is to try other binary classifiers. We note that we also tested Ridge Regression and a Multi-Layer Perceptron for the task, but they showed less interesting results and are thus not reported in this post (Ridge Regression mostly converged to majority-class prediction, showing chance accuracy, and the Multi-Layer Perceptron showed severe overfitting).
Expanding the BERT-based Negative Sampling
One thing to note about the BERT-based negative sampling is that the approach described here is not the only way to get a BERT embedding. We use BERT-large without any fine-tuning via the transformers package, but we could use other BERT variants (e.g., BERT-base) and/or fine-tune the model to get different embeddings. Furthermore, we could use different hidden layers to extract the embedding, or use only the [CLS] token for the entity without averaging. We could also use less complicated models like Word2Vec or GloVe to calculate the semantic similarity. Finally, we could use an external data source, such as the Wikipedia entry for the entity, to derive other types of similarity (e.g., use the images in the Wikipedia entry with a ResNet model to get image similarity instead of semantic similarity).
Using More Intelligent Negative Sampling in Different Parts of the Experiment
As mentioned in a previous section, negative sampling is used in three distinct parts of the current pipeline: (1) finding negative examples for the optimizer to treat as false during the training of the Knowledge Graph Embedding, (2) corrupting the training set that is input into the model, and (3) corrupting the training set and test set for the logistic regression. So far, we have only tested the use of more intelligent sampling on (3). It will be interesting to see how performance changes as we apply the more intelligent negative sampling methods to (1) and (2).
Other Open Questions
Lastly, we note that there are more open questions we want to address but have not yet come up with a concrete direction for. These questions include:
- Is the choice of negative sampling a calibration of an algorithm’s skepticism?
- Can we devise a smarter way to learn the most plausible negative samples during training?
- How can we more conclusively describe the effect of corrupted training data?
- What is the quantitative difference between learning that ridiculous triples are false, and learning that somewhat implausible ones are?
7. Social Science Implications
Finally, let’s take a step back and consider how our results can be insightful outside the algorithmic world. So far, we’ve been deep in the weeds of the algorithms. We have interrogated how statistical and mathematical formalisms can learn representations of abstract concepts. We have attempted to trick these representations with false facts and to test their ability to tell false facts apart from true ones. The obvious limitation of this approach is the constraint on the data in terms of heterogeneity (wider, truly heterogeneous datasets encompassing facts from diverse domains are near impossible to collate and do not exist in the wild).
Even so, the idea that knowledge can have representations in space aligns with both metaphorical and embodied ideas of human information processing and storage. Human champions of memory, such as mnemonists who commit long lists of items to memory, frequently report visualizing journeys through imaginary realms with visual markers triggering their recall. More biologically speaking, we know that concepts like memory, muscle memory, and skill-building are frequently associated with the triggering of repetitive neural pathways. These are 3-dimensional paths in a 3-dimensional brain, implying that the organization of neural paths in space plays a role in cognition and memory.
Taking this metaphor forward, our results suggest that algorithms, while lacking the physical embodiment of the brain, are able to use the spatial metaphor not just to store the meanings and associations of entities and relationships (as well-validated by the numerous applications of Knowledge Graph Embeddings); but also to make inferences on unseen information, much like a human being may be able to judge new information as sensible or insensible, plausible or implausible, as well as true or false.
Future work may be able to use this paradigm to build an intuition for comparing the skepticisms of the algorithm and the human. In what cases is a human being more skeptical than an algorithm? What types of skepticism are observed in the machine? Can a more skeptical algorithm be of assistance to a less skeptical human being (whom some might call gullible)? Can a less skeptical algorithm be of assistance to a more skeptical one? In the long term, this also raises fascinating possibilities in the realm of machine-in-the-loop algorithmic fact-checking.
8. Code Availability
One of the end goals of this project is to wrap the methods and data into a neat, easily usable package. We have built, and will continue to build, the package weboftruth after the course. All the code is open-source, and we welcome any contributions!
To try it out, just pip install it: `pip install git+https://github.com/bakerwho/weboftruth`
9. Acknowledgments
Although Nak Won came on this ride starting this quarter, Aabir had been playing with the project for a while and received bits of help and feedback from many truly helpful people. In addition to everybody involved in the course, we would like to thank Adarsh Mathew, Regina Catipon, Prof. Amitabh Chaudhary, Elena Orlova, Dennis Zheng, Prof. Karen Livescu, Freda Shi, Prof. Kevin Gimpel, and Prof. Chenhao Tan.
Finally, the weboftruth package was built upon the awesome TorchKGE package. The transformers package let us deal with BERT with ease. The FB15K-237 dataset was found at https://github.com/simonepri/datasets-knowledge-embedding/ and we used code from https://github.com/villmow/datasets_knowledge_embedding to extract the entity names for FB15K-237. All visualizations in this post were created using Google Slides and Photopea with Nak Won’s slightly out-of-fashion design sense. The base for the feature image is from https://pixabay.com/illustrations/network-technology-digital-4556932/ – and the last figure is from https://pixabay.com/illustrations/artificial-intelligence-brain-think-4389372/. We thank the owners of the images for releasing them without strict copyright!