Summary and Keywords
Hearers and readers make inferences on the basis of what they hear or read. These inferences are partly determined by the linguistic form that the writer or speaker chooses to give to her utterance. The inferences can be about the state of the world that the speaker or writer wants the hearer or reader to conclude is pertinent, or they can be about the attitude of the speaker or writer vis-à-vis this state of affairs. The focus here is on inferences of the first type. Research in semantics and pragmatics has isolated a number of linguistic phenomena that make specific contributions to the process of inference. Broadly, entailments of asserted material, presuppositions (e.g., factive constructions), and invited inferences (especially scalar implicatures) can be distinguished.
While we make these inferences all the time, they have been studied piecemeal only in theoretical linguistics. When attempts are made to build natural language understanding systems, the need for a more systematic and wholesale approach to the problem is felt. Some of the approaches developed in Natural Language Processing are based on linguistic insights, whereas others use methods that do not require (full) semantic analysis.
In this article, I give an overview of the main linguistic issues and of a variety of computational approaches, especially those stimulated by the Recognizing Textual Entailment (RTE) challenges first proposed in 2004.
We acquire much of our knowledge through linguistic input, be it written or spoken. When we read this example:
we update, if necessary, without much conscious reflection, our knowledge about cheetahs, North America, and Africa; we now know that there were cheetahs in North America 100,000 years ago, that there have been cheetahs in Africa in the period after 100,000 years ago, etc. We know this based on our understanding of the words and constructions used in the sentence. We have made one or more textual inferences.
A more complex example is the following: a friend tells us:
From this we can conclude that the Front National did not have control over any French regions before the elections, that they did not gain control over any in the elections, that this result was obtained because of some actions of the Socialists and the traditional Right, that it was difficult for them to achieve this result, and that the Front National was on the cusp of gaining this control. We also learn that our friend judges this outcome to be a good thing and has a bad opinion of the Front National. It is clear that the choice of the lexical items and their frames of usage plays an important role in making these deductions available: if X prevents Y from doing something, the eventuality referred to in the complement clause did not come about; if X manages to do something, the eventuality referred to in the complement clause did occur. More subtly, the use of manage suggests that it was difficult for the entity referred to with the subject of the clause to ensure that the situation its complement describes would obtain, and the use of prevent suggests that without the action of the entities referred to in the subject of that clause, the situation in its complement would have occurred.
Before going deeper into how these inferences arise, let us pause for a moment to consider what aspect of reality these inferences are about. Reading the first example in the newspaper, we would readily update our state of knowledge in the way described. But, this is the case because we take our daily paper to be a reliable source for this kind of information. This is a very general characteristic of textual inferencing that is sometimes overlooked in the discussion: texts are created by language users, and they encode the attitude of these users vis-à-vis the eventualities that they describe. A text as such cannot tell us whether an eventuality is factual or not. Such a judgment will always depend on an evaluation of the reliability of the source of the text. For that reason, it is better to speak of the ‘veridicity’ and not the ‘factuality’ of the eventualities referred to in text. The textual inferences will always be veridicity inferences: inferences about what the producer of the text took responsibility for, what she presented as being (not) true, or (not) possible, not about the state of the world itself.
The importance of the producer of the text, henceforth the writer, is more evident in the second example. The implications of using manage as indicating the difficulty of achieving the result, and of prevent as indicating the likelihood of the occurrence of the situation, are based on evaluations of the writer. The inferences can be about the state of the world (as represented by the writer), whether events happened or entities exist, or about the evaluation that the writer associates with these states of the world, whether these events are desirable, difficult to realize, etc. In this overview, we will mainly consider the former because the term textual inference has come to refer mainly to those. The inferences about the evaluations made by the writer tend to be studied under the heading sentiment analysis. The use of the term textual inference is restricted in another respect as well: it tends to refer to inferences that can be made nearly immediately and that do not require deep thinking or several steps of deductive reasoning.
To summarize, textual inference in this article refers to inferences that are:
• About the veridicity of events.
• Based on the form of the linguistic expression.
• Not requiring several inference steps.
We make these types of inferences all the time, but in theoretical linguistics they have been studied only piecemeal as part of the elucidation of the meaning of linguistic expressions. It is only when attempts are made to build natural language processing systems with this capability that the need for a more systematic and wholesale approach to the problem is felt.
I will first sketch some of the linguistic underpinnings of textual inference and then give an overview of computational approaches. The linguistic discussion will rely on distinctions made in a truth-conditional view of meaning, as that was the most prevalent one when the study of textual inference started to gain importance in natural language processing (NLP), in the late 1990s into the early 2000s.
1. Linguistic Underpinnings
It is not only the reliability of the writer that needs to be taken into consideration: the writer also makes her statement in a particular context. We evaluate the second example above with respect to the French regional elections at the end of 2015. The time of utterance (spoken or written) plays a role; in other cases, the location does as well, for instance in the evaluation of a statement such as It is cold.
On the basis of such considerations, linguistic theory makes a distinction between utterances and propositions. For an utterance to be evaluated with respect to its truth, we need to take into account the conditions under which it is uttered, but it is possible to abstract away from some of these conditions by defining a relation of entailment as holding derivatively between sentences: if X is true, then Y is true, where X and Y are the sentential expressions of propositions. This assumes that we keep a certain number of assumptions constant in evaluating the two sentences. We want to say that Sandy moved the vase entails The vase moved. If the first holds, the second must also hold. To be able to do so, we have to assume that the two sentences are in the same language, that we are talking about the same vase and the same instance (same time, same location) of moving. Making such assumptions is not always warranted in real life, but it is often done in linguistic discussions, and I will adhere to it.
Observations in the semantics and pragmatics literature that pertain to inferences about the veridicity of a state of affairs mainly fall under discussions of entailments, presuppositions, and invited inferences.
When an entailment relation holds between two sentences, the entailed one cannot be negated while the entailing one is maintained without creating a contradiction, as exemplified in (3) and (4) (and as indicated by the # sign):
The term entailment is here reserved for what has also been called at-issue entailment (entailment of regularly asserted content, see Potts, 2005): relations between asserted elements in a text. One of the hallmarks of at-issue entailment is that it is not preserved under negation, as illustrated below in examples (5–8), (9–11) and (13–14).
A particularly rich set of entailments is comprised of those found between sentences that are in monotone, or, more generally, inclusion and exclusion relations. If somebody asserts (5), she has to also accept (6):
But somebody asserting (8) is not bound to hold (7). Conversely, however, (7) entails (8):
Some of these relations have been studied at least since Aristotle. Recently, there has been renewed interest in them, under the heading of Natural Logic (Sánchez Valencia, 1991; van Benthem, 2008), because they are a class of entailments that can be related in rather straightforward ways to surface sentence structure (encoded in a syntactic tree) without requiring that a complete semantic interpretation be calculated. Natural Logic extends the pattern matching on sentences of syllogistic logic by defining partial orderings on all syntactic types.
We can then define order-preserving or order-reversing functions on these orderings. For instance, with is upward monotone (or order-preserving): from X did Y with an axe, we can conclude that X did Y with a tool, as axes are a subclass of tools. For without, the function is order-reversing (or downward monotone): from X did Y without moving, we can conclude that X did Y without dancing, although to dance on its own entails to move.
The generalization over categories was formalized in van Benthem (1995). To give a flavor of how the relation between syntax and inference works, we present here a very compact overview of the system proposed by van Eijck (2007). Functional lexical categories, such as quantifiers, are associated with polarity maps that annotate the categories resulting from function application. For instance, every has the categorial grammar category (S/VP)/CN, the category for quantifiers in general. The following examples illustrate that every is downward monotone in its first argument (CN = common noun) and upward monotone in its second argument (VP = verb phrase):
Example (9) entails (11), but it does not entail (10). Example (10) does entail (9). Van Eijck (2007) accounts for such facts by annotating the category of the quantifier with marker transformers, an i when the order is preserved and an r when it is reversed. The category for every then becomes (Si/VPi)/CNr. The categories for the various quantifiers will now be different from each other; for instance, some will now become (Si/VPi)/CNi. The preposition without, illustrated above, will be annotated with an r on its argument CN: (VPi/VPi)/CNr. Syntactic trees can now be built as usual, but all nodes will receive marker transformers based on those that come from the lexical categories, so when every combines with a common noun, the common noun will receive an r annotation. The marker transformers annotate the upward and downward monotone regions of the tree: we start with a +, an upward monotone sign, at the S level, and then transfer this to lower nodes or switch it to –, depending on the i or r markers we find on the way, ending up indicating for each constituent whether its contribution is upward or downward monotone, or neither. With these markings, proof-theoretical inference rules for the relations illustrated above in (5–8) and (9–11) can be formulated.
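The marking procedure just described can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions and not van Eijck's system: the tree is built by hand rather than by a parser, the names Node, flip, and mark_polarity are hypothetical, the example sentence is invented, and only the markers i and r are modelled (the "neither" case is left out).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    """A tree node; each child is paired with the marker on the arc leading to it."""
    label: str
    # "i" = order-preserving arc, "r" = order-reversing arc
    children: List[Tuple[str, "Node"]] = field(default_factory=list)


def flip(polarity: str) -> str:
    """Switch + to - and - to +."""
    return "-" if polarity == "+" else "+"


def mark_polarity(node: Node, polarity: str = "+",
                  marks: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Walk down from the root, which starts as +, and record each node's polarity."""
    if marks is None:
        marks = {}
    marks[node.label] = polarity  # labels are assumed to be unique in this toy tree
    for marker, child in node.children:
        child_polarity = polarity if marker == "i" else flip(polarity)
        mark_polarity(child, child_polarity, marks)
    return marks


# Invented sentence: "every student sang without dancing".
# "every" puts an r on its common noun and an i on its verb phrase;
# "without" puts an r on its argument.
vp = Node("VP", [("i", Node("sang")), ("r", Node("dancing"))])
sentence = Node("S", [("r", Node("student")), ("i", vp)])

marks = mark_polarity(sentence)
print(marks)
```

In this toy tree, student and dancing end up marked as downward monotone and sang as upward monotone, which is the information the inference rules need in order to decide whether a more specific or a more general expression may be substituted.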
Much work on linguistic entailment has concentrated on the behavior of a few lexical classes such as quantifiers, and phenomena such as class inclusions, but as pointed out in Karttunen (1971), the use of other lexical items also leads to entailments. Implicatives are a rich class of verbs and verbal expressions taking sentential or VP-complements that are in an entailment relation with their complements. We illustrated this in (2) with the use of manage and prevent: with manage we can conclude that the eventuality referred to in the embedded VP happened, whereas with prevent, we have to conclude that it did not happen. Depending on their intrinsic meaning and their behavior with respect to negation, six classes of implicatives can be distinguished. For instance, when one negates manage, as in (12):
we can conclude that the job did not get done, but with force this conclusion would not be warranted:
Whereas (13) entails that Lee left, (14) does not entail that Lee didn’t leave.
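The contrast between manage, force, and prevent can be made concrete with a small lookup table. The sketch below encodes only what the text states about these three verbs; the table name SIGNATURES and the function complement_polarity are hypothetical, and the behavior of prevent under negation is deliberately left unspecified because it is not discussed here.

```python
from typing import Dict, Optional

# Polarity of the embedding clause -> polarity entailed for the complement.
# A missing key means that no entailment about the complement follows.
SIGNATURES: Dict[str, Dict[str, str]] = {
    "manage": {"+": "+", "-": "-"},   # managed to X: X happened; didn't manage to X: X didn't
    "force": {"+": "+"},              # forced Y to X: X happened; negated: nothing follows
    "prevent": {"+": "-"},            # prevented Y from X-ing: X did not happen
}


def complement_polarity(verb: str, negated: bool) -> Optional[str]:
    """Return '+', '-', or None (no entailment) for the complement of `verb`."""
    context = "-" if negated else "+"
    return SIGNATURES.get(verb, {}).get(context)


print(complement_polarity("manage", negated=True))   # complement did not happen
print(complement_polarity("force", negated=True))    # None: no entailment
```

A fuller treatment would need an entry for each of the six classes of implicatives and a way to compose signatures when implicatives are stacked.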
The work on monotonicity described above did not look at these relations, but subsequent work by MacCartney (2009) extended the set of semantic relations to include both containment and exclusion relations, ending up with equivalence (e.g., couch ≡ sofa); forward entailment (e.g., canary ⊏ bird); reverse entailment (e.g., Asian ⊐ Japanese); negation (e.g., human ^ non-human); alternation (e.g., canary | walrus), more formally: the intersection between x and y is empty and the union of both does not include everything; cover (e.g., animal ⌣ non-human), more formally: the intersection between the two is not empty, and the union is the universal class; and independence (all other cases). The MacCartney relations were formalized in Icard (2012) (see also Moss, 2015) and cover both the classical monotonicity relations and the implicatives. (See section 2.1, “The Recognizing Textual Entailment Challenges (RTE),” for an example of how these relations can be used in an inference system.)
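When the denotations of two terms are modelled as sets over a finite universe, the seven relations can be told apart with a few set operations. The sketch below is an illustration under that assumption; the function name basic_relation and the toy universe are hypothetical, and negation is taken to mean that the two sets are disjoint and jointly exhaust the universe.

```python
from typing import Set


def basic_relation(x: Set[str], y: Set[str], universe: Set[str]) -> str:
    """Classify the relation between two denotations within a given universe."""
    if x == y:
        return "equivalence"
    if x < y:                       # proper subset
        return "forward entailment"
    if x > y:                       # proper superset
        return "reverse entailment"
    disjoint = not (x & y)
    exhaustive = (x | y) == universe
    if disjoint and exhaustive:
        return "negation"
    if disjoint:
        return "alternation"
    if exhaustive:
        return "cover"
    return "independence"


# Toy universe with invented individuals.
universe = {"tweety", "polly", "wally", "kim", "stone"}
canary = {"tweety"}
bird = {"tweety", "polly"}
walrus = {"wally"}
human = {"kim"}
non_human = universe - human
animal = {"tweety", "polly", "wally", "kim"}

print(basic_relation(canary, bird, universe))       # forward entailment
print(basic_relation(canary, walrus, universe))     # alternation
print(basic_relation(human, non_human, universe))   # negation
print(basic_relation(animal, non_human, universe))  # cover
```

The order of the tests matters: the containment checks come first, so that the exclusion-based relations are only considered for pairs in which neither set includes the other.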
1.1.3. Sundry Lexical Entailment and Decomposition
Example (5) illustrates a type of entailment that comes about through the regular relation between two meanings of a class of verbs such as move, break, sink. These regular entailment relations can be made explicit through (partial) meaning representations that decompose the meaning of the lexical items as a conjunction of a limited set of predicates. For example, the relation between the two meanings of move is one of causation. The relation can be represented as CAUSE(x, MOVE(y)) → MOVE(y). For pairs like buy and sell, an equivalence relation can be established: BUY(x,z,y) ≡ SELL(y,z,x).
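The two decomposition rules just given can be applied mechanically to meaning representations written as nested tuples. This is a minimal sketch: the tuple encoding, the function name lexical_entailments, and the role names buyer, goods, and seller for the three arguments are assumptions made for illustration only.

```python
from typing import Set, Tuple

Term = Tuple  # a predicate name followed by its arguments; arguments may be terms


def lexical_entailments(term: Term) -> Set[Term]:
    """Return the terms that follow from `term` by the two rules in the text."""
    predicate = term[0]
    results: Set[Term] = set()
    if predicate == "CAUSE":
        # CAUSE(x, MOVE(y)) -> MOVE(y): the caused eventuality itself holds
        _, _causer, effect = term
        results.add(effect)
    elif predicate == "BUY":
        # BUY(x, z, y) is equivalent to SELL(y, z, x)
        _, buyer, goods, seller = term
        results.add(("SELL", seller, goods, buyer))
    elif predicate == "SELL":
        _, seller, goods, buyer = term
        results.add(("BUY", buyer, goods, seller))
    return results


print(lexical_entailments(("CAUSE", "sandy", ("MOVE", "vase"))))
print(lexical_entailments(("BUY", "kim", "vase", "lee")))
```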
Many decomposition schemas have been developed in discussions about lexical meaning but, in fact, only a few have been used to model entailments. Most research has focused on a few predicates or on general schemes with a few illustrations. Exceptions are the works of Fillmore and his collaborators, who spearheaded the work on lexical semantics (Fillmore, 2002; Levin, 1993). Fillmore’s work led to an organized resource that explicitly encodes many of these lexical entailments (and some other types of inferences) in FrameNet (Fontenelle, 2003). Levin’s work led to VerbNet.
1.2. Presuppositions and/or Conventional Implicatures
Not all the inferences are based on at-issue entailments in the sense just illustrated. Several constructions signal that a writer takes the material expressed in them for granted instead of asserting it. For instance, in (15), the cleft construction informs the reader of the identity of the culprit but assumes that she already knows that somebody broke the vase.
We will refer to such aspects of meaning as presuppositions. This term and conventional implicature are used in the linguistic literature for a rather heterogeneous set of phenomena, and both the terminology and the theory underlying it are, at this point, very much in flux. Karttunen and Peters (1979) identified a subset of presuppositions as conventional implicatures, leading to subsequent confusion. A good discussion of a possible way to distinguish the two notions can be found in Potts (2007). Beaver and Geurts (2014) provide a good recent overview of the various conceptions of what is called presupposition (see also Potts, 2015). We use the term presupposition here for phenomena that have the characteristic that the writer’s commitment to the truth of a proposition persists under negation and questioning. Whether one says (16) or (17):
the commitment to the fact that somebody stole the vase remains. As we are mainly interested in inferences about the state of the world, we ignore presuppositions that betray the attitude of the writer vis-à-vis the state of affairs described.
Given that the status of the embedded clause is presupposed in these constructions, they do not, in their standard use, lead to new inferences. In fact, they are often used to impart new information to the reader when the writer does not know the exact information state of the reader, as might be the case in (18) (from Potts, 2005):
Appositives and clefts are two examples of a large set of clause structures that signal that the writer does not consider their content as being at issue. Others are most temporal clauses (the exception being some before and until clauses), parentheticals, non-restrictive relative clauses, and as-clauses (see Potts, 2005, for discussion).
Specific lexical items can also give rise to presuppositions. With a sentence such as (19):
the speaker doesn’t want to convey that the earth is round but rather that Kim knows that fact. The fact itself is already taken for granted. Since Kiparsky and Kiparsky (1970), predicates with this behavior have been dubbed factives. Examples are know and regret. They, too, can be used to impart new information to the reader as in the following example.
However, not all uses of verbs that have been classified as factive do commit the writer to the truth of the embedded clause. For most factive verbs, it has been shown that, in third as well as in first and second person statements, and for all the criteria for presuppositional behavior given above, one can find exceptions. Beaver (2010) gives extensive examples of exceptions. Karttunen (2011) documents that the factive or non-factive interpretation of lucky depends on the context, as in the non-factive:
and Karttunen, Peters, Zaenen, and Condoravdi (2014) show that some speakers give non-factive interpretations of adjectival constructions that are generally considered to be factive. In recent literature (Simons, Beaver, Roberts, & Tonhauser, 2016), the classification of lexical items as factive or not has come under attack. Further research is necessary to isolate the factors that distinguish the factive and non-factive uses.
1.3. Invited Inferences, aka Generalized Conversational Implicatures or Pragmatic Inferences
Entailments and presuppositions are commitments that the writer cannot deny without being inconsistent, but linguistic expressions often lead us to draw conclusions that a writer can deny without contradicting herself. If I tell you:
you might be inclined to ask me what I thought about it. But I could, without contradiction, go on with this:
When somebody tells us:
we tend to conclude that not all of them are. But again that can be denied:
Here too, there is variation in the terminology used to describe these inferences. A useful distinction is that between generalized and particular pragmatic inferences. Particular pragmatic inferences depend on the specific context to get their inferential force. Grice (1975), for instance, gives a famous example of praising a candidate’s handwriting in a letter of recommendation. Particular pragmatic inferences have not, to my knowledge, been a focus of discussions about textual inference, because they assume too much analysis of the non-textual context.
A rather well-studied subset of generalized pragmatic inferences goes under the heading of Scalar Implicatures because the invited inference relies on the fact that the writer chooses to characterize a state of affairs with a term that is part of a scale and, by choosing a weaker term, implicates that a stronger term would not be appropriate although, logically, the use of the weaker term does not rule out a state of affairs consistent with the stronger characterization. We see this at work in example (24). Strictly speaking, it is true when all the students in the class are bright, but in this case, given that one could say:
the fact that one says (24) leads the reader to conclude that (26) would not be warranted, hence that not all the students in the class are bright.
Similar reasoning can be used to explain why characterizing something as warm implicates that it is not hot, why saying that you have four children implicates that you don’t have five, and why something characterized as possible is not understood as necessary. In all these cases the scalar implicature can be cancelled as, for instance, in (27):
The explanation for these inferences in general invokes Gricean principles (Horn, 2006), exploiting the rules that govern collaboration between writers/speakers and readers/hearers. These are also invoked for the inferences that occur when writers use circumlocutions when more direct characterizations are available: saying that something is not impossible betrays less confidence in its being true than saying that it is possible, John’s father’s wife is most likely not John’s mother, etc.
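The scalar reasoning described in this section can be sketched as a lookup over ordered scales. The code below is a toy illustration, not a model of Gricean reasoning: the scales are a small hand-picked sample built from the pairs mentioned above (with some/all assumed as the quantifier scale), and the names SCALES and invited_inferences are hypothetical. The affirmed parameter stands for stronger terms that the writer explicitly goes on to affirm, which cancels the invited inference.

```python
from typing import FrozenSet, List

# Scales ordered from weaker to stronger term (illustrative, far from complete).
SCALES: List[List[str]] = [
    ["some", "all"],
    ["warm", "hot"],
    ["possible", "necessary"],
]


def invited_inferences(term: str,
                       affirmed: FrozenSet[str] = frozenset()) -> List[str]:
    """Return the defeasible 'not <stronger term>' inferences invited by `term`."""
    for scale in SCALES:
        if term in scale:
            stronger = scale[scale.index(term) + 1:]
            # An explicitly affirmed stronger term cancels its implicature.
            return [f"not {s}" for s in stronger if s not in affirmed]
    return []


print(invited_inferences("some"))                      # ['not all']
print(invited_inferences("some", frozenset({"all"})))  # []  (implicature cancelled)
```

Because the result can be emptied by later context, these inferences have to be stored as defeasible, unlike the entailments and presuppositions discussed earlier.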
1.4. Modality, Verbs of Communication, of Propositional Attitude, and Presupposition Projection
In the preceding sections, we have assumed that the writer intends to present a state of the world not mediated by the views of other observers. In fact, writers often will not take responsibility for the situations described. They can express this through the use of a modal expression or through the attribution of the description to somebody else.
Modalities are expressed in a number of ways in natural language: verbal mood and tense, conditionals, adverbial expressions such as possibly, necessarily, embedding predicates such as I think, etc. Here we are mainly interested in the expressions of epistemic modality, indications of the degree of certainty with which a writer presents her views. These degrees of certainty have mainly been modeled in formal linguistics through possible world semantics (Kratzer, 1991). In this model, the modals represent different distance relations between the world described in the modal sentence and the world that is assumed to be actual. Lately, a competing view based on probabilities has been proposed (Lassiter, forthcoming). This model fits better with computational approaches to natural language, but under neither model is there a comprehensive map available of the force of the various modal expressions and their combinations.
By using a verb of communication (say, inform, suggest, etc.), a verb of propositional attitude (think, believe, etc.), or a performative verb (apologize, forbid, inform, promise, request, thank, etc.), a writer (henceforth, the primary writer) delegates the responsibility for the truth of the state of affairs described to another person to whom an utterance or a state of mind is attributed (henceforth, the secondary writer), as illustrated in (28):
We are again only interested in expressions of epistemic uncertainty, not in the many other types of judgments that can be communicated through the uses of these verbs. With say or think, the writer does not take responsibility for Mary’s leaving, but only for Bill’s saying or thinking so.
Verbs of propositional attitude fall into two categories with respect to the commitment of the primary writer that they encode. Words like believe, think, guess, etc., leave the responsibility for the described state-of-affairs to the secondary writer and do not betray any commitment of the primary writer whereas verbs like know and discover are factives (see section 1.2, “Presuppositions and/or Conventional Implicatures”).
The behavior of verbs of communication with respect to the veridicity status of the reported state of events is more complicated. Most are clearly attributing the commitment to the secondary writer (e.g., say, report, claim, add, announce), but others signal some primary speaker commitment (e.g., acknowledge, admit, confirm). Some researchers (Anand & Hacquard, 2014), however, have pointed out that these verbs behave differently from factives. Observe, for instance, the contrast between (29) and (30):
The commitment to the clause embedded under accept can be limited to a particular time or place, but that of know cannot.
Verbs of saying also differ in the degree of engagement that the primary writer assigns to the secondary writer. Compare (adapted from Martin & White, 2005):
These nuances can play a role when the reliability of the secondary writer as a source is evaluated, but, like the evaluation of the primary writer as a source, this evaluation will rely on mainly non-linguistic factors.
While issues related to secondary speaker commitment have been discussed in the literature, there is no systematic treatment of them available. A corpus annotated with native speaker judgments such as FactBank could form the basis for such a treatment, but FactBank is too small and too much centered on one genre (newswire text) to give a good view of what occurs in text in general.
Another problem that arises in embedded contexts is whether the presuppositions are those of the primary writer, of the secondary one, or of both, whether they ‘project’ or not. Consider again the sentence in (15), but now embedded under a verb of propositional attitude:
Does this sentence commit the writer to the view that somebody stole the vase? Does it commit her to the view that John thinks that somebody stole the vase? In technical terms, due to Langendoen and Savin (1971), under which operators do the presuppositions of the embedded clause (in casu, that somebody stole the vase) project upwards, and up to which level?
There is a substantial literature on this issue, but unfortunately, it is inconclusive. It is not certain that all phenomena we have called presuppositions behave the same way, and the literature tends not to distinguish those that lead to veridicity commitments from others. Potts (2005) treats the cases that he considers to be conventional implicatures as primary writer commitments, but that view is not shared by everybody (Potts, 2015). The more traditional view is that the embedded presuppositions of verbs of propositional attitude and verbs of communication do not project. Beaver and Geurts (2014) observe: “for nearly four decades, the Holy Grail of presupposition research has been to explain the behavior of presuppositional expressions occurring in embedded positions. Given that the theoretically most challenging mode of embedding is within the scope of an attitude verb, one might expect that the interaction between presuppositions and attitude verbs should have received a lot of attention. But it hasn't.”
Performative utterances (Austin, 1962; Searle, 1989) are performances of the act named by the performative verb when the verb is used in the first person and in the present. In the third person, or in the past tense, these verbs are used to report the occurrence of such an utterance and behave like verbs of communication. If somebody utters (34), she has made a promise. Example (35) only reports that such a promise has been made.
But here again, in some cases, the writer assumes responsibility for the description of the state of affairs. On the basis of the following two statements, the reader would be entitled to conclude, respectively, that the ship now has the name The Daydream and that John and Mary are married.
The preceding subsections give an overview of the sentential semantic/pragmatic phenomena that are central to textual inference. But when we make such inferences, we take broader linguistic context into account. This context is often important for deciding whether a discourse entity is presented as existing or not. Context is also needed to interpret elliptical expressions. Here we consider these decisions as prerequisites for the inferencing proper and leave them out of the discussion, but this is a rather arbitrary decision. We refer to discussions in dynamic semantics (Heim, 1982; Kamp, 2011) for semantic approaches to them.
The fact that formal semanticists have a theory of how they want to represent the meaning of natural language text doesn’t imply that they have worked out complete concrete representations for most aspects of such texts. Most work on formal semantics is very fragmented, providing detailed formalizations of individual phenomena that are not easily integrated into an overall representation. One of the exceptions is the Discourse Representation Theory (DRT) framework, which aims at an articulated, broad coverage representation of semantic and pragmatic phenomena (Geurts, Beaver, & Maier, 2016; van der Sandt, 1992). DRT is richer than the standard Montague semantics. It is dynamic and uses neo-Davidsonian (originally Davidsonian) representations, including presupposition projection and binding of discourse referents. Integrated semantic representations are also found in implementations of linguistic theories such as LFG, HPSG, and CCG.
Even if one had a system that creates complete semantic representations for texts, it would not give us all the inferences. Inferences are in the mind of the reader and do not necessarily take a linguistic form themselves. The mind combines the inferences made on the basis of the form of linguistic expressions with those that come from other sources. Thus, to represent inferential knowledge, one would not only need the representation of the linguistic part but also a representation for the other types of knowledge and a way to combine them. Psychological research studies how human minds do this and Artificial Intelligence (AI) tries to build computational systems that achieve similar results. Earlier work in AI did not focus explicitly on the textual basis of knowledge and the inferencing, but lately the development of NLP has led to a framing of the inference task as linked to the processing of natural language. One can view the capability to make inferences as the ultimate test of a natural language understanding system.
2. Computational Approaches
In computational approaches, inferences need to be embodied in a form that the computational tools can access. Currently, this is most often a linguistic form. Thus computational linguistics does see inference as a relation between sentences, or more generally, textual units. The study of this relation has been undertaken from two different perspectives that have come closer over the years.
Entailment relations as described above have mainly been formalized in semantics in first order logic. This suggests a straightforward computational approach: add the implementation of this semantic formalization and of the inference types described above to a syntactic parser to create a system capable of making inferences. An attempt to provide such a computational environment was made in the FraCaS project, which allowed the user to develop semantic representations according to various semantic theories current in the mid-1990s. The project also provided a list of the type of inferences it thought the system should be able to make, but the inferencing component was never completed. The system partially implemented in the FraCaS project can work when all the information needed to make the desired inference is provided in the premise. In practice, however, conclusions are often phrased quite differently from the premises and tend to include assumptions that are not made explicit in the text.
While one community, which I will call the computational semantics community, was implementing semantic theory for inference, another community, which I will call the Natural Language Processing (NLP) community, looked at the problem in a different way. The computational semantics community typically concentrates on the difficult problems, assuming that the simple ones will be solved in the process; the NLP community tends to start with what can be done and to build up from there, solving problems one by one, hoping that eventually the task will be completed. Moreover, that community tends to be happy with a solution that works in most of the cases without requiring complete accuracy. Looking at the problem from the angle of techniques used in information retrieval, it started by assuming that a sentence is a bag of words without any particular structure beyond, maybe, part of speech annotation. For some examples given above, such as (5) and (6) and some FraCaS examples, this can lead to a viable strategy. Consider, for instance, the following example (adapted from the first RTE task):
As a linguist, one analyzes this inference as relying on the presuppositional nature of appositives. But without doing any deep analysis, one notices that there is an important overlap in words between the premise and the conclusion, with the premise having more words than the conclusion. A simple inclusion relation can be defined on aligned common words, with the stipulation that the premise must contain the conclusion. It is clear that this would only work for (some) upward monotone relations, but if such relations are frequent in the applications where inferences are needed, the approach would have merit.
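The inclusion test just described can be illustrated with a minimal sketch. It is not the implementation of any RTE system: the stop-word list, the function names, and the threshold are assumptions made for the illustration.

```python
import re

# Illustrative stop-word list (an assumption, not taken from any RTE system).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "is", "was", "who", "that", "and", "to"}


def content_words(sentence):
    """Lowercase, tokenize on letters and digits, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return {token for token in tokens if token not in STOP_WORDS}


def overlap_score(text, hypothesis):
    """Fraction of the hypothesis content words that also occur in the text."""
    hypothesis_words = content_words(hypothesis)
    if not hypothesis_words:
        return 0.0
    return len(hypothesis_words & content_words(text)) / len(hypothesis_words)


def bag_of_words_entails(text, hypothesis, threshold=1.0):
    """Predict entailment when enough of the hypothesis is contained in the text."""
    return overlap_score(text, hypothesis) >= threshold
```

On an invented appositive pair such as "The award went to Ada Lovelace, a mathematician from London" and "Ada Lovelace was a mathematician", the test correctly predicts entailment. On the invented pair "No student passed the exam" and "No student passed", it also predicts entailment, although the inference does not hold: the words are included, but the context is downward monotone.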
What we just described can be seen as a computational semantics and an NLP approach to the problem of textual inference, although an early example of the bag-of-words approach (Monz & de Rijke, 2001) was actually presented at an ICoS workshop, where most of the work tends to be of the linguistic variety. The two approaches and variations and combinations of both were tested, from 2005 to 2013, in a series of competitions known as RTE Challenges, at first part of the PASCAL challenges, later managed by NIST.
2.1. The Recognizing Textual Entailment Challenges (RTE)
In 2004, Dagan, Glickman, and Magnini proposed a task they called Recognizing Textual Entailment (henceforth RTE), which they defined as recognizing “a directional relationship between pairs of text expressions, denoted by T (the entailing Text) and H (the entailed Hypothesis).” The relation is supposed to hold “if humans would typically infer that H is most likely true.” The relation defined in this way is different from what the linguistic community would consider to be linguistic inference, let alone entailment, in that it is probabilistic in two ways: it is interested in what most people would do, and the judgment of these people doesn’t have to be certain; strong likelihood of the judgment is enough. It can be argued that this is the way that non-linguists use words such as infer, entail, conclude, etc. in daily life. The task differs from linguistic preoccupations by not trying to make a firm distinction between the knowledge that comes from the text itself and the knowledge a typical speaker of a language at a particular moment in time and at a particular place in the world might bring to the interpretation of the text (although specialist knowledge was excluded). Here is an example that illustrates the importance of world knowledge:
Solving (39) relies on the real world knowledge that being U.S. President excludes being the president of the European Commission. This knowledge is needed to exclude the anaphoric link between U.S. President George Bush and his.
Like logical entailment, the relation defined in the RTE task is directional and clearly distinct from paraphrase detection and from the much broader notion of textual similarity. The relation between the Text and the Hypothesis also follows linguistic usage in assuming that reference and time are kept constant from the one to the other.
This view on textual inference has been at the basis of eight RTE competitions (see Dagan, Roth, Sammons, & Zanzotto, 2013, for a detailed description), in which several research teams turned in systems that tried to match the annotations given to a set of T-H pairs. These were provided by a group of students who based their work on existing data sets for four NLP applications that the organizers considered as involving inferencing: Question Answering, Relation Extraction, Multi-document Summarization, and Information Retrieval. They were typically drawn from news sources. Only pairs with high inter-annotator agreement were kept.
In the first competitions, the task was to classify the pairs according to whether the T allowed the H to be inferred or not, a two-way distinction. In RTE-4 and RTE-5, a three-way distinction was also allowed: was the inference likely (entailed), highly unlikely (contradicted), or unknown?
In the first competitions, RTE tasks did not require much discourse sensitivity; for instance, very little anaphora resolution was needed. In later challenges more anaphoric relations and some elliptical ones needed to be resolved.
Some early systems exemplify closely the computational semantics and the NLP approach sketched above.
A good example of a system inspired by linguistic insights is the one from Bos and Markert (2005). It is an elaboration of an approach Johan Bos, together with Patrick Blackburn, developed earlier by coupling DRT with theorem provers and model builders as a teaching tool (Blackburn & Bos, 2005). DRT, as a model-theoretic system, relies on satisfaction in all models to establish the validity of a formula. To make this task computationally doable, the model checking needs to be turned into a proof-theoretic task. For this, Bos and Blackburn and, later, Bos and Markert rely on the theorem provers and model builders developed in the Artificial Intelligence community.
The system, used with the RTE-1 data, combines the statistical Combinatory Categorial Grammar parser (Curran, Clark, & Bos, 2007; Steedman & Baldridge, 2011) with Boxer, which turns the predicate argument representations from the CCG parser into DRT representations, and hooks them up with the Vampire theorem prover and the Paradox model builder. The theorem prover attempts to prove the hypothesis from the premise, whereas the model builder tries to construct a model for the premise and the negation of the hypothesis. If the model builder succeeds, one knows that there cannot be a proof and hence no entailment.
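The complementary roles of the theorem prover and the model builder can be illustrated on a toy propositional case. The sketch below is an assumption-laden simplification: it does not use Vampire, Paradox, or DRT, it represents formulas as Python functions over truth assignments, and it replaces both tools with one exhaustive search, which is only possible because the propositional search space is finite. In first-order logic neither search is guaranteed to terminate, which is one reason to run a prover and a model builder side by side.

```python
from itertools import product


def find_countermodel(premise, hypothesis, atoms):
    """Search for a truth assignment that makes the premise true and the hypothesis false.

    This plays the role of the model builder: a model for the premise together
    with the negation of the hypothesis shows that there is no entailment.
    """
    for values in product([True, False], repeat=len(atoms)):
        valuation = dict(zip(atoms, values))
        if premise(valuation) and not hypothesis(valuation):
            return valuation
    return None


def entails(premise, hypothesis, atoms):
    """The premise entails the hypothesis exactly when no countermodel exists."""
    return find_countermodel(premise, hypothesis, atoms) is None


# Invented toy formulas over two hypothetical atoms.
ATOMS = ["rain", "wet"]
rain_and_rule = lambda v: v["rain"] and (not v["rain"] or v["wet"])  # rain, and rain implies wet
wet = lambda v: v["wet"]
rain = lambda v: v["rain"]
```

Here `entails(rain_and_rule, wet, ATOMS)` holds, whereas for the pair `wet` and `rain` the search returns a countermodel in which it is wet without rain.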
An important ingredient in this approach is formalized background knowledge for the theorem prover. This is provided in the form of first-order axioms. This way of acquiring background information is incomplete and rather labor intensive. To make the system more robust, a distance calculation on the models produced by Paradox has been added, the idea being that if H is entailed by T, the model for the two together would not be informative compared to that for T alone and would not introduce new entities. So the size of the model for T+H compared to the size of the model for T can be used to measure the likelihood of entailment.
Based on this analysis, features relevant for inference recognition were extracted: entailment and inconsistency as calculated by the theorem prover, domain size and model size and the absolute and relative difference in both between the T+H and T. The classification was done with decision trees.
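The feature set just listed can be sketched as a small function. This is a hedged illustration and not the code of Bos and Markert: the class `ModelStats`, the feature names, and the way relative differences are normalized (by the value for T) are assumptions, and the prover outcomes are simply passed in as booleans.

```python
from dataclasses import dataclass


@dataclass
class ModelStats:
    """Sizes reported by a model builder (what counts toward model_size is left open here)."""
    domain_size: int  # number of entities in the model
    model_size: int   # overall size of the model


def relative_difference(new, old):
    """Difference between new and old, normalized by old (0.0 when old is 0)."""
    return (new - old) / old if old else 0.0


def entailment_features(t, t_plus_h, proved_entailment, proved_inconsistency):
    """Collect prover outcomes and size differences between the models for T+H and T."""
    return {
        "entailed": proved_entailment,
        "inconsistent": proved_inconsistency,
        "domain_size_abs_diff": t_plus_h.domain_size - t.domain_size,
        "domain_size_rel_diff": relative_difference(t_plus_h.domain_size, t.domain_size),
        "model_size_abs_diff": t_plus_h.model_size - t.model_size,
        "model_size_rel_diff": relative_difference(t_plus_h.model_size, t.model_size),
    }
```

A dictionary of this kind could then be passed to any off-the-shelf decision tree learner. Differences close to zero correspond to the intuition that an entailed H adds no new entities or information to the model for T.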
The system obtained an accuracy of 56.2%. (The task was a yes/no classification, and the right answers were evenly divided between TRUE and FALSE. So a system that always returned no, or always yes, would have an accuracy of 50%.) A mixed system, using shallow features based on WordNet relations and frequency, had an accuracy score of 61.2%. Descendants of this system are still being developed (Bjerva, Bos, Van der Goot, & Nissim, 2014).
A good example of an early NLP system is that described in Adams (2006). Here, the content words of both the text and the hypothesis are tokenized, then each word of the T is associated with each word of the H, and a similarity score is calculated. The similarity is based on two scores: the distance between the two words in WordNet (an exact word match has a score of 1) and the page count from a search engine for both terms, divided by that of the Text term (note that this relies on a similar intuition as the model size comparison used above in Bos and Markert). This forms the basis for the features that are used to calculate the probability of the inference. Apart from the similarity itself, the number of unmapped tokens and whether the number of negations (terms such as no, not) is even or odd are calculated, as well as the lexical edit distance, which counts the unmapped or negated tokens between two mapped ones. These extracted features are then given to a decision tree algorithm for training and evaluation. The system scores 62.6% on accuracy on the second RTE challenge.
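A rough sketch of this kind of lexical feature extraction is given below. It is not Adams's system: the WordNet distance and the search-engine page counts are replaced by a pluggable `similarity` function that defaults to exact match, no stop words are removed, the edit-distance feature is omitted, and the negation list and the `min_sim` cutoff are assumptions.

```python
import re

# Illustrative negation terms (an assumption; real systems use longer lists).
NEGATION_TERMS = {"no", "not", "never"}


def tokenize(sentence):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def exact_match(text_token, hypothesis_token):
    """Stand-in similarity: 1.0 for identical tokens, 0.0 otherwise."""
    return 1.0 if text_token == hypothesis_token else 0.0


def lexical_features(text, hypothesis, similarity=exact_match, min_sim=0.5):
    """Map each H token to its most similar T token and derive simple features."""
    text_tokens = tokenize(text)
    hypothesis_tokens = tokenize(hypothesis)
    best_scores = [
        max((similarity(t, h) for t in text_tokens), default=0.0)
        for h in hypothesis_tokens
    ]
    negations = sum(1 for token in text_tokens + hypothesis_tokens if token in NEGATION_TERMS)
    return {
        "mean_similarity": sum(best_scores) / len(best_scores) if best_scores else 0.0,
        "unmapped_tokens": sum(1 for score in best_scores if score < min_sim),
        "odd_negations": negations % 2 == 1,
    }
```

For the invented pair "The cat did not sleep" and "The cat slept", the sketch finds one unmapped hypothesis token and an odd number of negations, both of which count against the inference.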
The basic task of an RTE system is to decide whether H is semantically included in T. Following Dagan et al. (2013), we can distinguish a few components that are found in one or another way in most RTE systems:
1. Linguistic analysis of both T and H. In some systems this is very rudimentary, for instance simple tokenization and the omission of ‘stop’ words in the further analysis. In other systems this is very rich, including a syntactic parser or even a semantic representation, Named Entity Recognition, co-reference resolution, and semantic role labeling.
2. Enrichment. This component contains operations that are specific to the entailment task. The two text snippets are manipulated in such a way that the inferential relation between them becomes calculable. To compare the T and the H, some approaches transform the T into the H, for instance by calculating the edits it would take to transform the parsed T tree into the parsed H tree. Edits can be simple deletions or additions but can also include moving subtrees, replacing root nodes and the like. In other approaches, the tree rewrite operations that change the T tree into the H tree resemble old-fashioned transformations. Others still work on quasi-formal representations: a conjunction of predicate argument structures is produced for both snippets. Most of the information that can be derived from the linguistic discussion in the first section of this article will come into play here: for instance, the fact that manage to VP entails VP will result in an edit operation that deletes manage.
When the wording of T and H are different as in (41) (adapted from Dagan et al., 2013), the system has to figure out that LexCorp is an American Company (or that Hudson-based companies are American) as well as the relation between buying and owning, by combining lexical resources with world-knowledge resources.
3. Alignment of T and H. The representations that have been created for the two text snippets are compared, and a similarity score is calculated for both. It is often the case that only a part of the T is relevant to the H, so a subtask is to select the right part. When the linguistic analysis consists of nothing more than tokenization, the similarity measure will be calculated on the words; when the linguistic analysis has resulted in a structured representation, the similarity between these structures is calculated. Even in an example such as (39), this might not be as straightforward as it looks: the system has to compare all the lexical pairs and calculate which ones align in such a way that the conclusion is warranted. To do this safely, one would need a syntactic analysis but, as illustrated in Adams (2006), one can also bet that ignoring some (non-content) words, such as who, in this case, will not reduce the probability of the inference too much and do the calculation on the basis of the string similarity.
4. Finally, the classification of the T-H pair is done. This can be based on a simple subsumption or overlap relation, or it can be done by applying machine-learning techniques after extracting features from the representation. From a linguistic point of view, an important question is how the system handles negation and intensional contexts. In approaches that use machine learning, some features related to intensionality and negation will play a role without this being explicit. A few systems, however, address intensionality and negation directly, for instance, the PARC bridge system (Bobrow et al., 2005) and the system developed by Harabagiu, Hickl, and Lacatusu (2006) described below. The PARC system developed a representation that allowed for the direct formal characterization of intensional contexts. The representation distinguishes three embedded context relations: a veridical one where what is true in the embedded context is true in the embedding context (e.g., forget that), an anti-veridical one in which what is true in the embedded context is not true in the embedding one (e.g., forget to), and an averidical one in which what is true in the embedded context might be true or not in the embedding one (e.g., forget whether). This approach closely follows the linguistic discussion in section 2, “Computational Approaches.”
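The three context relations distinguished in the last component can be made concrete with a small lookup. The sketch is an illustration only and does not reproduce the PARC representation: the lexicon contains just the three forget constructions mentioned above, the names are invented, and the function covers affirmative matrix clauses only (it does not model what happens when the embedding verb is itself negated).

```python
from enum import Enum


class Context(Enum):
    VERIDICAL = "veridical"            # embedded content holds in the embedding context
    ANTIVERIDICAL = "anti-veridical"   # embedded content does not hold
    AVERIDICAL = "averidical"          # embedded content may or may not hold


# Toy lexicon limited to the three constructions named in the text.
CONTEXT_LEXICON = {
    ("forget", "that"): Context.VERIDICAL,
    ("forget", "to"): Context.ANTIVERIDICAL,
    ("forget", "whether"): Context.AVERIDICAL,
}


def embedded_truth(verb, complementizer):
    """Return True, False, or None (unknown) for the embedded clause.

    Assumes an affirmative matrix clause; unlisted constructions count as unknown.
    """
    context = CONTEXT_LEXICON.get((verb, complementizer), Context.AVERIDICAL)
    if context is Context.VERIDICAL:
        return True
    if context is Context.ANTIVERIDICAL:
        return False
    return None
```

Treating unlisted constructions as averidical is a cautious default chosen for the sketch: it never commits the author to the truth or falsity of an embedded clause without lexical evidence.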
An in-depth overview of the various approaches to the RTE challenges can be found in Dagan et al. (2013). Here, we highlight two other approaches that, like the Bos and Markert system described earlier, make interesting use of linguistic insights.
In the context of the RTE challenges, researchers at the Language Computer Corporation developed a set of systems that combined extensive world knowledge with interesting linguistic features. Harabagiu, Hickl, and Lacatusu (2006) describe a version that incorporates an explicit treatment of negation and of some intensional phenomena to detect contradictions. After several preprocessing steps, including parsing, the system flags overt and implicit negations through a large list of possible negation-denoting terms, including verbs such as deny, fail, refuse, etc. It then detects the negated events through training on PropBank and NomBank, in conjunction with the negative markers, making the assumption that the entire predicate-argument structure falls within the scope of the negative marker. For instance, negated entities are those that fall in the NP scope of a negative quantifier or a non-veridical one, such as few, some, many. Negated states are detected because the system has a handcrafted list of state-denoting terms, mainly nouns, extended through training with a Maximum Entropy-based classifier.
A quite successful variant of this system was developed by Hickl and Bensley (2007). It analyzes textual inference in a way that closely resembles the way we described it in the introduction: from a text, we infer a set of propositions that the speaker/hearer is committed to by virtue of making the utterance. After preprocessing, it sends the annotated passages to a Commitment Extraction module, which uses heuristics to create a set of commitments that are inferable from either T or H. The commitments of T and H are aligned, and a decision tree classifier is used to decide which pairs are likely inferences. These are further validated by a system that checks whether there are contradictions with other extracted commitments. If not, the inference between the T and the H is considered to be genuine. The system seems to benefit the most from its use of structural presuppositions of the type illustrated in section 1.2, “Implicatives.” These are identified heuristically and turned into commitments. It achieved 74.6% accuracy in the RTE-4 evaluation.
The NatLog system (MacCartney & Manning, 2008) combines an edit-distance approach with the proof-theoretic natural logic calculation of monotonicity and inclusion relations described in section 1.1.1, “Monotonicity.” The system uses WordNet relations and a decision tree classifier trained on hand-annotated examples to determine natural logic entailment relations between lexical items.
After parsing of both the T and the H by the statistical Stanford phrase-structure parser, the entailments of each token (lexical item) of both sentences are projected upward through the tree. After the computation of the projectivity markings, the alignment between the T and the H is done by a sequence of edits over the spans. The possible edits are: deletion from T, insertion into H, substitution of an H span for a T span, and match of a T and an H span. These ordered edits define a transformation of T into H. For each atomic edit, a lexical entailment model assigns an entailment relation based on the lexical items involved. The effect is projected up the tree using the calculus of relations. The projection must take into account monotonicity. For example, the relation of dance and move is backward entailment (move ⊐ dance), but the substitution of dance for move in not move projects a forward entailment (not move ⊏ not dance) because of the downward monotonic context of the negation.
The entailment relations of each atomic edit are projected up through the tree to determine their effect on the relations of higher constituents according to a calculus of relations. For instance, a join of alternation (Simpy is a dog | Simpy is a cat) and negation yields forward entailment (Simpy is a dog ⊏ Simpy is not a cat).
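The two operations just described, projection through a downward monotone context and the join of relations across successive edits, can be sketched as follows. This is a partial reconstruction for illustration, not the NatLog code: the relation inventory follows the natural logic literature, but the join table lists only a handful of combinations, and the function returns None for pairs it does not cover.

```python
from enum import Enum


class Relation(Enum):
    EQUIVALENCE = "≡"
    FORWARD = "⊏"        # forward entailment
    REVERSE = "⊐"        # backward entailment
    NEGATION = "^"
    ALTERNATION = "|"
    COVER = "‿"
    INDEPENDENCE = "#"


# Under negation, the two entailment directions swap, as do alternation and cover;
# the remaining relations project unchanged.
NEGATION_PROJECTION = {
    Relation.FORWARD: Relation.REVERSE,
    Relation.REVERSE: Relation.FORWARD,
    Relation.ALTERNATION: Relation.COVER,
    Relation.COVER: Relation.ALTERNATION,
}

# Partial join table (only a few combinations are listed in this sketch).
JOIN_TABLE = {
    (Relation.FORWARD, Relation.FORWARD): Relation.FORWARD,
    (Relation.REVERSE, Relation.REVERSE): Relation.REVERSE,
    (Relation.NEGATION, Relation.NEGATION): Relation.EQUIVALENCE,
    (Relation.ALTERNATION, Relation.NEGATION): Relation.FORWARD,
    (Relation.NEGATION, Relation.ALTERNATION): Relation.REVERSE,
}


def project_through_negation(relation):
    """Project a lexical relation through a negated (downward monotone) context."""
    return NEGATION_PROJECTION.get(relation, relation)


def join(first, second):
    """Combine the relations of two successive edits; None if not covered here."""
    if first is Relation.EQUIVALENCE:
        return second
    if second is Relation.EQUIVALENCE:
        return first
    return JOIN_TABLE.get((first, second))
```

With these definitions, projecting the backward entailment between move and dance through negation gives forward entailment, and joining alternation with negation gives forward entailment, matching the two examples in the text.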
The system did well on the FraCaS test suite, but on the RTE sets, the NatLog approach needed to be combined with other techniques to achieve any overall improvements.
As can be inferred from this discussion of RTE systems, the preoccupations of semanticists and pragmatists mainly feed into the enrichment component. Other linguistic tools, such as syntax and morphology, are relevant to the preprocessing. Lexical similarity (synonymy and hyponymy) plays a role in the alignment component.
Throughout the various competitions, there has not been one clear winner of the RTE challenges. It has become clear that a combination of complementary approaches is necessary, and most current approaches combine linguistic insights with knowledge acquisition techniques that exploit electronic text (e.g., VerbOcean) and linguistic resources such as WordNet, VerbNet, FrameNet, PropBank, NomBank, and FactBank.
Some of the promoters of RTE have turned to building a platform that allows different modules to be integrated in different configurations (Excitement; Magnini et al., 2014) and that aims at developing text suites in various languages, but the initiative seems to languish. The set of competitions was important because it made it possible to assess how successful NLP was in a Natural Language Understanding task.
The interest of the RTE experiment lies in the fact that it took on the textual inference task as a whole without domain restrictions. The task, however, had some limitations. A first one is that the data sets were small. Altogether the RTE tasks presented around 7,000 inference pairs. Although some attempts were made to make it at least semi-automatic, inter alia, by extracting headlines and first sentences from articles or pairing Wikipedia edits (Bos et al., 2014, for an Italian RTE task), the development of RTE pairs required much manual work. The development of crowd-sourcing and automatic methods to construct examples has made it possible to construct bigger data sets. SICK is a partially automatically assembled set that is a bit bigger than the whole of RTE (around 10,000 pairs) but much simpler. A much bigger set, SNLI, was released in 2015 by the Stanford University NLP group (Bowman, Angeli, Potts, & Manning, 2015). It contains 570,152 pairs with one human judgment and 56,941 with 5 judgments. It is based on captions for pictures for which Mechanical Turk workers gave additional entailed, neutral, or contradictory descriptions. This data set has the advantage of controlling identity of referents and events but, again, it seems overall to be easier than the original RTE data sets. The Denotation Graph entailment set (Young, Lai, Hodosh, & Hockenmaier, 2014) contains millions of examples. It was labeled completely automatically and is very noisy.
When RTE was conceived, most of the text considered interesting for automatic exploitation was authoritative, declarative text, such as news and scientific literature. This led to a narrow genre focus. In this respect, it is interesting to notice that, whereas presupposition phenomena are quite often exploited in the RTE literature (see, e.g., above), the problem of presupposition projection is largely ignored. The data used explain this neglect. Most texts come from reliable sources. When one asks native speakers to judge the factuality of embedded clauses under verbs of communication, such as say, in statements extracted from newspapers and the like, one finds that readers will often assume that the event mentioned happened with (near) certainty (Manning, de Marneffe, & Potts, 2012). This result raises the question: should one be careful in distinguishing between the factors that contribute to veridicity and those that assign factuality, or should one see factuality assessment as the main goal? At this point in time, the NLP community seems to conflate factuality and veridicity. But when more diverse texts and situations are studied, it may come to see the need to reintroduce the distinction between them.
Another issue that the setup of the RTE contests ignored, and that has come more to the foreground in recent linguistic literature on semantics and pragmatics, is the possibility that semantic interpretations of the same text segment might differ from speaker to speaker. The training and test sets were carefully chosen to avoid such cases (see section 2.1, “The Recognizing Textual Entailment Challenges”). But electronically available text and the possibility to consult native speakers online has allowed research to go beyond the judgments of linguists and their friends. In a crowdsourcing experiment reported in Zaenen (2007, unpublished manuscript), some RTE judgments were confronted with the way naive native speakers judge the same pairs. While the general trend is not against the RTE judgments, there are nevertheless enough differences to suggest a change in methodology.
The RTE initiative appealed mainly to researchers with a statistical approach to NLP. The use of semantic insights in the various submissions was in general rather piecemeal and opportunistic, with the exception of the submissions of the team around Johan Bos and the specific NatLog submission from Stanford, which, to our knowledge, is the only instance where RTE led to linguistically original work. In most systems, the focus was on engineering issues.
Outside of RTE, implementations of linguistic frameworks (e.g., LFG and HPSG), or symbolic AI approaches (e.g., TRIPS) have been used for textual inference, in general in restricted domains. Like the Bos system, they start from a broad-coverage, domain-general, natural language processing system with a rich semantic representation. This general-purpose linguistic system is then combined with reasoners and various domain specific ontologies, databases, etc., for specific applications (see e.g., QUADRI and TRIPS). In these systems, semantic representations take the form, often underspecified, of logical forms (Copestake, 2009), and the separation between the linguistic component and the reasoning component is, in general, cleaner than in the systems developed within the RTE framework. The applications also focus more on certainty than on high probability.
3. How Should Textual Inference Be Circumscribed?
Inferencing on the basis of linguistic input is a task that requires the combination of linguistic knowledge and world knowledge. From a linguistic perspective, textual inference is about inferences that come about because of the use of specific linguistic expressions, but is it possible to distinguish those from other contributions to the inferential potential of an utterance?
The efforts to implement textual inference have shown how much world knowledge needs to be encoded and how much world knowledge and linguistic knowledge are intertwined. To give a rather subtle example: as humans we distinguish effortlessly between the use of go from X to Y in the movement sense and in the extent sense. We might think that we make the right distinction based on knowledge about which things move and which things have extents that are part of knowledge that can be linguistically encoded. But consider the following: This is a long train. It goes from one end of the station to the other. It has to be long because it goes through the whole of Europe. Here we interpret the two mentions of go in a different way in spite of the fact that they pertain to the same train. This can only be done because we have world knowledge about the size of trains and stations and of Europe. Even the best-typed lexical system will not give us this information.
As all lexicographers know, there is no agreement on sense inventories. Even when a resource such as WordNet distinguishes different senses they may not coincide with the distinction that needs to be made for an inference to go through or not. Dagan et al. (2013) observe that, in the context of games, winning implies playing, but in the context of war, winning implies fighting. It is not clear, however, that one would want to say that these are two different senses of win. In both cases winning implies participating. The knowledge about fighting or playing is part of world knowledge, but the boundary is difficult to draw and, however the knowledge is classified, it needs to be available. Similar problems arise in sublanguages, as in the medical domain, where one can conclude from X causes Y that Y is a symptom of X (Dagan et al., 2013).
The novelty of RTE was that it looked at the inference task as a whole, whereas endeavors such as FraCaS were very much centered on specific phenomena. We now appreciate how many different aspects of linguistic and real world knowledge are at issue even in the one-step, non-specialist inferences that the program targeted. But, as the program progressed, many groups came to feel the need for a more analytic approach. Attempts to decompose the tasks are documented in Sammons, Vydiswaran, and Roth (2010), Toledo et al. (2014), and Cabrio and Magnini (2014). Dagan et al. (2013) ends with a list of phenomena that need to be taken into account. It includes the need for knowledge sources about group memberships and the like, and the encoding of reasoning components, such as spatial and numeric reasoning. But it is remarkable how many of the items on the list are essentially linguistic in nature. They range from the need to have a better handle on relative properties and quantities to understand that 3 out of 10 might correspond to some but not to most, to the need for a better treatment of non-intersective modifiers, generics, modal constructions, and negation.
In spite of this need for better linguistic insight, there has been little collaboration between linguists and NLP researchers. In part, this is due to the fact that the aims of the two groups are different. For NLP researchers, it is crucial to have a good extensional grasp of the various phenomena, for instance, to know which items belong to a specific category, while linguists have a more intensional aim; they want to give an intrinsic characterization of the phenomenon.
But a more fundamental difference is that methodology is changing and with it the conceptualization of meaning. NLP researchers are using new methods, whereas most semanticists and pragmatists work in the traditional paradigm. Most of the linguistic research summarized in the first part of this article proceeded under the assumption that inferential properties of lexical items and constructions can be detached from the context they occur in and encapsulated in categorically identifiable pieces of linguistic structure whose meanings are composed by means of a few simple operations to generate the meaning of the larger units. Any squishiness of natural language or of speaker judgments is seen as an abnormality that should disappear with better understanding. NLP research has embraced squishiness and is developing distributed representations to capture similarity of meaning through distributional similarities, specifically vector representations that capture (ratios of) co-occurrence probabilities (for overviews, see Clark, 2015; Erk, 2012). This works well to represent the meaning of content words and their combination with modifiers, but it is less clear how to do full compositionality (for proposals, see Baroni, Bernardi, & Zamparelli, 2014; Coecke, Sadrzadeh, & Clark, 2011): how does one represent truth (factuality, veridicity) through an operation on vectors?
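A toy illustration may help to see both what distributional similarity captures and why it falls short of inference. The vectors below are invented co-occurrence counts over three hypothetical context words, not data from any corpus, and cosine similarity is used here only as one common choice of similarity measure.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Invented counts of co-occurrence with the context words (bark, pet, station).
VECTORS = {
    "dog": [8, 6, 0],
    "cat": [7, 5, 1],
    "train": [0, 1, 9],
}
```

With these made-up counts, dog comes out as much closer to cat than to train. The measure is symmetric, however: the score for dog and cat is the same as the score for cat and dog, so similarity alone cannot express a directional relation such as entailment, let alone truth or factuality.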
Some current work aims to go beyond similarity and to capture the inference relations described in section 1, “Linguistic Underpinnings.” An example is the work by Bowman, Angeli, Potts, and Manning (2015) on the NatLog relations described in section 1.1.1, “Monotonicity.” Starting with the SICK and the SNLI corpora as training sets, recursive neural networks learn representations for the different NatLog relations. In such systems, the positive or negative certainties that are traditionally the hallmarks of valid inferences are just the extreme poles of a scale going from certain to impossible.
The use of these distributional models together with more and more massive data sets produces intriguing results. From a linguistic perspective, questions remain such as whether they can model the differences between at-issue entailments, presuppositions, and invited inferences, or whether these distinctions should be given up. If linguists want to engage with this NLP community, they will have to get involved in building models that take all aspects of inference into account and come to terms with probabilistic aspects of semantics and pragmatics.
Excitement, Excitement Open Platform for Textual Inferences.
FraCaS, the FraCaS Consortium, “Using the Framework.”
FrameNet, FrameNet Project, International Computer Science Institute.
HPSG, CSLI Linguistic Grammars Online (LinGO) Lab at Stanford University.
NIST, National Institute of Standards and Technology, Text Analysis Conference.
NomBank, annotation project at New York University.
SICK, the SICK data set: Sentences Involving Compositional Knowledge.
SNLI, Stanford Natural Language Inference Corpus.
TimeBank, Linguistics Data Consortium.
WordNet, a lexical database for English.
Textual inference as defined here is not a research topic on its own. Aspects of it are discussed as parts of semantics and pragmatics.
Beaver, D. I., & Geurts, B. (2014). Presupposition. In E. N. Zalta (Ed.), The Stanford encyclopedia of philosophy. A source for the state of the art on Presupposition.
Clark, S. (2015). Vector space models of lexical meaning. In S. Lappin & C. Fox (Eds.), The handbook of contemporary semantic theory (pp. 493–522). Malden, MA: Wiley-Blackwell. A very readable introduction for distributional semantics.
Cooper, R., Dobnik, S., Larsson, S., & Lappin, S. (2015). Probabilistic type theory and natural language semantics. Linguistic Issues in Language Technology, 10(4). Along with Goodman & Lassiter, an approach to traditional semantics.
Dagan, I., Roth, D., Sammons, M., & Zanzotto, F. M. (2013). Recognizing textual entailment: Models and applications. Synthesis Lectures on Human Language Technologies 23. San Rafael, CA: Morgan & Claypool. This work is the most extensive overview of the second part. It contains copious references.
Goodman, N., & Lassiter, D. (2015). Probabilistic semantics and pragmatics: Uncertainty in language and thought. In S. Lappin & C. Fox (Eds.), Handbook of contemporary semantic theory (pp. 655–686). Malden, MA: Wiley-Blackwell. Source for recent probabilistic approaches to more traditional semantics.
Goodman, N. D., & Stuhlmüller, A. (2013). Knowledge and implicature: Modeling language understanding as social cognition. Topics in Cognitive Science, 5(1), 173–184. On scalar implicatures, L. Horn is the leading specialist (see reference below). Recently, there is computational work in the framework of the Rational Speech Acts model.
Lappin, S., & Fox, C. (Eds.). (2015). The handbook of contemporary semantic theory. Malden, MA: Wiley-Blackwell. A recent handbook of semantic theory.
MacCartney, B. (2009). Natural language inference (Doctoral dissertation). Stanford University. The most readable introduction to Natural Logic is found in relevant parts of MacCartney’s thesis.
Maienborn, C., von Heusinger, K., & Portner, P. (Eds.). (2012). Semantics, Vol 1. Berlin: Mouton de Gruyter. Treats much of the material of the first section of this overview.
Moss, L. (2015). Natural logic. In S. Lappin & C. Fox (Eds.), The handbook of contemporary semantic theory (pp. 561–592). Cambridge, MA: Wiley-Blackwell. A more formal introduction.
Potts, C. (2005). The logic of conventional implicatures. Oxford Studies in Theoretical Linguistics 7. Oxford: Oxford University Press. A clear treatment of one type of presupposition, which Potts calls Conventional Implicatures.
Proceedings of Inference in Computational Semantics (ICoS). The workshops are a source for computational treatments of inference.
Adams, R. (2006). Textual entailment through extended lexical overlap. In Proceedings of the Second PASCAL Challenges Workshop on Recognizing Textual Entailment.
Allen, J., Chambers, N., Ferguson, G., Galescu, L., Jung, H., Swift, M., et al. (2007). PLOW: A Collaborative Task Learning Agent. Presented at the Twenty-Second National Conference on Artificial Intelligence (AAAI), Vancouver, BC.
Anand, P., & Hacquard, V. (2014). Factivity, belief, and discourse. In C. Crnič & U. Sauerland (Eds.), The art and craft of semantics: A festschrift for Irene Heim. MIT Working Papers in Linguistics.
Austin, J. L. (1962). How to do things with words. The William James Lectures, 1955, Cambridge, MA: Harvard University Press.
Baroni, M., Bernardi, R., & Zamparelli, R. (2014). Frege in space: A program of compositional distributional semantics. Linguistic Issues in Language Technology, 9.
Beaver, D. (2010). Have you noticed that your belly button lint color is related to the color of your clothing? In R. Bauerle, U. Reyle, & T. Zimmermann (Eds.), Presuppositions and discourse: Essays offered to Hans Kamp (pp. 65–100). Bingley, U.K.: Emerald.
Benthem, J. V. (1995). Language in Action. Cambridge, MA: MIT Press.
Benthem, J. V. (2008). A brief history of natural logic. Technical Report PP-2008-05, Institute for Logic, Language & Computation.
Bjerva, J., Bos, J., Van der Goot, R., & Nissim, M. (2014). The meaning factory: Formal semantics for recognizing textual entailment and determining semantic similarity. Proceedings of the 8th International Workshop on Semantic Evaluation.
Blackburn, P., & Bos, J. (2005). Representation and inference for natural language. Stanford, CA: CSLI.
Bobrow, D., Cheslow, B., Condoravdi, C., Karttunen, L., King, T. H., Nairn, R., et al. (2007). PARC’s bridge and question answering system. Proceedings of the GEAF 2007 Workshop. CSLI Online.
Bobrow, D., Condoravdi, C., Crouch, R., Kaplan, R., Karttunen, L., King, T., et al. (2005). A basic logic for textual inference. Inference for Textual Question Answering. Workshops of the 12th International Conference on Artificial Intelligence.
Bobrow, D., Condoravdi, C., Richardson, K., Waldinger, R., & Das, A. (2011). Deducing answers to English questions from structured data. In Proceedings of the 16th International Conference on Intelligent User Interfaces (pp. 299–302). New York: ACM.
Bos, J., & Markert, K. (2005). Recognizing textual entailment with logical inference. In Proceedings of the Human Language Technology Conference and the Conference on Empirical Methods in Natural Language Processing (pp. 628–635). Vancouver, BC.
Bos, J., Zanzotto, F. M., & Pennacchiotti, M. (2009). Textual Entailment at EVALITA 2009. In Proceedings of EVALITA 2009.
Bowman, S. R., Angeli, G., Potts, C., & Manning, C. D. (2015). A large annotated corpus for learning natural language inference. In Conference Proceedings for EMNLP 2015. Lisbon, Portugal.
Bowman, S. R., Potts, C., & Manning, C. D. (2015). Recursive neural networks can learn logical semantics. Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality.
Cabrio, E., & Magnini, B. (2014). Decomposing semantic inferences. Linguistic Issues in Language Technology, 9.
Clark, S. (2015). Vector space models of lexical meaning. In S. Lappin & C. Fox (Eds.), The handbook of contemporary semantic theory (pp. 493–522). Malden, MA: Wiley-Blackwell.
Condoravdi, C., Crouch, D., de Paiva, V., Stolle, R., & Bobrow, D. G. (2003). Entailment, intensionality and text understanding. In Proceedings of the HLT-NAACL 2003 Workshop on Text Meaning. Association for Computational Linguistics. doi:10.3115/1119239.1119245
Coecke, B., Sadrzadeh, M., & Clark, S. (2011). Mathematical foundations for a compositional distributed model of meaning. Linguistic Analysis, 36, 345–384.
Cooper, R., Crouch, D., Van Eijck, J., Fox, C., Van Genabith, J., Jaspers, J., et al. (1996). Using the Framework. The FraCaS Consortium. doi: 10.1.1.45.7694
Copestake, A. (2009). Slacker semantics: Why superficiality, dependency and avoidance of commitment can be the right way to go. In Proceedings of the 12th Conference of the European Chapter of the ACL. Association for Computational Linguistics.
Curran, J., Clark, S., & Bos, J. (2007). Linguistically Motivated Large-Scale NLP with C&C and Boxer. Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions.
Dagan, I., Dolan, B., Magnini, B., & Roth, D. (Eds.). (2009). Textual entailment [Special issue]. Natural Language Engineering, 15(4).
Dagan, I., Glickman, O., & Magnini, B. (2006). The PASCAL recognizing textual entailment challenge. In J. Quiñonero-Candela, I. Dagan, B. Magnini, & F. d’Alché-Buc (Eds.), Machine Learning Challenges (pp. 177–190). Berlin: Springer.
Dagan, I., Roth, D., Sammons, M., & Zanzotto, F. M. (2013). Recognizing textual entailment: Models and applications. San Rafael, CA: Morgan & Claypool.
Eijck, J. V. (2007). Natural logic for natural language. In B. ten Cate & H. Zeevat (Eds.), Logic, Language, and Computation; 7th International Tbilisi Symposium on Logic, Language, and Computation (pp. 216–230). Berlin: Springer.
Erk, K. (2012). Vector space models of word meaning and phrase meaning: A survey. Language and Linguistics Compass, 6(10), 635–653.
Fillmore, C. (2002). Form and meaning in language. Stanford, CA: CSLI.
Fontenelle, T. (Ed.). (2003). FrameNet and frame semantics. International Journal of Lexicography, 16(3), 231.
Geurts, B., Beaver, D. I., & Maier, E. (2016). Discourse representation theory. In E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy.
Grice, H. P. (1975). Logic and conversation. In P. Cole & J. Morgan (Eds.), Syntax and semantics, Vol. 3: Speech acts (pp. 41–58). London: Academic Press.
Harabagiu, S., Hickl, A., & Lacatusu, F. (2006). Negation, contrast and contradiction in text processing. AAAI 6.
Heim, I. (1982). The semantics of definite and indefinite noun phrases (Doctoral thesis). University of Massachusetts, Amherst, MA.
Hickl, A., & Bensley, J. (2007). A discourse commitment-based framework for recognizing textual entailment. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing.
Horn, L. (2006). Implicature. In L. Horn & G. Ward (Eds.), The handbook of pragmatics (pp. 1–28). Malden, MA: Wiley-Blackwell.
Icard, T. (2012). Inclusion and exclusion in natural language. Studia Logica, 100(4), 705–725.
Kamp, H. (2011). Discourse representation theory. In C. Maienborn, K. von Heusinger, & P. Portner (Eds.), Semantics (pp. 872–923). Berlin: Mouton de Gruyter.
Karttunen, L. (1971). Implicative verbs. Language, 47, 340–358.
Karttunen, L. (1974). Presupposition and linguistic context. Theoretical linguistics, 1, 181–194.
Karttunen, L. (2011). You will be lucky to break even. In T. H. King & V. de Paiva (Eds.), From quirky case to representing space (pp. 167–180). San Rafael, CA: CSLI.
Karttunen, L., & Peters, S. (1979). Conventional implicature. In Ch.-K. Oh & D. A. Dinneen (Eds.), Syntax and semantics, Vol. 11: Presupposition (pp. 1–56). New York: Academic Press.
Karttunen, L., Peters, S., Zaenen, A., & Condoravdi, C. (2014). The chameleon-like nature of evaluative adjectives. Empirical Issues in Syntax and Semantics, 10, 233–250.
Kiparsky, P., & Kiparsky, C. (1970). Fact. In M. Bierwisch & K. Heidolph (Eds.), Progress in linguistics (pp. 143–173). The Hague: Mouton.
Kratzer, A. (1991). Modality. In A. von Stechow & D. Wunderlich (Eds.), Semantics: An international handbook of contemporary research (pp. 639–650). Berlin: Walter de Gruyter.
Langendoen, D. T., & Savin, H. (1971). The projection problem for presuppositions. In C. Fillmore & D. T. Langendoen (Eds.), Studies in linguistic semantics (pp. 373–388). New York: Holt, Rinehart and Winston.
Lassiter, D. (forthcoming). Graded Modality: Qualitative and Quantitative Perspectives. Oxford University Press.
Lewis, D. (1979). Scorekeeping in a language game. Journal of Philosophical Logic, 8, 339–359.
Levin, B. (1993). English verb classes and alternations: A preliminary investigation. Chicago: University of Chicago Press.
MacCartney, B. (2009). Natural language inference (Doctoral dissertation). Stanford University.
MacCartney, B., & Manning, C. (2008). Modeling semantic containment and exclusion in natural language inference. Proceedings of the 22nd International Conference on Computational Linguistics.
Magnini, B., Zanoli, R., Dagan, I., Eichler, K., Neumann, G., Noh, et al. (2014). The Excitement Open Platform for Textual Inferences. In Proceedings of Association for Computational Linguistics; Systems Demonstrations.
Manning, C., de Marneffe, M.-C., & Potts, C. (2012). Did it happen? The pragmatic complexity of veridicality assessment. Computational Linguistics, 38(2), 301–332.
Martin, J. R., & White, P. R. R. (2005). The language of evaluation. Appraisal in English. New York: Palgrave Macmillan.
Monz, C., & de Rijke, M. (2001). Light-weight entailment checking for computational semantics. In P. Blackburn & M. Kohlhase (Eds.), Proceedings of the 3rd Workshop on Inference in Computational Semantics (ICoS-3).
Moss, L. (2015). Natural logic. In S. Lappin & C. Fox (Eds.), The handbook of contemporary semantic theory (pp. 561–592). Cambridge, MA: Wiley-Blackwell.
Nairn, R., Condoravdi, C., & Karttunen, L. (2006). Computing relative polarity for textual inference. In Proceedings of ICoS-5 (Inference in Computational Semantics).
Potts, C. (2005). The logic of conventional implicatures. Oxford Studies in Theoretical Linguistics 7. Oxford: Oxford University Press.
Potts, C. (2007). Into the conventional-implicature dimension. Philosophy Compass, 2(4), 665–679.
Potts, C. (2015). Presupposition and implicature. In S. Lappin & C. Fox (Eds.), The handbook of contemporary semantic theory (pp. 168–201). Oxford: Wiley.
Sammons, M., Vydiswaran, V. G. V., & Roth, D. (2010). Do not ask what textual entailment can do for you…. In Proceedings of the 48th Annual Meeting of the ACL.
Sánchez Valencia, V. (1991). Studies on natural logic and categorial grammar (Doctoral dissertation). University of Amsterdam, Netherlands.
Searle, J. (1989). How performatives work. Linguistics and Philosophy, 12, 535–558.
Simons, M., Beaver, D., Roberts, C., & Tonhauser, J. (2016). The best question: Explaining the projection behavior of factive verbs. Discourse Processes.
Steedman, M., & Baldridge, J. (2011). Combinatory categorial grammar. In R. Borsley & K. Borjars (Eds.), Non-Transformational Syntax (pp. 181–224). Blackwell.
Toledo, A., Alexandropoulou, S., Chesney, S., Katrenko, S., Klockmann, H., Kokke, P., et al. (2014). Towards a semantic model for textual entailment. Linguistic Issues in Language Technology, 9.
van der Sandt, R. (1992). Presupposition projection as anaphora resolution. Journal of Semantics, 9, 333–392.
von Fintel, K. (2008). What is presupposition accommodation. Again? Philosophical Perspectives, 22, 137–170.
Young, P., Lai, A., Hodosh, M., & Hockenmaier, J. (2014). From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2, 67–78.
Zaenen, A. (2007). Give a penny for their thoughts (unpublished paper).
Zaenen, A., Karttunen, L., & Crouch, R. (2005). Local textual inference: Can it be defined or circumscribed? ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment.