Train-O-Matic: Large-Scale Supervised Word Sense Disambiguation in Multiple Languages without Manual Training Data

Tommaso Pasini and Roberto Navigli
Department of Computer Science
Sapienza University of Rome
{pasini,navigli}@di.uniroma1.it

Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 78–88, Copenhagen, Denmark, September 7–11, 2017.

Abstract

Annotating large numbers of sentences with senses is the heaviest requirement of current Word Sense Disambiguation. We present Train-O-Matic, a language-independent method for generating millions of sense-annotated training instances for virtually all meanings of words in a language’s vocabulary. The approach is fully automatic: no human intervention is required and the only type of human knowledge used is a WordNet-like resource. Train-O-Matic achieves consistently state-of-the-art performance across gold standard datasets and languages, while at the same time removing the burden of manual annotation. All the training data is available for research purposes at http://trainomatic.org.

1 Introduction

Word Sense Disambiguation (WSD) is a key task in computational lexical semantics, inasmuch as it addresses the lexical ambiguity of text by making explicit the meaning of words occurring in a given context (Navigli, 2009). Anyone who has struggled with frustratingly unintelligible translations from an automatic system, or with the meaning bias of search engines, can understand the importance for an intelligent system to go beyond the surface appearance of text.

There are two mainstream lines of research in WSD: supervised and knowledge-based WSD. Supervised WSD frames the problem as a classical machine learning task in which, first, a training phase occurs aimed at learning a classification model from sentences annotated with word senses and, second, the model is applied to previously unseen sentences focused on a target word. A key difference from many other problems, however, is that the classes to choose from (i.e., the senses of a target word) vary for each word, therefore requiring a separate training process to be performed on a word-by-word basis. As a result, hundreds of training instances are needed for each ambiguous word in the vocabulary. This would necessitate a million-item training set to be manually created for each language of interest, an endeavour that is currently beyond reach even in resource-rich languages like English.

The second paradigm, i.e., knowledge-based WSD, takes a radically different approach: the idea is to exploit a general-purpose knowledge resource like WordNet (Fellbaum, 1998) to develop an algorithm which can take advantage of the structural and lexical-semantic information in the resource to choose among the possible senses of a target word occurring in context. For example, a PageRank-based algorithm can be developed to determine the probability of a given sense being reached starting from the senses of its context words. Recent approaches of this kind have been shown to obtain competitive results (Agirre et al., 2014; Moro et al., 2014). However, due to its inherent nature, knowledge-based WSD tends to adopt bag-of-words approaches which do not exploit the local lexical context of a target word, including function and collocation words, which limits this approach in some cases.

In this paper we get the best of both worlds and present Train-O-Matic, a novel method for generating huge high-quality training sets for all the words in a language’s vocabulary. The approach is language-independent, thanks to its use of a multilingual knowledge resource, BabelNet (Navigli and Ponzetto, 2012), and it can be applied to any kind of corpus. The training sets produced with Train-O-Matic are shown to provide competitive performance with those of manually and semi-
automatically tagged corpora. Moreover, state-of-the-art performance is also reported for low-resourced languages (i.e., Italian and Spanish) and domains, where manual training data is not available.

© 2017 Association for Computational Linguistics

2 Building a Training Set from Scratch

In this Section we present Train-O-Matic, a language-independent approach to the automatic construction of a sense-tagged training set. Train-O-Matic takes as input a corpus C (e.g., Wikipedia) and a semantic network G = (V, E). We assume a WordNet-like structure of G, i.e., V is the set of concepts (i.e., synsets) such that, for each word w in the vocabulary, Senses(w) is the set of vertices in V that are expressed by w, e.g., the WordNet synsets that include w as one of their senses. Train-O-Matic consists of three steps:

• Lexical profiling: for each vertex in the semantic network, we compute its Personalized PageRank vector, which provides its lexical-semantic profile (Section 2.1).

• Sentence scoring: for each sentence containing a word w, we compute a probability distribution over all the senses of w based on its context (Section 2.2).

• Sentence ranking and selection: for each sense s of a word w in the vocabulary, we select those sentences that are most likely to use w in the sense of s (Section 2.3).

2.1 Lexical profiling

In terms of semantic networks the probability of reaching a node v′ starting from v can be interpreted as a measure of relatedness between the synsets v and v′. Thus we define the lexical profile of a vertex v in a graph G = (V, E) as the probability distribution over all the vertices v′ in the graph. Such a distribution is computed by applying the Personalized PageRank (PPR) algorithm, a variant of the traditional PageRank (Brin and Page, 1998). While the latter is equivalent to performing random walks with uniform restart probability on every vertex at each step, PPR makes the restart probability non-uniform, thereby concentrating more probability mass in the surroundings of those vertices having higher restart probability. Formally, (P)PR is computed as follows:

$$v^{(t+1)} = (1 - \alpha)\, v^{(0)} + \alpha M v^{(t)} \qquad (1)$$

where M is the row-normalized adjacency matrix of the semantic network, the restart probability distribution is encoded by vector $v^{(0)}$, and α is the well-known damping factor usually set to 0.85 (Brin and Page, 1998). If we set $v^{(0)}$ to a unit probability vector (0, ..., 0, 1, 0, ..., 0), i.e., restart is always on a given vertex, PPR outputs the probability of reaching every vertex starting from the restart vertex after a certain number of steps. This approach has been used in the literature to create semantic signatures (i.e., profiles) of individual concepts, i.e., vertices of the semantic network (Pilehvar et al., 2013), and then to determine the semantic similarity of concepts. As also done by Pilehvar and Collier (2016), we instead use the PPR vector as an estimate of the conditional probability of a word w′ given the target sense¹ s ∈ V of word w:

$$P(w' \mid s, w) = \frac{\max_{s' \in Senses(w')} v_s(s')}{Z} \qquad (2)$$

where $Z = \sum_{w''} P(w'' \mid s, w)$ is a normalization constant, $v_s$ is the vector resulting from an adequate number of random walks used to calculate PPR, and $v_s(s')$ is the vector component corresponding to sense s′. To fix the number of iterations needed to have a sufficiently accurate vector, we follow Lofgren et al. (2014) and set the error δ = 0.00001 and the number of iterations to 1/δ = 100,000. As a result of this lexical profiling step we have a probability distribution over vocabulary words for each given word sense of interest.

2.2 Sentence scoring

The objective of the second step is to score the importance of word senses for each of the corpus sentences which contain the word of interest. Given a sentence σ = w_1, w_2, ...
, w_n, for a given target word w in the sentence (w ∈ σ), and for each of its senses s ∈ Senses(w), we compute the probability P(s|σ, w). Thanks to Bayes’ theorem we can determine the probability of sense s of w given the sentence as follows:

$$P(s \mid \sigma, w) = \frac{P(\sigma \mid s, w)\, P(s \mid w)}{P(\sigma \mid w)} \qquad (3)$$

$$= \frac{P(w_1, \ldots, w_n \mid s, w)\, P(s \mid w)}{P(w_1, \ldots, w_n \mid w)}$$

$$\propto P(w_1, \ldots, w_n \mid s, w)\, P(s \mid w) \qquad (4)$$

$$\approx P(w_1 \mid s, w) \cdots P(w_n \mid s, w)\, P(s \mid w) \qquad (5)$$

where Formula 4 is proportional to the original probability (due to removing the constant in the denominator) and is approximated with Formula 5 due to the assumption of independence of the words in the sentence. $P(w_i \mid s, w)$ is calculated as in Formula 2 and P(s|w) is set to 1/|Senses(w)| (recall that s is a sense of w). For example, given the sentence σ = “A match is a tool for starting a fire”, the target word w = match and its set of senses $S_{match} = \{s^1_{match}, s^2_{match}\}$, where $s^1_{match}$ is the sense of lighter and $s^2_{match}$ is the sense of game match, we want to calculate the probability of each $s^i_{match} \in S_{match}$ of being the correct sense of match in the sentence σ. Following Formula 5 we have:

¹ Note that we use senses and concepts (synsets) interchangeably, because – given a word – a word sense unambiguously determines a concept (i.e., the synset it is contained in) and vice versa.

2.3 Sense-based sentence ranking and selection

Finally, for a given word w and a given sense $s_1 \in Senses(w)$, we score each sentence σ in which w appears and $s_1$ is its most likely sense according to a formula that takes into account the difference between the first (i.e., $s_1$) and the second most likely sense of w in σ:

$$\Delta_{s_1}(\sigma) = P(s_1 \mid \sigma, w) - P(s_2 \mid \sigma, w) \qquad (6)$$

where $s_1 = \arg\max_{s \in Senses(w)} P(s \mid \sigma, w)$, and $s_2 = \arg\max_{s \in Senses(w) \setminus \{s_1\}} P(s \mid \sigma, w)$. We then sort all sentences based on $\Delta_{s_1}(\cdot)$ and return a ranked list of sentences where word w is most likely to be sense-annotated with $s_1$.
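A minimal sketch of how Formulas 5 and 6 could be combined is given below. It uses the per-word probabilities of the match example (tool, start, fire) as input; the function names, the dictionary layout and the step that normalises the Formula 5 scores over the senses of w before taking the difference are assumptions made for illustration, not the authors' implementation.

```python
import math

def sense_scores(context_probs):
    """Formula 5 without normalisation: uniform prior times the product of P(w_i | s, w)."""
    prior = 1.0 / len(context_probs)  # P(s | w) = 1 / |Senses(w)|
    return {sense: prior * math.prod(probs) for sense, probs in context_probs.items()}

def margin(scores):
    """Formula 6: gap between the two most likely senses.

    Assumption: posteriors are obtained by normalising the Formula 5 scores
    over the senses of w, so that margins are comparable across sentences.
    """
    total = sum(scores.values())
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_sense, best = ranked[0]
    second = ranked[1][1] if len(ranked) > 1 else 0.0
    if total == 0.0:
        return best_sense, 0.0
    return best_sense, (best - second) / total

def rank_sentences(corpus):
    """Sort (sentence, context_probs) pairs by decreasing margin."""
    scored = [(text, *margin(sense_scores(probs))) for text, probs in corpus]
    return sorted(scored, key=lambda item: item[2], reverse=True)

# P(tool | s, match), P(start | s, match), P(fire | s, match) for each sense.
match_example = {
    "lighter": [2.1e-4, 2e-3, 1e-2],
    "game": [1e-5, 2.9e-4, 1e-6],
}
scores = sense_scores(match_example)
best_sense, delta = margin(scores)
```

Under these assumptions the lighter sense wins by a wide margin, so the sentence would be ranked near the top of the list for that sense.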
Although we recognize that other scoring strategies could have been used, this was experimentally the most effective one when compared to alternative strategies, i.e., the sense probability, the number of words related to the target word w, the sentence length or a combination thereof.

3 Creating a Denser and Multilingual Semantic Network

In the previous Section we assumed that WordNet was our semantic network, with synsets as vertices and edges represented by its semantic relations. However, while its lexical coverage is high, with a rich set of fine-grained synsets, at the relation level WordNet provides mainly paradigmatic information, i.e., relations like hypernymy (is-a) and meronymy (part-of). It lacks, on the other hand, syntagmatic relations, such as those that connect verb synsets to their arguments (e.g., the appropriate senses of eat_v and food_n), or pairs of noun synsets (e.g., the appropriate senses of bus_n and driver_n). Intuitively, Train-O-Matic would suffer from such a lack of syntagmatic relations, as the relevance of a sense for a given word in a sentence depends directly on the possibility of visiting senses of the other words in the same sentence (cf. Formula 5) via random walks as calculated with Formula 1. Such reachability depends on the connections available between synsets. Because syntagmatic relations are sparse in WordNet, if it was used on its own, we would end up with a poor ranking of sentences for any given word sense.
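To make the role of connectivity concrete, here is a minimal sketch of the iteration in Formula 1 on a hypothetical four-synset graph. The graph, the synset names and the function `ppr` are illustrative assumptions, not the authors' implementation. The update pushes probability mass along the rows of M (each vertex splits its mass evenly among its neighbours), which is one common reading of Formula 1 that keeps v a probability distribution, and it runs a fixed 100 iterations rather than the setting reported in Section 2.1.

```python
def ppr(adj, start, alpha=0.85, iters=100):
    """Personalized PageRank by power iteration with restart on `start`.

    adj maps each vertex to its neighbour list; every vertex is assumed to
    have at least one neighbour (no dangling vertices in this toy graph).
    """
    v0 = {n: 1.0 if n == start else 0.0 for n in adj}
    v = dict(v0)
    for _ in range(iters):
        nxt = {n: (1 - alpha) * v0[n] for n in adj}
        for src, nbrs in adj.items():
            share = alpha * v[src] / len(nbrs)  # uniform, row-normalised weights
            for dst in nbrs:
                nxt[dst] += share
        v = nxt
    return v

# Paradigmatic edges only (hypothetical): bus-vehicle and driver-person.
sparse = {"bus": ["vehicle"], "vehicle": ["bus"],
          "driver": ["person"], "person": ["driver"]}
# Same graph plus one syntagmatic edge bus-driver.
dense = {"bus": ["vehicle", "driver"], "vehicle": ["bus"],
         "driver": ["person", "bus"], "person": ["driver"]}

print(ppr(sparse, "bus")["driver"])  # driver is unreachable from bus
print(ppr(dense, "bus")["driver"])   # driver now receives probability mass
```

With paradigmatic edges only, the driver synset receives no mass from bus, so a sentence mentioning drivers would contribute no evidence for that sense of bus in Formula 5; a single added edge changes that.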
For the sentence σ = “A match is a tool for starting a fire” of Section 2.2, Formula 5 gives:

$$P(s^1_{match} \mid \sigma, match) \approx P(tool \mid s^1_{match}, match) \cdot P(start \mid s^1_{match}, match) \cdot P(fire \mid s^1_{match}, match) \cdot P(s^1_{match} \mid match) = 2.1 \cdot 10^{-4} \cdot 2 \cdot 10^{-3} \cdot 10^{-2} \cdot 5 \cdot 10^{-1} = 2.1 \cdot 10^{-9}$$

$$P(s^2_{match} \mid \sigma, match) \approx P(tool \mid s^2_{match}, match) \cdot P(start \mid s^2_{match}, match) \cdot P(fire \mid s^2_{match}, match) \cdot P(s^2_{match} \mid match) = 10^{-5} \cdot 2.9 \cdot 10^{-4} \cdot 10^{-6} \cdot 5 \cdot 10^{-1} = 1.45 \cdot 10^{-15}$$

As can be seen, the first sense of match has a much higher probability due to its stronger relatedness to the other words in the context (i.e. start, fire and tool). Note also that all the probabilities for the second sense are at least one order of magnitude less than the probability of the first sense.

| mouse (animal): WordNet | mouse (animal): WordNet_BN | mouse (device): WordNet | mouse (device): WordNet_BN |
|---|---|---|---|
| mouse^1_n | mouse^1_n | mouse^4_n | mouse^4_n |
| tail^1_n | little^1_a | wheel^1_n | computer^1_n |
| hairless^1_a | rodent^1_n | electronic device^1_n | pad^4_n |
| rodent^1_n | cheese^1_n | ball^3_n | cursor^1_n |
| trunk^3_n | cat^1_n | hand operated^1_n | operating system^1_n |
| elongate^2_a | rat^1_n | mouse button^1_n | trackball^1_n |
| house mouse^1_n | elephant^1_n | cursor^1_n | wheel^1_n |
| minuteness^1_n | pet^1_n | operate^3_v | joystick^1_n |
| nude mouse^1_n | experiment^1_n | object^1_n | Windows^1_n |

Table 1: Top-ranking synsets of the PPR vectors computed on WordNet (first and third columns) and WordNet_BN (second and fourth columns) for mouse as animal (left) and as device (right).

Moreover, even though the methodology presented in Section 2 is language-independent, Train-O-Matic would lack information (e.g. senses for a word in an arbitrary vocabulary) for languages other than English. To cope with these issues, we exploit BabelNet², a huge multilingual semantic network obtained from the automatic integration of WordNet, Wikipedia, Wiktionary and other resources (Navigli and Ponzetto, 2012), and create the BabelNet subgraph induced by the WordNet vertices.
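As a rough illustration of what "induced by the WordNet vertices" means, the sketch below keeps only those edges whose two endpoints both belong to a chosen vertex set. The edge list, the synset identifiers and the function name `induced_subgraph` are hypothetical and do not correspond to any BabelNet API.

```python
def induced_subgraph(edges, keep):
    """Return the edges whose endpoints are both in `keep`."""
    keep = set(keep)
    return {(u, v) for (u, v) in edges if u in keep and v in keep}

# Hypothetical BabelNet-like edges; identifiers are made up for illustration.
babelnet_edges = {
    ("bus", "vehicle"),         # relation coming from WordNet
    ("bus", "driver"),          # relation coming from a Wikipedia hyperlink
    ("bus", "wiki_only_page"),  # endpoint with no WordNet synset
}
wordnet_vertices = {"bus", "vehicle", "driver"}

subgraph = induced_subgraph(babelnet_edges, wordnet_vertices)
```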
The result is a graph whose vertices are BabelNet synsets that contain at least one WordNet synset and whose edge set includes all those relations in BabelNet coming either from WordNet itself or from links in other resources mapped to WordNet (such as hyperlinks in a Wikipedia article connecting it to other articles). The greatest contribution of syntagmatic relations comes, indeed, from Wikipedia, as its articles are linked to related articles (e.g., the English Wikipedia Bus article³ is linked to Passenger, Tourism, Bus lane, Timetable, School, and many more).

Because not all Wikipedia (and other resources’) pages are connected with the same degree of relatedness (e.g., countries are often linked, but they are not necessarily closely related to the source article in which the link occurs), we apply the following weighting strategy to each edge (s, s′) ∈ E of our WordNet-induced subgraph of BabelNet G = (V, E):

$$w(s, s') = \begin{cases} 1 & (s, s') \in E(\text{WordNet}) \\ WO(s, s') & \text{otherwise} \end{cases} \qquad (7)$$

where E(WordNet) is the edge set of the original WordNet graph and WO(s, s′) is the weighted overlap measure which calculates the similarity between two synsets:

$$WO(s, s') = \frac{\sum_{i=1}^{|S|} \left(r^1_i + r^2_i\right)^{-1}}{\sum_{i=1}^{|S|} (2i)^{-1}}$$

where $r^1_i$ and $r^2_i$ are the rankings of the i-th synsets in the set S of the components in common between the vectors associated with s and s′, respectively. Because at this stage we still have to calculate our synset vector representation, we use the precomputed NASARI vectors (Camacho-Collados et al., 2015) to calculate WO. This choice is due to WO’s higher performance over cosine similarity for vectors with explicit dimensions (Pilehvar et al., 2013). As a result, each row of the original adjacency matrix M of G will be replaced with the weights calculated in Formula 7 and then normalized in order to be ready for PPR calculation (see Formula 1).
An idea of why a denser semantic network has more useful connections and thus leads to better results is provided by the example in Table 1⁴, where we show the highest-probability synsets in the PPR vectors calculated with Formula 1 for two different senses of mouse (its animal and device senses) when WordNet (left) and our WordNet-induced subgraph of BabelNet (WordNet_BN, right) are used as the underlying semantic network for PPR computation. Note that WordNet’s top synsets are related to the target synset via paradigmatic (i.e., hypernymy and meronymy) relations, while WordNet_BN includes many syntagmatically-related synsets (e.g., experiment for the animal, and operating system and Windows for the device sense, among others).

² http://babelnet.org
³ Retrieved on February 3rd, 2017.
⁴ We use the notation w^k_p introduced in (Navigli, 2009) to denote the k-th sense of word w with part-of-speech tag p.

4 Experimental Setup

Corpora for sense annotation. We used two different corpora to extract sentences: Wikipedia and the United Nations Parallel Corpus (Ziemski et al., 2016). The first is the largest and most up-to-date encyclopedic resource, containing definitional information; the second, on the other hand, is a public collection of parliamentary documents of the United Nations. The application of Train-O-Matic to the two corpora produced two sense-annotated datasets, which we named T-O-M_Wiki and T-O-M_UN, respectively.

Semantic Network. We created sense-annotated corpora with Train-O-Matic both when using PPR vectors computed from vanilla WordNet and when using WordNet_BN, our denser network obtained from the WordNet-induced subgraph of BabelNet (see Section 3).

Gold standard datasets. We performed our evaluations using the framework made available by Raganato et al. (2017a) on five different all-words datasets, namely: the Senseval-2 (Edmonds and Cotton, 2001), Senseval-3 (Snyder and Palmer, 2004), SemEval-2007 (Pradhan et al., 2007), SemEval-2013 (Navigli et al., 2013) and SemEval-2015 (Moro and Navigli, 2015) WSD datasets. We focused on nouns only, given the fact that Wikipedia provides connections between nominal synsets only, and therefore contributes mainly to syntagmatic relations between nouns.

Comparison sense-annotated corpora. To show the impact of our T-O-M corpora in WSD, we compared its performance on the above gold standard datasets, against training with:

• SemCor (Miller et al., 1993), a corpus containing about 226,000 words annotated manually with WordNet senses.

• One Million Sense-Tagged Instances (Taghipour and Ng, 2015, OMSTI), a sense-annotated dataset obtained via a semi-automatic approach based on the disambiguation of a parallel corpus, i.e., the United Nations Parallel Corpus, performed by exploiting manually translated word senses. Because OMSTI integrates SemCor to increase coverage, to keep a level playing field we excluded the latter from the corpus.

We note that T-O-M, instead, is fully automatic and does not require any WSD-specific human intervention nor any aligned corpus.

Reference system. In all our experiments, we used It Makes Sense (Zhong and Ng, 2010, IMS), a state-of-the-art WSD system based on linear Support Vector Machines, as our reference system for comparing its performance when trained on T-O-M, against the same WSD system trained on other sense-annotated corpora (i.e., SemCor and OMSTI). Following the WSD literature, unless stated otherwise, we report performance in terms of F1, i.e., the harmonic mean of precision and recall. We note that it is not the purpose of this paper to show that T-O-M, when integrated into IMS, beats all other configurations or alternative systems, but rather to fully automatize the WSD pipeline with performances which are competitive with the state of the art.

Baseline. As a traditional baseline in WSD, we used the Most Frequent Sense (MFS) baseline given by the first sense in WordNet. The MFS is a very competitive baseline, due to the sense skewness phenomenon in language (Navigli, 2009).

Number of training sentences per sense. Given a target word w, we sorted its senses Senses(w) following the WordNet ordering and selected the top k_i training sentences for the i-th sense according to Formula 6, where:

$$k_i = \frac{1}{i^z} \cdot K \qquad (8)$$

with K = 500 and z = 2, which were tuned on a separate small in-house development dataset⁵.

⁵ 50 word-sense pairs annotated manually.

5 Results

5.1 Impact of syntagmatic relations

The first result we report regards the impact of vanilla WordNet vs. our WordNet-induced subgraph of BabelNet (WordNet_BN) when calculating PPR vectors. As can be seen from Table 2, which shows the performance of the T-O-M_Wiki corpora generated with the two semantic networks, using WordNet for PPR computation decreases the overall performance of IMS from 0.5 to around 4 points across the five datasets, with an overall loss of 1.6 F1 points. Similar performance losses were observed when using T-O-M_UN (see Table 3). This corroborates our hunch discussed in Section 3 that a resource like BabelNet can contribute important syntagmatic relations that are beneficial for identifying (and ranking high) sentences which are semantically relevant for the target word sense. In the following experiments, we report only results using WordNet_BN.

| Dataset | T-O-M_Wiki (BN) | T-O-M_Wiki (WN) |
|---|---|---|
| Senseval-2 | 70.5 | 70.0 |
| Senseval-3 | 67.4 | 63.1 |
| SemEval-07 | 59.8 | 57.9 |
| SemEval-13 | 65.5 | 63.7 |
| SemEval-15 | 68.6 | 69.5 |
| ALL | 67.3 | 65.7 |

Table 2: F1 of IMS trained on T-O-M when PPR is obtained from the WordNet graph (WN) and from the WordNet-induced subgraph of BabelNet (BN).

5.2 Comparison against sense-annotated corpora

We now move to comparing the performance of T-O-M, which is fully automatic, against corpora which are annotated manually (SemCor) and semi-automatically (OMSTI). In Table 3 we show the F1-score of IMS on each gold standard dataset in the evaluation framework and on all datasets merged together (last row), when it is trained with the various corpora described above.

5.3 Performance without backoff strategy

IMS uses the MFS as a backoff strategy when no sense can be output for a target word in context (Zhong and Ng, 2010). Consequently, the performance of the MFS is mixed up with that of the SVM classifier. As shown in Table 4, OMSTI is able to provide annotated sentences for roughly half of the tokens in the datasets. Train-O-Matic, on the other hand, is able to cover almost all words in each dataset with at least one training sentence. This means that in around 50% of cases OMSTI gives an answer based on the IMS backoff strategy. To determine the real impact of the different training data, we therefore decided to perform an additional analysis of the IMS performance when the MFS backoff strategy is disabled. Because we suspected the system would not always return a sense for each target word, in this experiment we measured precision, recall and their harmonic mean, i.e., F1. The results in Table 5 confirm our hunch, showing that OMSTI’s recall drops heavily, thereby affecting F1 considerably. T-O-M performances, instead, remain high in terms of precision, recall and F1. This confirms that OMSTI relies heavily on data (those obtained for the MFS and from SemCor) that are produced manually, rather than semi-automatically.

5.4 Domain-oriented WSD

To further inspect the ability of T-O-M to enable disambiguation in different domains, we decided to evaluate on specific documents from the various gold standard datasets which could be clearly assigned a domain label. Specifically, we tested on 13 SemEval-13 documents from various domains⁶ and 2 SemEval-15 documents (namely, maths & computers, and biomedicine) and carried out two separate tests and evaluations of T-O-M on each domain: once using the MFS backoff strategy, and once not using it.
In Tables 6 and 7 we report the results of both T-O-M_Wiki and T-O-M_UN to determine the impact of the corpus type. As can be seen in the tables, T-O-M_Wiki systematically attains higher scores than OMSTI (except for the biology domain), and, in most cases, attains higher scores than MFS when the backoff is used, with a drastic, systematic increase over OMSTI with both Train-O-Matic configurations in recall and F1 when the backoff strategy is disabled. This demonstrates the usefulness of the corpora annotated by Train-O-Matic not only on open text, but also on specific domains. We note that T-O-M_UN obtains the best results in the politics domain, which is the closest domain to the UN corpus from which its training sentences are obtained.

⁶ Namely biology, climate, finance, health care, politics, social issues and sport.

| Domain | Backoff | T-O-M_Wiki P | T-O-M_Wiki R | T-O-M_Wiki F1 | T-O-M_UN P | T-O-M_UN R | T-O-M_UN F1 | OMSTI P | OMSTI R | OMSTI F1 | SemCor F1 | MFS F1 | Size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Biology | MFS | 63.0 | 63.0 | 63.0 | 65.9 | 65.9 | 65.9 | 65.9 | 65.9 | 65.9 | 66.3 | 64.4 | 135 |
| Biology | - | 59.0 | 53.3 | 56.0 | 62.3 | 56.3 | 59.2 | 48.1 | 18.5 | 26.7 | | | |
| Climate | MFS | 68.1 | 68.1 | 68.1 | 63.4 | 63.4 | 63.4 | 68.0 | 68.0 | 68.0 | 70.1 | 67.5 | 194 |
| Climate | - | 63.4 | 50.0 | 55.9 | 57.5 | 45.4 | 50.7 | 58.0 | 24.2 | 34.2 | | | |
| Finance | MFS | 68.0 | 68.0 | 68.0 | 56.6 | 56.6 | 56.6 | 64.4 | 64.4 | 64.4 | 63.7 | 56.2 | 219 |
| Finance | - | 62.1 | 51.6 | 56.4 | 48.4 | 40.2 | 43.9 | 57.4 | 28.3 | 37.9 | | | |
| Health Care | MFS | 65.2 | 65.2 | 65.2 | 60.1 | 60.1 | 60.1 | 52.9 | 52.9 | 52.9 | 62.7 | 56.5 | 138 |
| Health Care | - | 61.3 | 55.1 | 58.0 | 55.6 | 50.0 | 52.6 | 34.6 | 18.4 | 24.0 | | | |
| Politics | MFS | 65.2 | 65.2 | 65.2 | 66.3 | 66.3 | 66.3 | 63.4 | 63.4 | 63.4 | 69.5 | 67.7 | 279 |
| Politics | - | 62.5 | 54.8 | 58.4 | 63.9 | 55.9 | 59.6 | 54.1 | 21.5 | 30.8 | | | |
| Social Issues | MFS | 68.5 | 68.5 | 68.5 | 63.6 | 63.6 | 63.6 | 65.6 | 65.6 | 65.6 | 66.8 | 67.6 | 349 |
| Social Issues | - | 63.1 | 53.0 | 57.6 | 57.2 | 47.9 | 52.1 | 54.7 | 25.2 | 34.5 | | | |
| Sport | MFS | 60.3 | 60.3 | 60.3 | 60.9 | 60.9 | 60.9 | 58.8 | 58.8 | 58.8 | 60.4 | 57.6 | 330 |
| Sport | - | 58.3 | 54.6 | 56.4 | 58.1 | 53.3 | 55.5 | 45.0 | 23.0 | 30.4 | | | |

Table 6: Performance comparison over SemEval-2013 domain-specific datasets.

| Domain | Backoff | T-O-M_Wiki P | T-O-M_Wiki R | T-O-M_Wiki F1 | T-O-M_UN P | T-O-M_UN R | T-O-M_UN F1 | OMSTI P | OMSTI R | OMSTI F1 | SemCor F1 | MFS F1 | Size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Biomedicine | MFS | 76.3 | 76.3 | 76.3 | 66.0 | 66.0 | 66.0 | 64.9 | 64.9 | 64.9 | 70.3 | 71.1 | 100 |
| Biomedicine | - | 76.1 | 72.2 | 74.1 | 64.4 | 59.8 | 62.0 | 60.5 | 26.8 | 37.2 | | | |
| Maths & Computer | MFS | 50.0 | 50.0 | 50.0 | 48.0 | 48.0 | 48.0 | 36.0 | 36.0 | 36.0 | 40.6 | 40.9 | 97 |
| Maths & Computer | - | 50.0 | 47.0 | 48.5 | 47.8 | 44.0 | 45.8 | 21.2 | 11.0 | 14.5 | | | |

Table 7: Performance comparison over the Biomedical and Maths & Computer domains in SemEval-15.

| Dataset | Train-O-Matic_Wiki | Train-O-Matic_UN | OMSTI | SemCor | MFS |
|---|---|---|---|---|---|
| Senseval-2 | 70.5 | 69.0 | 74.1 | 76.8 | 72.1 |
| Senseval-3 | 67.4 | 68.3 | 67.2 | 73.8 | 72.0 |
| SemEval-07 | 59.8 | 57.9 | 62.3 | 67.3 | 65.4 |
| SemEval-13 | 65.5 | 62.5 | 62.8 | 65.5 | 63.0 |
| SemEval-15 | 68.6 | 63.5 | 63.1 | 66.1 | 66.3 |
| ALL | 67.3 | 65.3 | 66.4 | 70.4 | 67.6 |

Table 3: F1 of IMS trained on Train-O-Matic, OMSTI and SemCor, and MFS for the Senseval-2, Senseval-3, SemEval-07, SemEval-13 and SemEval-15 datasets.

As can be seen, T-O-M_Wiki and T-O-M_UN obtain higher performance than OMSTI (up to 5.5 points above) on 3 out of 5 datasets, and, overall, T-O-M_Wiki scores 1 point above OMSTI. The MFS is in the same ballpark as T-O-M_Wiki, performing better on some datasets and worse on others. We note that IMS trained on T-O-M_Wiki succeeds in surpassing or obtaining the same results as IMS trained on SemCor on SemEval-15 and SemEval-13. We view this as a significant achievement given the total absence of manual effort involved in T-O-M. Because overall T-O-M_Wiki outperforms T-O-M_UN, in what follows we report all the results with T-O-M_Wiki, except for the domain-oriented evaluation (see Section 5.4).

| Dataset | OMSTI | Train-O-Matic | Total |
|---|---|---|---|
| Senseval-2 | 469 | 1005 | 1066 |
| Senseval-3 | 494 | 860 | 900 |
| SemEval-07 | 89 | 159 | 159 |
| SemEval-13 | 757 | 1428 | 1644 |
| SemEval-15 | 249 | 494 | 531 |
| ALL | 2058 | 3946 | 4300 |

Table 4: Number of nominal tokens for which at least one training example is provided by OMSTI or Train-O-Matic for each dataset.

| Dataset | OMSTI P | OMSTI R | OMSTI F1 | Train-O-Matic P | Train-O-Matic R | Train-O-Matic F1 |
|---|---|---|---|---|---|---|
| Senseval-2 | 64.8 | 28.5 | 39.6 | 69.5 | 65.5 | 67.4 |
| Senseval-3 | 55.7 | 31.0 | 39.8 | 66.1 | 63.1 | 64.6 |
| SemEval-07 | 64.1 | 35.9 | 46.0 | 59.8 | 59.8 | 59.8 |
| SemEval-13 | 50.7 | 23.4 | 32.0 | 61.3 | 53.3 | 57.0 |
| SemEval-15 | 57.0 | 26.7 | 36.4 | 67.0 | 62.3 | 64.6 |
| ALL | 56.5 | 27.0 | 36.5 | 65.1 | 59.7 | 62.3 |

Table 5: Precision, Recall and F1 of IMS trained on OMSTI and Train-O-Matic corpus without MFS backoff strategy for Senseval-2, Senseval-3, SemEval-07, SemEval-13 and SemEval-15.

6 Scaling up to Multiple Languages

Experimental Setup. In this section we investigate the ability of Train-O-Matic to scale to low-resourced languages, such as Italian and Spanish, for which training data for WSD is not available. Thanks to BabelNet, in fact, Train-O-Matic can be used to generate sense-annotated data for any language supported by the knowledge base. Thus, in order to build new training datasets for the two languages, we ran Train-O-Matic on their corresponding versions of Wikipedia, then we tuned the two parameters K and z on an in-house development dataset⁷. In contrast to the English setting, in order to calculate Formula 8 we sorted the senses of each word by vertex degree. Finally we used the output data to train IMS.

⁷ We set K = 100 and z = 2.3 for Spanish and K = 100 and z = 2.5 for Italian.

Results. To perform our evaluation we chose the most recent multilingual task (SemEval 2015 task 13) which includes gold data for Italian and Spanish. As can be seen from Table 8, Train-O-Matic enabled IMS to perform better than the best participating system (Manion and Sainudiin, 2014, SUDOKU) in all three settings (All domains, Maths & Computer and Biomedicine). Its performance was, in fact, 1 to 3 points higher, with a 6-point peak on Maths & Computer in Spanish and on Biomedicine in Italian. This demonstrates the ability of Train-O-Matic to enable supervised WSD systems to surpass state-of-the-art knowledge-based WSD approaches in low-resourced languages without relying on manually curated data for training.

| Language | Dataset | Best System F1 | Train-O-Matic P | Train-O-Matic R | Train-O-Matic F1 |
|---|---|---|---|---|---|
| Italian | ALL | 56.6 | 65.1 | 55.6 | 59.9 |
| Italian | Computers & Math | 46.6 | 52.7 | 43.3 | 47.6 |
| Italian | Biomedicine | 65.9 | 76.6 | 67.6 | 71.8 |
| Spanish | ALL | 56.3 | 61.3 | 54.8 | 57.9 |
| Spanish | Computers & Math | 42.4 | 53.3 | 44.4 | 48.5 |
| Spanish | Biomedicine | 65.5 | 71.8 | 65.5 | 68.5 |

Table 8: Performance comparison between T-O-M and SemEval-2015’s best SUDOKU Run.

7 Related Work

There are two mainstream approaches to Word Sense Disambiguation: supervised and knowledge-based approaches. Both suffer in different ways from the so-called knowledge acquisition bottleneck, that is, the difficulty in obtaining an adequate amount of lexical-semantic data: for training in the case of supervised systems, and for enriching semantic networks in the case of knowledge-based ones (Pilehvar and Navigli, 2014; Navigli, 2009).

State-of-the-art supervised systems include Support Vector Machines such as IMS (Zhong and Ng, 2010) and, more recently, LSTM neural networks with attention and multitask learning (Raganato et al., 2017b) as well as LSTMs paired with nearest neighbours classification (Melamud et al., 2016; Yuan et al., 2016). The latter also integrates a label propagation algorithm in order to enrich the sense-annotated dataset. The main difference from our approach is its need for a manually annotated dataset to start the label propagation algorithm, whereas Train-O-Matic is fully automatic. An evaluation against this system would have been interesting, but neither the proprietary training data nor the code are available at the time of writing. In order to generalize effectively, these supervised systems require large numbers of training instances annotated with senses for each target word occurrence. Overall, this amounts to millions of training instances for each language of interest, a number that is not within reach for any language. In fact, no supervised system has been submitted in major multilingual WSD competitions for languages other than English (Navigli et al., 2013; Moro and Navigli, 2015).

To overcome this problem, new methodologies have recently been developed which aim to create sense-tagged corpora automatically. Raganato et al. (2016) developed 7 heuristics to grow the number of hyperlinks in Wikipedia pages. Otegi et al. (2016) applied a different disambiguation pipeline for each language to parallel text in Europarl (Koehn, 2005) and QTLeap (Agirre et al., 2015) in order to enrich them with semantic annotations. Taghipour and Ng (2015), the work closest to ours, exploits the alignment from English to Chinese sentences of the United Nations Parallel Corpus (Ziemski et al., 2016) to reduce the ambiguity of English words and sense-tag English sentences. The assumption is that the second language is less ambiguous than the first one and that hand-made translations of senses are available for each WordNet synset. This approach is, therefore, semi-automatic and relies on certain assumptions, in contrast to Train-O-Matic which is, instead, fully automatic and can be applied to any kind of corpus (and language) depending on the specific need. Earlier attempts at the automatic extraction of training samples were made by Agirre and De Lacalle (2004) and Fernández et al. (2004). Both exploited the monosemous relatives method (Leacock et al., 1998) in order to retrieve sentences from the Web which contained a given monosemous noun or a relative monosemous word (e.g., a synonym, a hypernym, etc.). As can be seen in (Fernández et al., 2004) this approach can lead to the retrieval of very accurate examples, but its main drawback lies in the number of senses covered. In fact, for all those synsets that do not have any monosemous relative, the system is unable to retrieve examples, thus heavily affecting the performance in terms of recall and F1.

Knowledge-based WSD, instead, bypasses the heavy requirement of sense-annotated corpora by applying algorithms that exploit a general-purpose semantic network, such as WordNet, which encodes the relational information that interconnects synsets via different kinds of relation. Approaches include variants of Personalized PageRank (Agirre et al., 2014) and densest subgraph approximation algorithms (Moro et al., 2014) which, thanks to the availability of multilingual resources such as BabelNet, can easily be extended to perform WSD in arbitrary languages. Other approaches to knowledge-based WSD exploit the definitional knowledge contained in a dictionary. The Lesk algorithm (Lesk, 1986) and its variants (Banerjee and Pedersen, 2002; Kilgarriff and Rosenzweig, 2000; Vasilescu et al., 2004) aim to determine the correct sense of a word by comparing each word-sense definition with the context in which the target word appears. The limit of knowledge-based WSD, however, lies in the absence of mechanisms that can take into account the very local context of a target word occurrence, including non-content words such as prepositions and articles. Furthermore, recent studies seem to suggest that such approaches are barely able to surpass supervised WSD systems when they enrich their networks starting from a comparable amount of annotated data (Pilehvar and Navigli, 2014). With T-O-M, rather than further enriching an existing semantic network, we exploit the information available in the network to annotate raw sentences with sense information and train a state-of-the-art supervised WSD system without task-specific human annotations.

8 Conclusion

In this paper we presented Train-O-Matic, a novel approach to the automatic construction of large training sets for supervised WSD in an arbitrary language.
Train-O-Matic removes the burden of manual intervention by leveraging the structural semantic information available in the WordNet graph enriched with additional relational information from BabelNet, and achieves performance competitive with that of semi-automatic approaches and, in some cases, of manually-curated training data. T-O-M was shown to provide training data for virtually all the target ambiguous nouns, in marked contrast to alternatives like OMSTI, which covers in many cases around half of the tokens, resorting to the MFS otherwise. Moreover, Train-O-Matic has proven to scale well to low-resourced languages, for which no manually annotated dataset exists, surpassing the current state of the art of knowledge-based systems.

We believe that the ability of T-O-M to overcome the current paucity of annotated data for WSD, coupled with video games with a purpose for validation purposes (Jurgens and Navigli, 2014; Vannella et al., 2014), paves the way for high-quality multilingual supervised WSD. All the training corpora, including approximately one million sentences which cover English, Italian and Spanish, are made available to the community at http://trainomatic.org.

As future work we plan to extend our approach to verbs, adjectives and adverbs. Following Bennett et al. (2016) we will also experiment on more realistic estimates of P(s|w) in Formula 5 as well as other assumptions made in our work.

Acknowledgments

The authors gratefully acknowledge the support of the ERC Consolidator Grant MOUSSE No. 726487.

References

Eneko Agirre, António Branco, Martin Popel, and Kiril Simov. 2015. Europarl QTLeap WSD/NED corpus. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University in Prague.
Eneko Agirre and Oier Lopez De Lacalle. 2004. Publicly available topic signatures for all wordnet nominal senses. In LREC.
Eneko Agirre, Oier Lopez de Lacalle, and Aitor Soroa. 2014. Random walks for knowledge-based word sense disambiguation. Computational Linguistics, 40(1):57–84.
Satanjeev Banerjee and Ted Pedersen. 2002. An adapted Lesk algorithm for word sense disambiguation using wordnet. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 136–145. Springer.
Andrew Bennett, Timothy Baldwin, Jey Han Lau, Diana McCarthy, and Francis Bond. 2016. Lexsemtm: A semantic dataset based on all-words unsupervised sense distribution learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), pages 1513–1524, Berlin, Germany.
Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1–7):107–117.
José Camacho-Collados, Mohammad Taher Pilehvar, and Roberto Navigli. 2015. Nasari: a novel approach to a semantically-aware representation of items. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 567–577, Denver, Colorado. Association for Computational Linguistics.
Philip Edmonds and Scott Cotton. 2001. Senseval-2: overview. In The Proceedings of the Second International Workshop on Evaluating Word Sense Disambiguation Systems, pages 1–5. Association for Computational Linguistics.
Christiane Fellbaum, editor. 1998. WordNet: An Electronic Database. MIT Press, Cambridge, MA.
Juan Fernández, Mauro Castillo Valdés, German Rigau Claramunt, Jordi Atserias Batalla, and Jordi Tormo. 2004. Automatic acquisition of sense examples using exretriever. In IBERAMIA Workshop on Lexical Resources and The Web for Word Sense Disambiguation, pages 97–104.
David Jurgens and Roberto Navigli. 2014. It’s All Fun and Games until Someone Annotates: Video Games with a Purpose for Linguistic Annotation. Transactions of the Association for Computational Linguistics (TACL), 2:449–464.
Adam Kilgarriff and Joseph Rosenzweig. 2000. Framework and results for english SENSEVAL. Computers and the Humanities, 34(1–2):15–48.
Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT summit, volume 5, pages 79–86.
Claudia Leacock, George A Miller, and Martin Chodorow. 1998. Using corpus statistics and wordnet relations for sense identification. Computational Linguistics, 24(1):147–165.
Michael Lesk. 1986. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the 5th Annual Conference on Systems Documentation, Toronto, Ontario, Canada, pages 24–26.
Peter A Lofgren, Siddhartha Banerjee, Ashish Goel, and C Seshadhri. 2014. Fast-ppr: Scaling personalized pagerank estimation for large graphs. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1436–1445. ACM.
Steve L Manion and Raazesh Sainudiin. 2014. An iterative “sudoku style” approach to subgraph-based word sense disambiguation. In Proceedings of the Third Joint Conference on Lexical and Computational Semantics (*SEM 2014), pages 40–50, Dublin, Ireland.
Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning generic context embedding with bidirectional lstm. In Proceedings of CONLL, pages 51–61.
George A. Miller, Claudia Leacock, Randee Tengi, and Ross Bunker. 1993. A semantic concordance. In Proceedings of the 3rd DARPA Workshop on Human Language Technology, pages 303–308, Plainsboro, N.J.
Andrea Moro and Roberto Navigli. 2015. Semeval-2015 task 13: Multilingual all-words sense disambiguation and entity linking. Proc. of SemEval-2015.
Andrea Moro, Alessandro Raganato, and Roberto Navigli. 2014. Entity Linking meets Word Sense Disambiguation: a Unified Approach. Transactions of the Association for Computational Linguistics (TACL), 2:231–244.
Roberto Navigli. 2009. Word Sense Disambiguation: A survey. ACM Computing Surveys, 41(2):1–69.
Roberto Navigli, David Jurgens, and Daniele Vannella. 2013. Semeval-2013 task 12: Multilingual word sense disambiguation. In Proceedings of the 7th International Workshop on Semantic Evaluation (SemEval 2013), in conjunction with the Second Joint Conference on Lexical and Computational Semantics (*SEM 2013), pages 222–231, Atlanta, USA.
Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250.
Arantxa Otegi, Nora Aranberri, Antonio Branco, Jan Hajic, Steven Neale, Petya Osenova, Rita Pereira, Martin Popel, Joao Silva, Kiril Simov, et al. 2016. Qtleap wsd/ned corpora: Semantic annotation of parallel corpora in six languages. In Proceedings of the 10th Language Resources and Evaluation Conference, LREC, pages 3023–3030.
Mohammad Taher Pilehvar and Nigel Collier. 2016. De-conflated semantic representations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1680–1690, Austin, TX.
Mohammad Taher Pilehvar, David Jurgens, and Roberto Navigli. 2013. Align, disambiguate and walk: A unified approach for measuring semantic similarity. In Proceedings of ACL, pages 1341–1351.
Mohammad Taher Pilehvar and Roberto Navigli. 2014. A large-scale pseudoword-based evaluation framework for state-of-the-art word sense disambiguation. Computational Linguistics, 40(4):837–881.
Sameer S Pradhan, Edward Loper, Dmitriy Dligach, and Martha Palmer. 2007. Semeval-2007 task 17: English lexical sample, srl and all words. In Proceedings of the 4th International Workshop on Semantic Evaluations, pages 87–92.
Alessandro Raganato, Jose Camacho-Collados, and Roberto Navigli. 2017a. Word Sense Disambiguation: A Unified Evaluation Framework and Empirical Comparison. In Proceedings of EACL, pages 99–110, Valencia, Spain.
Alessandro Raganato, Claudio Delli Bovi, and Roberto Navigli. 2016. Automatic Construction and Evaluation of a Large Semantically Enriched Wikipedia. In Proceedings of IJCAI, pages 2894–2900, New York City, NY, USA.
Alessandro Raganato, Claudio Delli Bovi, and Roberto Navigli. 2017b. Neural sequence learning models for word sense disambiguation. In Proceedings of Empirical Methods in Natural Language Processing, Copenhagen, Denmark.
Benjamin Snyder and Martha Palmer. 2004. The english all-words task. In Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, pages 41–43, Barcelona, Spain.
Kaveh Taghipour and Hwee Tou Ng. 2015. One million sense-tagged instances for word sense disambiguation and induction. CoNLL 2015, page 338.
Daniele Vannella, David Jurgens, Daniele Scarfini, Domenico Toscani, and Roberto Navigli. 2014. Validating and extending semantic knowledge bases using video games with a purpose. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), pages 1294–1304, Baltimore, Maryland. Association for Computational Linguistics.
Florentina Vasilescu, Philippe Langlais, and Guy Lapalme. 2004. Evaluating variants of the lesk approach for disambiguating words. In Proceedings of LREC, Lisbon, Portugal.
Dayu Yuan, Julian Richardson, Ryan Doherty, Colin Evans, and Eric Altendorf. 2016. Semi-supervised word sense disambiguation with neural models. Proceedings of COLING, pages 1374–1385.
Zhi Zhong and Hwee Tou Ng. 2010. It makes sense: A wide-coverage word sense disambiguation system for free text. In Proceedings of the ACL 2010 System Demonstrations, pages 78–83, Uppsala, Sweden. Association for Computational Linguistics.
Michał Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. 2016. The United Nations parallel corpus v1.0. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portoroz, Slovenia. European Language Resources Association (ELRA).
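
As a worked illustration of how the precision (P), recall (R) and F1 columns in Table 7 relate to one another, the following minimal Python sketch computes F1 as the harmonic mean of P and R and reproduces two of the no-backoff entries of that table. The function names and the raw counts used in the last example are hypothetical and are not part of the paper's evaluation code.

```python
from typing import Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def prf(correct: int, attempted: int, total: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 (as percentages) from raw counts.

    correct:   instances the system answered correctly
    attempted: instances the system gave an answer for
    total:     all gold-standard instances
    """
    precision = 100.0 * correct / attempted if attempted else 0.0
    recall = 100.0 * correct / total if total else 0.0
    return precision, recall, f1_score(precision, recall)


# P and R taken from the no-backoff T-O-M Wiki rows of Table 7.
print(round(f1_score(76.1, 72.2), 1))  # Biomedicine: 74.1
print(round(f1_score(50.0, 47.0), 1))  # Maths & Computer: 48.5

# Hypothetical counts: a system that answers every instance
# (attempted == total) has identical precision, recall and F1.
print(prf(correct=60, attempted=100, total=100))  # (60.0, 60.0, 60.0)
```

The last call mirrors the MFS-backoff rows of Table 7: once a backoff strategy guarantees an answer for every instance, the number of attempted instances equals the total, so the three scores coincide. Without backoff, uncovered instances lower recall while leaving precision largely unchanged.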