Interview with Dan Tufis

IATE is a must-use resource in dealing with multilingual documents for universal access to public services across language barriers

Dan Tufis is a Member of the Romanian Academy and has been Head of its Research Institute for Artificial Intelligence (ICIA) since 2003. He is also an Honorary Professor of the Alexandru Ioan Cuza University of Iași and President of the Commission of the Romanian Academy for Romanian Language Technologies.

His main fields of interest are: corpus linguistics, natural language processing and machine translation.

Mr Tufis has created and coordinated the development of numerous processing tools and language resources for Romanian and other languages used by many researchers and projects worldwide. He has authored or co-authored more than 200 scientific papers for conference proceedings, peer-reviewed journals and book chapters, and edited 14 volumes. He has also participated in more than 35 projects on language technologies, both national and international (mainly EU-funded).

 

1. How important is terminology in natural language processing (NLP) applications such as machine translation or question answering?

This is a rhetorical question, I guess, because I don’t think you would find anyone working in NLP who would deny the crucial role of terminology in MT or QA. Some years ago, I read the results of a questionnaire on user satisfaction regarding the quality of automatic translation. The translation errors considered to be the most annoying were not related to morphology or syntax, but to the meaning and wrong translation of terms.

It is not surprising that so many projects have been dedicated to multilingual term extraction and its applications in processing domain-specific documents. A multi-word term is a special case of a multi-word expression, and we have recently seen a keen interest in this direction (see, for instance, the PARSEME COST action).

2. IATE is a public terminology database created and maintained by the terminology coordination unit. It comprises terms in the 24 languages used in the EU institutions. The terms are linked between languages and have context and definitions. They also have super-subordinate relations and they are linked to the documents in which they appear. Do you think this resource would be a useful tool for NLP?

Definitely! Yes! Actually, we keep an eye on IATE (we have already downloaded the IATE_download_12052015) and we have also successfully used the EuroVoc thesaurus in the past for the alignment of the JRC-Acquis corpus. IATE is a must-use resource in dealing with multilingual documents for universal access to public services across language barriers. In the context of the emerging Digital Single Market and the recent initiative of the Connecting Europe Facility’s Automated Translation platform (CEF.AT), supported by the European Language Resource Coordination (ELRC), IATE is expected to play a major role in ensuring high-quality translation services to allow Europe’s citizens and businesses to operate freely across language barriers.

As you probably know, the national contact points, members of the ELRC, will be responsible for setting up a permanent Language Resource Coordination mechanism that will feed the CEF Automated Translation DSI with relevant language resources in all official languages of the EU and CEF-associated countries, in order to improve the quality, coverage and performance of automated translation systems and solutions in the context of current and future CEF digital services. IATE and other similar resources will play an essential role in judging the relevance of the collection of data to be sent to CEF.AT.

At ICIA, IATE will be used both for the objectives of the ELRC and for an envisaged partnership with the Ministry of Justice.

3. Can you put the ICIA’s achievements in the field of computational linguistics into context? How does it link fields such as artificial intelligence and linguistics?

ICIA, established in 1994, is the Research Institute for Artificial Intelligence of the Romanian Academy. Besides conducting advanced research, ICIA is an active knowledge dissemination centre, being the organiser of various national and international conferences, workshops, summer schools and seminars to which renowned speakers are invited. It also runs MSc and PhD programmes.

The results obtained at ICIA in POS tagging, chunking, word alignment, parsing, word sense disambiguation, information retrieval, language learning, question answering, wordnet development, ontology-based language processing, text-to-speech generation and speech-to-speech translation have received international recognition. They have been discussed at major conferences (IJCAI, COLING, EACL, ACL, LREC, ECAI, CLEF, CONLL, IWSLT, Blizzard, etc.), featured in the most important journals in the field (AI Magazine, Computers and the Humanities, International Journal on Speech Technology, Journal of Decision Support Systems, Language Resources and Evaluation, International Journal on Information and Control, etc.), and cited and used by numerous researchers all over the world.

ICIA’s Natural Language Processing (NLP) group, the largest at the institute, has developed both state-of-the-art language technology tools (tokenisers, automatic diacritics restoration, taggers, lemmatisers, chunkers, dependency parsers, sentence and word aligners, statistical language and translation model builders, multilingual TTSs, etc.) and reference language resources for Romanian: Ro-WordNet (aligned to Princeton WordNet 3.0 and containing almost 60 000 synsets), the largest wordform lexicon for Romanian (Multext-East compliant and with more than 1 400 000 wordforms), the largest annotated parallel corpora which include Romanian (more than 50 000 000 lexical items, POS-tagged and lemmatised, XML-encoded and XCES-compliant), the full computational description of Romanian morphology, speech corpora of standard Romanian, etc. One of the largest projects currently being carried out at ICIA, in partnership with the Institute of Theoretical Informatics of the Romanian Academy, is dedicated to the creation of the IPR-cleared Reference Corpus for Contemporary Romanian (CoRoLa). The project started in 2014 and the first version of the Reference Corpus is to be released for public consultation in 2017. The joint team of the two institutes of the Romanian Academy is supported by reputed linguists from the Romanian Academy and also by more than 30 PhD and MSc students in linguistics from the Alexandru Ioan Cuza University of Iași, the University ‘Politehnica’ of Bucharest and the University of Bucharest.

4. Can you tell us what the similarities would be between IATE, a terminology database, and the European versions of Princeton WordNet?

Both multilingual resources have a hierarchical structure and a classification of entries, and both contain sense equivalents. Although both are intended to deal with language processing, content-wise they have different coverage: while IATE (and any other multilingual terminological dictionary) is a multilingual collection of terms in specialised domains, the European versions of Princeton WordNet cover general language, and are very unlikely to contain most of the terms one could find in a terminological thesaurus. Given that any specialised text contains not only domain-specific terms but also a lot of general language, the marriage between a wordnet-like ontology and an IATE-like thesaurus is a very wise decision when trying to process a collection of documents of interest in depth.

With the domain classification of the terms in a terminological resource and synsets in a wordnet-like ontology (via domain labels or SUMO/MILO categories) one could restrict the vocabulary (and thus limit the word polysemy and speed up processing) to words or expressions which are highly relevant for the domain of interest.
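The vocabulary restriction described above can be illustrated with a minimal sketch. It assumes a toy sense inventory in which every sense carries a single domain label; the words, sense identifiers, labels and function names are all hypothetical and are not taken from IATE, Ro-WordNet or any ICIA tool.

```python
# Hypothetical toy sense inventory: word -> list of (sense_id, domain_label).
# Real resources (wordnet domain labels, SUMO/MILO categories) are far richer.
SENSES = {
    "court": [("court#1", "law"), ("court#2", "sport"), ("court#3", "royalty")],
    "sentence": [("sentence#1", "law"), ("sentence#2", "linguistics")],
    "serve": [("serve#1", "sport"), ("serve#2", "food")],
}


def restrict_to_domain(senses, domain):
    """Keep only the senses labelled with the domain of interest.

    Words with no sense in that domain are dropped from the vocabulary.
    """
    restricted = {}
    for word, candidates in senses.items():
        kept = [sense_id for sense_id, label in candidates if label == domain]
        if kept:
            restricted[word] = kept
    return restricted


def total_senses(inventory):
    """Count the candidate senses a disambiguation step would have to consider."""
    return sum(len(candidates) for candidates in inventory.values())


legal = restrict_to_domain(SENSES, "law")
print(legal)
print(total_senses(SENSES), "->", total_senses(legal))
```

In this toy setting, restricting to the legal domain leaves each remaining word with a single candidate sense, which is the reduction in polysemy (and in processing effort) that the answer refers to.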

 

5. How do you view the role of the EU’s language policies? Do they encourage the development of multilingual resources for automatic processing of natural language?

Unfortunately, the last few years have seen few EU calls for language-oriented projects. At national level the situation is even worse, and several regional initiatives may lose momentum. Concerted projects such as T4ME, MetaNord, Cesar and MetaNet4U have generated a lot of resources, distributed via Meta-Share central and local platforms. In the absence of support, the maintenance and updating of these contributions is fading (the commitments for keeping up the services were for two years, a term which will expire this year). It is true that local authorities were expected to get more involved in supporting these initiatives, which didn’t happen, at least not in Romania.

However, the European Commission programme aiming at supporting a truly multilingual Europe and Digital Single Market brings language resources and tools back into the focus of attention. The European Language Resource Coordination project (ELRC) was created to set up a permanent Language Resource Coordination mechanism that will feed the CEF Automated Translation DSI with relevant language resources in all official languages of the EU and CEF-associated countries, in order to improve the quality, coverage and performance of automated translation systems and solutions in the context of current and future CEF digital services. This is the largest ever worldwide effort to collect public service data to support the multilingualism of citizens and services. The initial phase of the ELRC project involves raising the awareness of local authorities as regards the importance of their involvement and close cooperation with local experts in this endeavour.

 

6. What were the challenges involved in developing the Romanian Wordnet? What linguistic resources were used and how many linguists collaborated in its development?

Romanian Wordnet was based on several Romanian reference dictionaries: the Academy’s Explanatory Dictionary (DEX), the Academy’s Orthographic Dictionary, the Dictionary of Antonyms (also developed at the Romanian Academy), the reference Romanian-English dictionary (Andrei Bantaș), and also the bilingual lexicons extracted by our alignment tools. The final lexicons were hand-validated by experienced linguists who worked on the project during its lifecycle. Three linguists on ICIA’s staff worked on the project, along with many MSc and PhD students enrolled in Computational Linguistics programmes. I would say that during the 14 years of its development, Ro-Wordnet benefited from the expertise of more than 20 professional linguists.

7. Considering your experience in the field of alignment of multilingual lexical ontologies, do you see a possible benefit for NLP tools in aligning IATE with Wordnet and other existing thesauri?

This idea is already being used by many projects or applications, even if it does not involve Wordnet and IATE as such. As I said before, more often than not the entries in a terminological resource are not found in a general lexical resource. So it is more a problem of merging the entries from the two resources. Alignment, whenever an entry exists in both resources, should be negotiated in terms of precedence depending on the specificity of the documents to be processed. This is not a trivial task, and it is usually done dynamically depending on the issue to be resolved (combination of language / translation models).
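A minimal sketch of the merging-with-precedence idea follows. It assumes each resource can be reduced to a plain source-term to translation mapping, and the entries, the function name and the `domain_specific` flag are hypothetical. Real systems, as noted above, typically settle precedence dynamically when combining language and translation models rather than with a static dictionary merge.

```python
# Hypothetical toy entries (English -> Romanian), for illustration only.
TERMINOLOGY = {"directive": "directivă", "acquis": "acquis comunitar"}
GENERAL_LEXICON = {"directive": "instrucțiune", "house": "casă"}


def merge_lexicons(terminology, general, domain_specific=True):
    """Merge two lexicons into one.

    Entries found in only one resource are simply copied over. When an entry
    exists in both, the terminological reading wins for domain-specific
    documents and the general-language reading wins otherwise.
    """
    if domain_specific:
        preferred, fallback = terminology, general
    else:
        preferred, fallback = general, terminology
    merged = dict(fallback)
    merged.update(preferred)  # the preferred resource overrides shared entries
    return merged


legal_lexicon = merge_lexicons(TERMINOLOGY, GENERAL_LEXICON, domain_specific=True)
news_lexicon = merge_lexicons(TERMINOLOGY, GENERAL_LEXICON, domain_specific=False)
print(legal_lexicon["directive"], "|", news_lexicon["directive"])
```

Most entries here appear in only one of the two resources, so the merge mainly enlarges coverage; the precedence rule only matters for the shared entry.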

 

8. Last but not least, can you describe the involvement of the Research Institute in the development of cross-lingual lexical ontologies? How can this improve the automatic approaches to language processing? How can this serve as a translation tool (for human translation)?

The construction of the first lexical ontology for the Romanian language was a joint undertaking between our institute and the NLP group from the Alexandru Ioan Cuza University, and began in 2001 as part of the three-year BalkaNet European project. The project was a follow-up to the European project EuroWordNet, which defined the multilingual strategies in aligning different language wordnets. The BalkaNet project involved exemplary international cooperation, with the results significantly exceeding expectations.

From 2004 onwards, the task of further extending, correcting and maintaining Ro-WordNet was taken on by ICIA. The merging of the semantic lexicon with the SUMO/MILO ontology turned Ro-WordNet into a proper lexical ontology. Over the years, we have used Ro-WordNet (aligned with Princeton WordNet as well as with the other languages in BalkaNet) extensively as a primary language resource in all our multilingual projects and experiments: word alignment in parallel corpora, multilingual term extraction, cross-lingual word sense disambiguation, cross-lingual question answering, comparable corpora collection and particularly machine translation. I think that the state-of-the-art results we obtained in several evaluation shared tasks were to a large extent attributable to the quality (and size) of our lexical ontology. It is a well-known fact that the quality of an SMT system depends to a great extent on the quality of the translation equivalent table.

The wide coverage of a bilingual wordnet and the quality of the translation equivalents established via interlingual links between the synsets of the two wordnets represent a gold mine for building and enhancing robust, high-quality translation models. Although not widely used, the alignment of various proper ontologies with the Princeton WordNet has the potential to offer the support of axiomatic knowledge for deep language processing. As for the support in human translation, I can say that I use the RO-EN bilingual wordnet on a regular basis when I write my scientific papers. I often know a word and its sense in Romanian or in English but can’t remember the equivalent in the other language (I guess that most people have experienced this situation; psycholinguists call it the ‘Tip of the Tongue’ problem).


Interviewer: Raluca Caranfil

Raluca Caranfil graduated in journalism from the University of Sibiu, Romania. She worked as a journalist for various media outlets for almost 12 years, during which she gained experience as a reporter, editor, radio and TV presenter, blogger, website editor and social media administrator. She is currently doing a master’s degree at the University of Luxembourg, specialising in intercultural communication, and has experience in fields such as culture, politics, social journalism, communication and linguistics. She loves radio, travelling and playing with her dog.