Dagmar Gromann is a computer scientist and linguist currently working as a tenure-track Assistant Professor at the Centre for Translation Studies of the University of Vienna, Austria. Before that, she worked at the International Center for Computational Logic at TU Dresden, Germany, and was a post-doctoral research fellow in a Marie Curie Initial Training Network at the Artificial Intelligence Research Institute (IIIA) in Barcelona, Spain. She has worked with numerous project partners in the field of Artificial Intelligence and NLP, such as the German Research Center for Artificial Intelligence, to mention just one.
She earned her doctorate from the University of Vienna under the supervision of Prof. Gerhard Budin. Her primary research interests include ontology learning and learning structured knowledge with deep learning methods; other areas of interest involve, among other things, machine learning and cognitive theories. She has hosted, co-organized, and served on numerous scientific committees and conferences, most recently EMNLP 2020, ISWC 2020, LREC 2020, IJCAI-PRICAI 2020, and AAAI 2020. She is active in the international language technology community as National Anchor Point contact person for the European Language Resource Coordination (ELRC) and as National Competence Center main contact for the European Language Grid (ELG). She is also a management committee member and working group leader in the expert network created by the COST Action NexusLinguarum (CA18209) on Web-centred linguistic data science.
- Looking at your resume, I noticed that your professional background covers a vast array of different topics: cognitive linguistics, translation, computer science, and even business. How does this experience relate to your work in terminology?
Let me start by explaining a little about those different research interests. In fact, they developed quite naturally out of my educational background, industry experience, and the research positions I have held in the past. For instance, I completed my PhD with a grant from the Vienna University of Economics and Business, which explains the focus on terminology in the domain of finance. My educational background, however, includes linguistics and computer science. Working as a translator, I became fascinated with computational approaches to terminology. For me, combining that most central language resource, terminology, with computer science seemed like a natural fit. Therefore, I started working on, among other things, computational concept modeling and terminology extraction. After completing my thesis, I joined the Artificial Intelligence Research Institute in Barcelona, Spain, where many people worked on mathematical models of embodied cognition. Their work sparked my interest, in particular the theory of image schemas, which has a clear connection to cognitive linguistics. This robust linguistic perspective prompted me to work on embodied cognition with a colleague of mine, Maria M. Hedblom, a cognitive scientist. The research in Barcelona and the connections and contacts I made there all shared a common terminological focus. Ultimately, the aim was to utilize my computational skills for terminology work and to integrate the cognitive component to answer the question of how image schemas help to analyze differences between languages in a specialized domain.
- One of your research interests involves integrating terminologies with ontologies (i.e. ontoterminology). Could you briefly explain the differences between these two knowledge representation models? What does ‘keeping ontological and terminological dimensions both separated and linked’ in knowledge modeling mean?
Ontologies and terminologies both seek to organize, structure, and represent knowledge. Their perspective on it, however, is radically different. Ontologies are computational artifacts that formally model the structure of reality or, to put it another way, they represent relevant entities and relations observable in certain situations or events. Let us take the example of a university: What can you observe there? Who are the main actors? What are their main actions? One might have relevant entities such as students, professors, researchers, and lecturers. There are some physical entities as well, such as the lecture_hall, offices, and so on. The idea behind ontology modeling is thus – similar to terminologies – to put these entities together, abstract their properties into concepts, and relate these concepts – or ontology classes – to each other. Such relations can be hierarchical: one thing is another, as in a student is_a person, a professor is_a person, a lecture_hall is_a room. What is left is to relate these entities with non-hierarchical relations, as in professor supervises students.
Ontologies are known as formal representation systems, which means that they must be machine-readable. Consequently, one can automatically draw conclusions about new knowledge based on the knowledge that already exists. This process is called inference. It also means that one must represent knowledge in a strict way to avoid misinterpretation, as it is processed automatically. So, for instance, in our very basic relation of a professor supervising students, the relation is modeled as asymmetric or, to put it differently, one-directional: the professor supervises the student but not the other way around. This piece of information must be specified in the ontology by adding formal axioms and structures to avoid misinterpretation. As you can see, with this heavy focus on formality, natural language becomes secondary, and the main issue is to make reality and knowledge about it machine-readable. Terminologies, on the other hand, are created based on the natural language used in a specific domain. Rather than from observations of entities, events, and actions in these domains, one starts at the language level. Natural language automatically reflects how human beings perceive, measure, and understand reality, which makes terminologies a filtered version of it. This is what we call epistemology.
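The university example above can be sketched in a few lines of code. This is a minimal illustration, not from the interview itself: a toy is_a hierarchy, one non-hierarchical asymmetric relation, and a simple inference that walks the hierarchy. All names are illustrative.

```python
# Hierarchical (is_a) relations between concepts, as in the university example.
IS_A = {
    "student": "person",
    "professor": "person",
    "lecture_hall": "room",
}

# A non-hierarchical, asymmetric relation stored as (subject, relation, object).
FACTS = {("professor", "supervises", "student")}

def is_a(entity: str, concept: str) -> bool:
    """Infer hierarchical membership by walking up the is_a chain."""
    while entity is not None:
        if entity == concept:
            return True
        entity = IS_A.get(entity)
    return False

def holds(subj: str, rel: str, obj: str) -> bool:
    """Asymmetric relation: (s, r, o) does not imply (o, r, s)."""
    return (subj, rel, obj) in FACTS

print(is_a("student", "person"))                    # inferred from the hierarchy: True
print(holds("professor", "supervises", "student"))  # True
print(holds("student", "supervises", "professor"))  # asymmetry preserved: False
```

Real ontologies express such constraints with formal axioms (e.g. in OWL) rather than ad hoc code, but the principle is the same: once the knowledge is machine-readable, new facts can be inferred automatically.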
Terminologies are interested in HOW we talk about things more than in how things ARE in specific domains. In talking about these domains, we use linguistic expressions, which we then group to form concepts and relations between them. However, terminologies are not formal (or machine-readable), and hence one cannot automatically draw any conclusions from them. Also, terminology science has been somewhat weak on the definition of what a concept is. Literature and standards talk about concepts or concept systems, but they do not pin down their exact nature.
In my approach to ontology-terminology modeling, I strove to combine the strengths of both resources. For example, an ontology’s linguistic aspects can be enhanced by associating terminological information with ontological concepts. And, conversely, you can provide a formal, strongly specified concept system for terminologies by using the ontology as the concept system. One should, however, bear in mind that, since the two resources take significantly different perspectives on knowledge, one cannot simply convert a terminology into an ontology (or the other way around). They must be kept separate and intact yet interlinked, which is made possible through Semantic Web standards and specified relations.
- What advantages can be gained by integrating ontologies and terminologies? In how far can the terminology and/or ontology community benefit from it?
The advantage is a fully machine-readable resource with rich multilingual terminological information. It is something that the industry can benefit from immensely. Not only is it possible to consult the knowledge, in the sense of searching for (multilingual) information and seeing what is out there, but also to reason over the knowledge that already exists.
- Does this general ontology-terminology idea find practical application in terminology management in the industry?
Yes. For instance, major airplane manufacturers use ontologies to model requirements in airplane designs: how much space is needed for the feet, between the seats, and so on. Modeling this kind of information with an ontology-terminology is a natural choice, especially in a multilingual context, not only for the creation of (multilingual) documentation but also for reasoning over the previously collected knowledge.
- Not only modeling but also publishing, sharing, and connecting terminological resources (LLOD) as part of the Semantic Web is an interest of yours. How can terminology (or linguistic resources in general) become part of the Semantic Web? What requirements (e.g. formats) must such resources fulfill?
The starting point for Linked Data (LD) was to specify several necessary principles that an LD resource must fulfill. The same principles apply to the Linguistic Linked Open Data (LLOD) cloud and to any kind of linguistic resource published as Linked Data. The key principles include: 1) the data has to be under an open license; 2) each element in a dataset needs to be uniquely identified; 3) it should be represented in a specific web standard, usually the Resource Description Framework (RDF), though it could also be another web standard, for instance the Web Ontology Language (OWL); 4) it should be linked to an already existing resource, which gives you all the benefits of interlinking resources on the LLOD cloud.
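Principles 2-4 can be made concrete with a small sketch. This is an illustrative toy, not a real published dataset: the `example.org` URIs are placeholders, and only the outbound link points at a real DBpedia identifier used here purely as an example of "an already existing resource".

```python
triples = [
    # Principle 2: each element is uniquely identified by a URI.
    # Principle 3: the data is shaped as subject-predicate-object triples, as in RDF.
    ("http://example.org/term/ontology", "rdfs:label", '"ontology"@en'),
    # Principle 4: linking to an already existing external resource.
    ("http://example.org/term/ontology", "owl:sameAs",
     "http://dbpedia.org/resource/Ontology_(information_science)"),
]

def linked_resources(triples: list[tuple[str, str, str]]) -> list[str]:
    """Collect the external resources this dataset links out to."""
    return [obj for subj, pred, obj in triples if pred == "owl:sameAs"]

print(linked_resources(triples))
# ['http://dbpedia.org/resource/Ontology_(information_science)']
```

In practice such data would be serialized in an RDF format like Turtle and published under an open license (principle 1), which no code snippet can demonstrate, only the accompanying metadata.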
- What are the benefits and limitations of publishing terminological resources as linked data?
Reusability is a benefit in itself, especially with this kind of format. It also allows one to interchange data easily, since the data is globally available. With an open representation, resources evolve faster and are freely extendable. This differs significantly from a database, where it is difficult to add or change elements; LD, by contrast, is very flexible, both in terms of adding and changing resources.
One limitation may be that, currently, certain types of information cannot be represented, such as diachronic information for digital humanities and similar fields that work with historical data and trace the evolution of language and concepts. The same applies to other types of linguistic description, for instance phonological, morphological, and multimodal information.
This, however, I am happy to report we work on quite actively in a COST Action called NexusLinguarum. COST Actions are networks of experts that come together to boost a certain field, which in this case is linguistic data science in general. Our main objectives in NexusLinguarum are to evaluate LLOD resources, approaches, and standards; to report on state-of-the-art developments in linguistic data science; and to propose best practices, training schools, and training materials. We strive to extend the current state of research by coming up with solutions for how best to model different levels of description, such as diachronic, morphological, and phonological information. Another aim is to report on and expand ways of utilizing deep learning, Big Data, and other Natural Language Processing (NLP) techniques in the creation, use, and application of LLOD, including an extensive collection of use cases in various domains. This allows newcomers to the field to see what linguistic data science can do.
- Semantic Deep Learning is the name of the workshop series you co-organize. Can you tell me more about it? What fascinates you about this topic?
Semantic Deep Learning refers to the combination of Semantic Web and Deep Learning. These are also the two research fields that have accompanied my career for the past couple of years. As the name suggests, it involves integrating ontologies and other types of Semantic Web technologies into deep learning to guide the machine’s decisions. To this end, we have organized five workshops collocated with major artificial intelligence (e.g. IJCAI), computational linguistics, and international Semantic Web conferences to get different communities on board. We have also organized a special issue on Semantic Deep Learning in the Semantic Web journal. It is truly fascinating how creative people are in combining Semantic Web technologies with Deep Learning. Some even use ontologies to provide explanations for deep learning, which remains an open research challenge. We understand the technical side of it, but how does the neural model learn the representations of texts and images to make predictions? There is still a lot to be discovered here.
- What are your responsibilities as National Anchor Point contact person for the European Language Resource Coordination (ELRC) and as National Competence Center main contact for the European Language Grid (ELG)?
These are two different initiatives. The first one, ELRC, focuses predominantly on collecting and providing country-specific and multilingual language resources to train European Machine Translation (MT) systems. My role here is to keep track of publicly available language resources developed in Austria by different institutions, and then to point these out and provide them to the EU. My colleagues from the Centre for Translation Studies (CTS), students, and I are actively creating language resources for this purpose too. Actually, ELRC had been operating long before I joined the CTS; I only recently took over from my dear colleague Gerhard Budin, who has been very active in this field and created a portal with publicly available resources. Additionally, we organize local workshops to bring industry and academia together in the field of language resources and technologies in Austria.
The second initiative, ELG, is an EU project whose goal is to make language technologies available globally and publicly, in as clean and consistent a manner as possible, on one holistic platform. The idea here is to provide web-based, easily accessible tools for machine translation, terminology, and lexicography extraction, among many others. My task, again, is on the one hand to make this initiative known to the Austrian public and, on the other hand, to involve companies by asking them about their needs and possible contributions. Furthermore, we cooperate with public institutions such as the Language Center of the military in Austria.
- What are your next research goals?
Though I have a couple of different machine translation-related projects, the most interesting ones relate to terminology: how can you integrate terminologies into neural machine translation models to guide decisions for low-resource languages, such as Standard Austrian German? Since we don’t have enough data to train machine translation systems on Austrian German, we need to take a model already trained on English-to-German and readjust it for the Austrian standard variety. This is what I’m currently working on: using Austrian German terminologies and integrating them into the training process to help the system learn this variety, which machine translation systems should take into account to be more usable for any application that requires Austrian German.
The other two major terminology-related projects concern term extraction. There are many term extraction tools, but none of them provides a full concept system, merely a list of term candidates. For this purpose, we want to build on ontology learning. This project, called Text To Terminological Concept System (Text2TCS), will be financed by and integrated into the ELG. The idea is to produce a truly useful tool that can extract full terminological concept systems across languages.
Finally, my research activities reflect my cognitive interests. I try to use the idea of embodied cognition to analyze differences across languages in specific domains. This extends the idea of the ontology, which assumes general knowledge that exists in the world – a universal knowledge – whereas the cognitive perspective focuses more on the individual, on the physical experiences people have with their bodies. I think it is interesting to bring this strongly individualized approach into the mix and then analyze specialized natural language expressions across different natural languages.
Interview by Justyna Dlociok, former trainee at Polish Translation Unit, DG TRAD at the European Parliament. English Language and Linguistics, University of Vienna, Vienna; Specialized Translation and Language Industry, University of Vienna, Vienna.