On 23rd October the Directorate-General for Translation of the European Commission in Luxembourg organised a seminar on Linguistic Linked Open Data for Terminology, presented by the prominent scientists Elena Montiel-Ponsoda and Jorge Gracia.
Translators, terminologists and trainees from the Commission and the Parliament had the opportunity to attend this seminar and learn more about Linked Data. The seminar covered the benefits of this technology for language resources and introduced existing models for representing linguistic information as Linked Data.
Inspired by this presentation, two of the current Terminology Coordination Unit trainees interviewed each of the two speakers. Today, TermCoord is publishing the first of these interviews, with the computer scientist Jorge Gracia.
Mr Gracia’s fields of interest include, among others, Linguistic Open Data and Ontology Matching. Research like this could change the way we compile, store and work with terminology databases, which is why we decided to include this interview in the publication series “Why is Terminology your passion?” An equally interesting interview with the co-presenter of the aforementioned seminar, Mrs Elena Montiel-Ponsoda, will also be online soon.
Jorge Gracia holds a degree in Physics from the University of Zaragoza, where he also obtained a PhD in Computer Science with a thesis titled “Integration and Disambiguation Techniques for Semantic Heterogeneity Reduction on the Web” (2009). He currently works as an assistant professor at the University of Zaragoza. Previously, he worked in computer consultancy in Barcelona and, more recently, as a postdoctoral researcher at the Universidad Politécnica de Madrid, participating in leading research projects on semantics and knowledge engineering. He has been a visiting researcher at leading research centres such as the Knowledge Media Institute (Open University, UK), INRIA (Grenoble, France), Università di Roma “La Sapienza” (Italy) and CITEC at the University of Bielefeld (Germany). His main research areas are the Semantic Web, Linguistic Linked Data, Ontology Matching, and Query Interpretation. He has been co-chair of the W3C Best Practices for Multilingual Linked Open Data community group and is currently co-chair of the W3C Ontology Lexica community group, where the lemon-ontolex model has been developed.
1. You are a computer scientist and most of your latest research projects are about Natural Language Processing and Language Resources on the Web of Data. What fascinates you most about these projects?
Natural language understanding by machines is a long-term goal of Artificial Intelligence. Our research, in the intersection of linguistics and the Semantic Web, has more modest goals, but it is still a step in that direction. What I consider most challenging and fascinating is the inherently imprecise nature of human language, so different from the formal and structured languages that computers use to run their programs. It is therefore very rewarding when, through a computer program, you are able to extract some insights from linguistic data, to formalise the semantics of certain entities, or to infer new knowledge.
2. What is Linked Data and why is it useful for language resources, terminology and dictionaries?
Linked Data refers to a set of best practices for exposing, sharing and connecting data on the Web. Such data can refer to practically anything, including documents, people, physical objects and abstract concepts. As a result, a “Web of Data” is emerging in which links are at the level of data, as a counterpart to the “traditional” Web, in which links are established at the level of documents (e.g. hyperlinks between web pages). When applied to language resources, we are representing and connecting linguistic data, and contributing to the growth of the so-called Linguistic Linked Open Data cloud.
Publishing language resources as Linked Data offers clear advantages to both data owners and data users, such as greater independence from domain-specific data formats or vendor-specific APIs (well-established standards of the World Wide Web Consortium are used instead), as well as easier access to and re-use of linguistic data by semantic-aware software. In fact, Linked Data makes it easier to connect datasets created by different people and for different purposes in a unified graph, so the combined information can be more easily traversed, queried and analysed.
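As a rough illustration of the "unified graph" idea, the following minimal Python sketch models two independently built datasets as sets of (subject, predicate, object) triples. The `http://example.org/` URIs, the predicate names and the sample terms are all hypothetical and chosen only for illustration; a real deployment would use RDF tooling and W3C vocabularies rather than plain tuples.

```python
# Minimal sketch: two datasets as (subject, predicate, object) triples.
# All URIs and predicate names below are hypothetical examples.
EX = "http://example.org/"
CONCEPT = EX + "concept/42"

# Dataset built by one team: English and Spanish terms for a concept.
terminology_a = {
    (CONCEPT, EX + "term", "linked data@en"),
    (CONCEPT, EX + "term", "datos enlazados@es"),
}

# Dataset built by another team, for another purpose.
terminology_b = {
    (CONCEPT, EX + "term", "données liées@fr"),
    (CONCEPT, EX + "domain", "information technology"),
}

# Both sources identify the concept with the same URI,
# so merging them into one graph is a plain set union.
graph = terminology_a | terminology_b


def objects(graph, subject, predicate):
    """Return all objects for a subject/predicate pair, sorted."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


terms = objects(graph, CONCEPT, EX + "term")
domains = objects(graph, CONCEPT, EX + "domain")
print(terms)
print(domains)
```

The point of the sketch is that no schema mapping is needed at merge time: because both datasets share a globally unique identifier for the concept, a single query over the combined graph returns terms that were originally stored in separate places.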
3. Could you give us some examples of Linked Data in terminology? What results can terminologists get if we use Linked Open Data, and how do they differ from the results we get through the traditional databases we use today?
An interesting example of Linked Data in terminology is Terminoteca RDF, an effort that we started when I was part of the Ontology Engineering Group (Universidad Politécnica de Madrid), focused on converting a number of multilingual terminologies in Spain into Linked Data. As a result, we obtained a unified graph where terminological data that was initially disconnected was easily discoverable with simple queries. The same types of queries are not impossible through traditional databases but are far from straightforward and they come at the price of losing Web-centred aspects (in the Web of Data, terms are defined in a unique manner at a Web scale and can be discovered/queried through Web standards).
4. In your opinion, what IT skills does a terminologist need to have?
In my view, modern terminologists do not need to be IT experts, but they should at least be aware of new technologies that can have an impact on their work and be open-minded towards them. This will enable them to choose what is best for their work and to communicate their needs to technologists more effectively.
5. IATE is a database of more than a million multilingual entries, and some datasets of its content are Linked Data. What would it require to transform the whole database into Linked Data and what would be the advantages of doing so?
The Linked Data demonstrator of IATE that was built in the context of the LIDER European project showed that it is feasible to apply Linked Data techniques to such an important resource, but it was based on an open subset of the data. The status of the full dataset in terms of licensing and reusability would need to be checked before a complete migration. If this conversion takes place, the IATE data would be ready for reuse by Linked Data-aware software agents and applications and for interlinking with other resources on the Linguistic Linked Open Data Cloud.
6. You have been working on the lemon model, a model of linguistic information as Linked Data. How can the interoperability of this model be used for translation and/or terminology, namely for IATE?
lemon, when used to represent translations, can be useful at two levels: first, at the knowledge representation level, and second, at the data interoperation level.
Firstly, one of the aspects of lemon in which I have been most involved, jointly with my colleagues at Universidad Politécnica de Madrid, has been the development of a module for representing translations and terminological variations. This module, called “vartrans”, covers the representation needs that arise when accounting for translations and variations. If someone needs a rich representation of translation relations as Linked Data, for instance to record the provenance of a translation, its directionality (source/target languages), or its type (e.g., “direct translation”, “cultural equivalent”, etc.), this module can be very helpful.
Secondly, Linked Data allows you to connect translations from different multilingual/bilingual data sources and dictionaries in a unified graph, making it easy to infer new translations between initially disconnected languages that were not explicitly defined in the original data. Along these lines, I am co-organising a “Translation Inference Across Dictionaries” shared task (https://tiad2019.unizar.es/) with the idea of exploring and comparing techniques that infer such indirect translations.
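To make the idea of inferring indirect translations more concrete, here is a minimal Python sketch of the simplest possible approach: chaining two bilingual dictionaries through a shared pivot language. The toy dictionaries and the function name `infer_via_pivot` are hypothetical, and this is not the method used in the shared task, where more elaborate techniques are compared.

```python
from collections import defaultdict

# Hypothetical toy dictionaries: source word -> set of target words.
es_en = {"casa": {"house"}, "perro": {"dog"}, "gato": {"cat"}}
en_fr = {"house": {"maison"}, "dog": {"chien"}}


def infer_via_pivot(src_to_pivot, pivot_to_tgt):
    """Infer source -> target candidates by chaining through a pivot language."""
    inferred = defaultdict(set)
    for src, pivots in src_to_pivot.items():
        for pivot in pivots:
            # Follow the pivot word into the second dictionary, if it is there.
            inferred[src] |= pivot_to_tgt.get(pivot, set())
    # Drop source words for which no candidate was found.
    return {src: tgts for src, tgts in inferred.items() if tgts}


# Spanish -> French candidates that neither dictionary states explicitly.
es_fr = infer_via_pivot(es_en, en_fr)
print(es_fr)
```

Note that this naive chaining only produces candidates: when the pivot word is polysemous, it can link source and target words that do not share a meaning, so some form of filtering or scoring is needed in practice.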
7. Another project that you have participated in is called Apertium, a machine translation platform. Could you explain to us a bit more about this project?
Apertium is an open source platform for developing rule-based machine translation, initially developed by Universitat d’Alacant in Spain, and now in the hands of a wider and very active community. I did not take part in this exciting project directly, but I took some of its resources and transformed them in order to enrich the cloud of Linguistic Linked Open Data. For example, a family of bilingual dictionaries was built as part of Apertium and used by its translation systems. What we did was convert twenty-two of these dictionaries into RDF (the basic formalism for representing data as Linked Data) and publish them on the Web. We named this initiative “Apertium RDF”, and it is a nice demonstration of the use of lemon to represent and interconnect bilingual dictionaries on the Web of Data.
8. You keep a blog where you write about computing and the Semantic Web. To what extent do you believe that the blog and other social media help so that people reach you and the knowledge you are sharing?
Unfortunately, I do not devote much time to the blog, although I plan to change this in the near future. I see this format as a way of sharing knowledge that complements scientific papers very well, since papers are harder for non-experts to digest. In addition, writing a blog entry is a very good exercise for putting your ideas in order and formulating them in a more accessible way.
9. Lynx is the new project that you and your team are working on. Could you describe what the aim of the project is?
In a nutshell, the idea of Lynx is to build a Legal Knowledge Graph that will integrate and link heterogeneous compliance data sources, including legislation, case law, standards and other private contracts, to support the development of smart services for legal compliance. The multilingual aspect is very important in the project, since the main issues with legal compliance usually arise across borders and languages. Linked Data techniques are core to this project: they are used both to represent knowledge and to link it.
10. How do you envisage the future for language and terminology resources as well as for dictionaries?
I think that dictionaries and terminologies must shed their physical boundaries and become natively digital. Although there are many electronic dictionaries out there, most of them still stick to the printed format and mimic the hierarchical structures found on paper. But this is only one of many possible arrangements of lexical information. In the Linked Data paradigm, any element of the lexicon (lexical entry, lexical sense, translation, form, etc.) can be a “first class citizen” and become the centre of a graph-based structure, which allows for many other possible arrangements and views of the information.
Linked Data has proved to be useful for language resources in general, and for terminologies and dictionaries in particular. With such technologies, we foresee more unified, linked graphs of terminologies and dictionaries on the Web, enriched through their linkage to other resources. A pending challenge is to build “Linked Data native” dictionaries and terminologies (so far we have only converted existing ones), which will open the field to exciting new possibilities and new, as yet unenvisioned, ways of working with lexicographic data.
Interviewed by Olga Vamvaka – Terminology trainee at the Terminology Coordination Unit of the European Parliament (Luxembourg).
She holds a BA in International Relations and Organisations and an MA in Translation and has worked in language teaching. She speaks Greek, English, Czech and French.