SynchroTerm – Canadian-based statistical term extractor from Terminotix

Languages supported: English, French, Spanish, Italian, Portuguese, German, Swedish, Russian, Greek, Dutch, Hungarian, Norwegian, Polish, Turkish, Czech, Danish, Bulgarian, Finnish, Romanian, Lithuanian, Slovak, and Slovenian
File formats supported: SynchroTerm can extract terms from the following file formats:

DOC; XLS; RTF; TXT; HTML; PDF; TMX (TRADOS, Déjà Vu, Wordfast, SDLX); Bitext (from LogiTerm).

Creation of monolingual and bilingual lists: In bilingual projects, extraction from ‘.tmx’ files gives very good results because the alignment in ‘.tmx’ files is perfect. With other bilingual files, alignment is less reliable, so bilingual extraction suffers.
Extraction options/settings: Extracted terms can be seen in context, and it is possible to select other contexts for the terms. The terminologist then validates the terms, which are automatically added to a validation list.
Validation lists: These lists can be exported by exporting the SynchroTerm project. Previous projects and their validation lists can be imported into a new SynchroTerm project before extraction so that terms already included in those lists are not extracted again, i.e. they work as “stop lists”.
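The “stop list” behaviour described above can be illustrated with a minimal sketch. It does not reproduce SynchroTerm's internal logic; the function name, the sample terms and the case-insensitive comparison are all assumptions made for illustration.

```python
def filter_new_candidates(candidates, validated_terms):
    """Drop candidates that already appear in an earlier validation list.

    Comparison is case-insensitive here; that is an assumption of this
    sketch, not documented SynchroTerm behaviour.
    """
    seen = {term.casefold() for term in validated_terms}
    return [term for term in candidates if term.casefold() not in seen]


# Hypothetical data: a validation list from a previous project and the
# raw candidates of a new extraction.
previous_validation_list = ["term extraction", "translation memory"]
new_candidates = ["Term extraction", "alignment", "translation memory", "stop list"]

remaining = filter_new_candidates(new_candidates, previous_validation_list)
print(remaining)  # ['alignment', 'stop list']
```

Only the candidates that were not validated in the earlier project remain for the terminologist to review.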
Creating terminology entries: Validation lists are the starting point for creating terminology entries. It is possible to define default values (= data fields) for certain attributes to speed up data entry later on. SynchroTerm provides three methods for generating entries:

  1. Batch processing allows you to create entries automatically. The number of entries created and the number of errors they contain depend on the options you set. Depending on the type of document and on your own preferences, you can opt for one of the two strategies below:

o    Generate a large number of entries automatically. The number of errors in the entries will be higher; however, the number of entries to be added manually will be lower.

o    Generate fewer entries automatically. The number of errors in the entries will decrease; however, the number of entries to be added manually will be higher.

Batch processing works best for documents in which the terminology is both repetitive and consistent.

  2. Creating entries manually from terms selected by SynchroTerm. There are four procedures to do this:

o    Create entries one at a time from the lists of terms extracted by SynchroTerm from the input files;

o    Modify the context to be saved with the entries;

o    Add a context to an entry if SynchroTerm is not set up to save the context automatically;

o    Delete irrelevant entries from the lists of extracted terms.

  3. Creating entries from terms you select yourself in the input files. Once new entries have been created from expressions identified by SynchroTerm, you can add terms that you select yourself. Selecting terms manually lets you pick up expressions that SynchroTerm may have overlooked, as well as single-word expressions.
Exporting entries: Validated and created entries can be exported into the following formats:

–       Terminotix LogiTerm: to import terminology as a Word file, use the LogiTerm format;

–       HTML: to obtain a printed copy of your data, export your entries in HTML format, then open the export file in your Internet browser and print it;

–       Trados MultiTerm 5.5;

–       Trados MultiTerm iX;

–       Trados WinAlign: note that when you export your data in the Trados WinAlign format (tab-separated text), the entries include only the source and target terms;

–       Microsoft Excel: Useful to import into IATE;

–       PROMPT;

–       Comma-delimited: this format can in turn be converted into any database format you choose, because SynchroTerm allows you to sort the fields in any specific order before they are exported. Useful to import into IATE.
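Why field order matters for a comma-delimited export can be shown with a short sketch. This is not SynchroTerm's exporter; the field names (`source`, `target`, `domain`) and the helper `export_entries` are hypothetical and only illustrate writing the same entries with a column order chosen by the user.

```python
import csv
import io


def export_entries(entries, field_order):
    """Serialise entries as comma-delimited text with columns in field_order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=field_order, lineterminator="\n")
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


# Hypothetical entry; real SynchroTerm entries carry other fields.
entries = [
    {"source": "stop list", "target": "liste d'exclusion", "domain": "terminology"},
]

# The target database is assumed to expect the domain column first.
exported = export_entries(entries, ["domain", "source", "target"])
print(exported)
```

Changing the list passed as `field_order` is enough to match the column layout that the receiving database expects.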


SDL MultiTerm Extract

Languages supported: All Unicode-compliant languages
File formats supported: TXT; DOC; HTML; HTM; TMX; RTF; XML; SGM; SGML; PPT; XLS; TMW; TTX (as per user manual)
Creation of monolingual and bilingual lists: It is not possible to set frequency parameters (frequency with which a translated term occurs) for monolingual extractions. This is only possible for bilingual extractions.
Extraction options/settings: It exports term lists in TXT and XML, and into existing MultiTerm term bases.

Extraction time is somewhat longer.

It generates a folder containing several files.

Quite accurate.

The tool is not case sensitive (this is not a parameter that the user can choose).

It is possible to filter by validated and non-validated terms (the latter is useful when double-checking that all relevant terms were validated). Also, when validating other terms present in the non-validated list, it is possible to export again and replace the previous list. MT Extract automatically proposes a stop-word list for monolingual and bilingual extraction projects. The list is quite complete and can be enriched with new words to improve extraction quality.

In addition to a stop-word list, MT Extract also includes a list of Basic Vocabulary, which further improves the extraction quality. For more precise term extractions, it is advisable to leave the maximum term length parameter at 10 (the default).

Validation lists: It is very easy to validate. MT Extract allows users to edit the list of extracted terms shown in the Term Window. Being able to edit the extracted terms is fundamental when validating terms for inclusion in a glossary or a database. MT Extract, being a statistical tool, does not lemmatise results. Instead, different forms of the same term are extracted according to their frequency.
Creating terminology entries: The term candidates proposed are almost all relevant terms. The few false positives produced could be included in the basic vocabulary or imported into a learning database to further improve extraction quality.
Exporting entries: The export process is quite intuitive (File > Export Wizard). MT Extract can export in different formats: TXT, MultiTerm XML and MultiTerm export format. The latter allows users to export directly into an existing MultiTerm term base. A database of “unwanted terms” can be added to the list of stop words and basic vocabulary to prevent their extraction. Once a user has a good database of unwanted terms to include, the extraction quality should improve. The same can be done with existing term bases, whose terms will then be excluded from the terminology extraction.




AlchemyAPI

Languages supported: Keyword extraction is supported in over a half-dozen different languages, enabling even foreign-language content to be categorized and tagged: EN, FR, DE, IT, PT, RU, ES, SV.
File formats supported: AlchemyAPI can extract terms from different file formats, such as Microsoft Word, Microsoft Excel, HTML or web-based documents.

API endpoints are provided for performing keyword extraction on Internet-accessible URLs and posted HTML files or text content.

Extraction options/settings: Extracted terms can be seen in context, and it is possible to select other contexts for the terms.
Validation lists: Terms are validated by the terminologist and are then automatically included in a validation list. The validation lists are processed very quickly and offer a time-efficient way of collecting the results.
Exporting entries: Extracted metadata may be returned in XML, JSON, RDF, and Microformats rel-tag formats.
Automatic extraction and verification: Four trainees worked on the term-mining project between July and September 2011. In total, approximately 472 terms (40%) were extracted manually and 716 (60%) automatically before verification.

Approximately 416 of the automatically extracted terms were discarded by the users because they were not real terms. This is 58% of the total number of terms extracted automatically before verification. A further 124 terms were identical or almost identical to terms from manual extraction.

Thus, the final number of terms extracted automatically (i.e. not including discarded terms and terms identical to those extracted manually) was 176 (out of 716 terms, i.e. approximately 25% of the total of terms extracted automatically).
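The percentages above follow directly from the four counts given in the text. As a quick check, the arithmetic can be reproduced as follows (the variable names are chosen for this sketch only):

```python
# Counts reported in the text.
manual = 472                # terms extracted manually
automatic = 716             # terms extracted automatically, before verification
discarded = 416             # automatic terms rejected as not being real terms
duplicates_of_manual = 124  # automatic terms (almost) identical to manual ones

total = manual + automatic
share_manual = manual / total
share_automatic = automatic / total
discard_rate = discarded / automatic

# Automatic terms that survive verification and are not duplicates.
retained = automatic - discarded - duplicates_of_manual
retained_rate = retained / automatic

print(f"total extracted:     {total}")
print(f"manual share:        {share_manual:.0%}")
print(f"automatic share:     {share_automatic:.0%}")
print(f"discarded share:     {discard_rate:.0%}")
print(f"retained automatic:  {retained} ({retained_rate:.0%})")
```

The script reports 1188 terms in total, shares of 40% and 60%, a discard rate of 58%, and 176 retained terms (25% of the automatic extraction), which matches the figures in the text.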

Accuracy: The disadvantage of the tool is that the results are limited to a list of 80 terms. Thus, there is a problem when comparing a short and a long document: the short document analysis results in some irrelevant terms, but the analysis of the long one results in a list of terms that is too short.

The trainees working with AlchemyAPI generally shared the view that, although highly efficient and user-friendly, the automatic extraction tool is not yet reliable enough to be useful. It cannot, for instance, determine what a term is, and it does not seem to have any criteria for extraction.

The trainees also indicated that manual extraction is more relevant and reliable, especially when the human extractor is a specialist in the domain concerned. Another concern expressed was that in some cases the automatic extractor tends to invent new, and very often false, terms by combining words.


TaaS – Terminology as a Service

Languages supported: The 24 official languages of the European Union plus Russian
File formats supported: PDF, DOC, DOCX, XLS, XLSX, PPTX, RTF, TXT, XLIFF, XLF, XML, HTML, HTM, MIF. The open Beta version has certain limitations in terms of file and project size.
Extraction options/settings: To extract source terms, there are many options to choose from:

– TWSC (Tilde wrapper system for CollTerm): based on linguistic analysis enriched by statistical features

– Kilgray Terminology extractor: based on language independent statistical analysis

– Normalise terms: Canonical or dictionary forms

– Keep existing terms

– Visualisation

It also allows you to choose sources for target translation lookup:

– TaaS public collections

– My collections

– EuroTermBank


– TAUS Data

– Web Data

Extraction time is somewhat long, but the results are quite accurate.

It is possible to filter by non-validated terms, terms without translation, as well as by the source used for target translation lookup.

The tool is not case sensitive (this is not a parameter that the user can choose).

It automatically uses a stop-word list containing the 200 most common words from an Inverse Document Frequency (IDF) list (reference corpus).
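How such a stop-word list might be derived from a reference corpus is sketched below. This is not the TaaS implementation: the IDF formula log(N / df), the whitespace tokenisation and the function name `idf_stop_words` are assumptions of this example. Words that occur in most documents have the lowest IDF, so the most common words are the ones with the smallest scores.

```python
import math
from collections import Counter


def idf_stop_words(reference_corpus, size=200):
    """Return the `size` words with the lowest IDF in the reference corpus."""
    n_docs = len(reference_corpus)
    doc_freq = Counter()
    for document in reference_corpus:
        # Count each word once per document (document frequency).
        doc_freq.update(set(document.lower().split()))
    idf = {word: math.log(n_docs / df) for word, df in doc_freq.items()}
    # Lowest IDF first; ties are broken alphabetically for a stable result.
    ranked = sorted(idf, key=lambda word: (idf[word], word))
    return ranked[:size]


# Tiny hypothetical reference corpus; a real one would be far larger.
corpus = [
    "the term is extracted from the text",
    "the list of stop words",
    "the quality of the tool",
]
stop_words = idf_stop_words(corpus, size=2)
print(stop_words)  # ['the', 'of']
```

With a realistic corpus and `size=200`, the result would be a list of very frequent function words that can be excluded before term candidates are ranked.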

Validation lists: It is very easy to validate, and you can also easily edit any approved entry. The tool integrates automated data extraction and user-supported clean-up of raw terminological data, and user-validated terminology can be shared.
Creating terminology entries: You can edit terms as well as add new source terms and then edit them. You can add definitions, notes and usage information (source identifier, register, term type, etc.).
Exporting entries: It exports terms in TBX, CSV, TSV, Moses and custom formats.


TermoStat Web 3.0

Languages supported: French, English, Spanish, Italian and Portuguese.
File formats supported: Only TXT and RTF.
Extraction options/settings: You can choose to extract simple terms (restricted, if you wish, to adjectives, adverbs, verbs or nouns), complex terms, or both.

Extracted terms can be seen in context.

Validation lists: Editing is not possible within the tool. To edit the results, export them and work on the exported file.
Creating terminology entries: Not possible.
Exporting entries: It exports terms only in TXT, which is not appropriate at all for a term table. The TXT file can, however, be opened in Excel by applying the corresponding import settings (Data > From text: delimited, UTF-8, Tab).
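The same import settings (delimited, UTF-8, tab) apply if the export is processed with a script instead of Excel. The sketch below assumes a tab-delimited UTF-8 text file; the sample rows and the two-column layout are hypothetical, and the real TermoStat export may contain different columns.

```python
import csv
import os
import tempfile


def read_term_table(path):
    """Read a tab-delimited, UTF-8 encoded text export into a list of rows."""
    with open(path, encoding="utf-8", newline="") as handle:
        return list(csv.reader(handle, delimiter="\t"))


# Hypothetical sample export, written to a temporary file for the demo.
sample = "extraction terminologique\t12\nbase de données\t7\n"
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", newline="", suffix=".txt", delete=False
) as tmp:
    tmp.write(sample)

rows = read_term_table(tmp.name)
os.remove(tmp.name)
print(rows)
```

Each row comes back as a list of cell values, so accented characters survive and the columns can be written on to any other format.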
Accuracy: The tool might seem very basic, but the results are really good. You get a table listing terms, frequency rate, score and orthographical variants, which is really useful, as plural forms are listed there instead of being counted as separate “terms”.



Languages supported: English
File formats supported: Text needs to be copied and pasted into the box. Text can be up to 200,000 characters long.
General characteristics: Results can be sorted alphabetically, by relevance, by frequency or by familiarity.

You can create word lists, but a subscription is necessary.

Terms are shown in context.

A really interesting feature is that by selecting any word on the list it is possible to see a snapshot of the Visual Thesaurus map and definitions for that word, along with examples of the word in the text.

It is not really intended for extraction purposes, but it can come in quite handy for a quick search.

Exporting entries: Only possible if a small subscription is paid.


