Term extraction analysis done by TermCoord
SynchroTerm – Canadian-based statistical term extractor from Terminotix
|Languages supported||English, French, Spanish, Italian, Portuguese, German, Swedish, Russian, Greek, Dutch, Hungarian, Norwegian, Polish, Turkish, Czech, Danish, Bulgarian, Finnish, Romanian, Lithuanian, Slovak, and Slovenian|
|File formats supported||SynchroTerm supports different file formats from which the term extraction can be carried out:|
DOC; XLS; RTF; TXT; HTML; PDF; TMX (TRADOS, Déjà Vu, Wordfast, SDLX); Bitext (from LogiTerm).
|Creation of monolingual and bilingual lists||In bilingual projects, extraction from ‘.tmx’ files gives very good results because the segments in these files are already perfectly aligned. With other bilingual files the alignment is less reliable, and bilingual extraction suffers accordingly.|
|Extraction options/settings||Extracted terms can be seen in context and it is possible to select other contexts for the terms. Once the terminologist validates a term, it is automatically added to a validation list.|
|Validation lists||Validation lists are exported by exporting the SynchroTerm project. Earlier projects and their validation lists can be imported into a new SynchroTerm project before extraction so that terms already validated are not extracted again, i.e. they work as “stop lists”.|
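The stop-list behaviour described above can be illustrated with a minimal Python sketch. This is not SynchroTerm's implementation: the function name `filter_new_candidates` and the sample terms are hypothetical, and case-insensitive matching is an assumption made for the example.

```python
def filter_new_candidates(candidates, validated_terms):
    """Return the candidates that are not already in earlier validation lists."""
    # Assumption: terms are compared ignoring case and surrounding whitespace.
    seen = {term.strip().lower() for term in validated_terms}
    return [c for c in candidates if c.strip().lower() not in seen]


# Hypothetical data: one earlier validation list and one new extraction run.
previous_validation_list = ["term base", "Translation memory"]
new_candidates = ["translation memory", "term extraction", "Term base", "alignment"]

remaining = filter_new_candidates(new_candidates, previous_validation_list)
print(remaining)  # ['term extraction', 'alignment']
```

Only the candidates that were not validated in the earlier project are left for the terminologist to review.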
|Creating terminology entries||Validation lists are the starting point to create terminology entries. It is possible to define default values (= data fields) for certain attributes to speed up data entry later on. SynchroTerm provides 3 methods for generating entries:|
o Generate a large number of entries automatically. The number of errors in the entries will be higher; however, the number of entries to be added manually will be lower.
o Generate fewer entries automatically. The number of errors in the entries will decrease; however, the number of entries to be added manually will be higher.
Batch processing works best for documents in which the terminology is both repetitive and consistent.
o Create entries one at a time from the lists of terms extracted by SynchroTerm from the input files;
o Modify the context to be saved with the entries;
o Add a context to an entry if SynchroTerm is not set up to save the context automatically;
o Delete irrelevant entries from the lists of extracted terms.
|Exporting entries||Validated and created entries can be exported into the following formats:|
– Terminotix LogiTerm: to import terminology as a Word file, use the LogiTerm format;
– HTML: to print your data, export your entries in HTML format and then open the export file in your Internet browser and print it from there;
– Trados MultiTerm 5.5;
– Trados MultiTerm iX;
– Trados WinAlign: note that when you export your data in the Trados WinAlign format (tab-separated text), the entries include only the source and target terms;
– Microsoft Excel: Useful to import into IATE;
– Comma-delimited: which can in turn be converted into any database format you choose because SynchroTerm allows you to sort the fields in any specific order before they are exported. Useful to import into IATE.
SDL MultiTerm Extract
|Languages supported||all Unicode-compliant languages|
|File formats supported||TXT; DOC; HTML; HTM; TMX; RTF; XML; SGM; SGML; PPT; XLS; TMW; TTX (as per user manual)|
|Creation of monolingual and bilingual lists||It is not possible to set frequency parameters (frequency with which a translated term occurs) for monolingual extractions. This is only possible for bilingual extractions.|
|Extraction options/settings||It exports term lists to TXT and XML and into existing MultiTerm term bases.
Extraction time is somewhat longer.
It generates a folder with several files stored under it.
The tool is not case sensitive (this is not a parameter that the user can choose)
It is possible to filter by validated and non-validated terms (the latter is useful when double-checking that all relevant terms were validated). Also, when validating other terms present in the non-validated list, it is possible to export again and replace the previous list. MT Extract automatically proposes a stop-word list for monolingual and bilingual extraction projects. The list is quite complete and can be enriched with new words to improve extraction quality.
In addition to a stop-word list, MT Extract also includes a Basic Vocabulary list, which further improves extraction quality. For more precise term extractions, it is advisable to leave the maximum term length parameter at 10 (default).
|Validation lists||It is very easy to validate. MT Extract allows users to edit the list of extracted terms shown in the Term Window. Being able to edit the extracted terms is fundamental when validating terms for inclusion in a glossary or a database. MT Extract, being a statistical tool, does not lemmatise results: different forms of the same term are extracted separately, according to their frequency.|
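To show what the absence of lemmatisation means in practice, here is a minimal Python sketch of purely frequency-based counting. It is an illustration only and does not reproduce MT Extract's algorithm; the function `count_surface_forms` and the sample sentence are hypothetical.

```python
import re
from collections import Counter


def count_surface_forms(text):
    """Count each word form exactly as it appears (lowercased, no lemmatisation)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


# Hypothetical sample text containing both the singular and the plural form.
sample = "The term base stores terms. Each term has a term entry; terms are validated."
counts = count_surface_forms(sample)

# "term" and "terms" are counted as two separate candidates.
print(counts["term"], counts["terms"])  # 3 2
```

Because the singular and the plural are treated as distinct strings, both can appear in the candidate list, which is why manual editing during validation matters.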
|Creating terminology entries||The term candidates proposed are almost all relevant terms. The few false positives produced could be included in the basic vocabulary or imported into a learning database to further improve extraction quality|
|Exporting entries||The export process is quite intuitive (File > Export Wizard). MT Extract can export in different formats: TXT, MultiTerm XML and MultiTerm export format. The latter allows users to export directly into an existing MultiTerm term base. This database of “unwanted terms” can be added to the list of stop words and basic vocabulary to prevent their extraction. Once a user has a good database of unwanted terms to include, the extraction quality should improve. The same can be done with existing term bases, whose terms will then be excluded from the terminology extraction.|
AlchemyAPI
|Languages supported||Keyword extraction is supported in eight languages, so that foreign-language content can also be categorised and tagged: EN, FR, DE, IT, PT, RU, ES, SV.|
|File formats supported||AlchemyAPI supports different file formats from which the term extraction can be carried out, such as Microsoft Word, Microsoft Excel, HTML or web-based document.|
|Extraction options/settings||Extracted terms can be seen in context and it is possible to select other contexts for the terms.|
|Validation lists||Terms are validated by the terminologist and then they are automatically included in a validation list. The validation lists are processed very quickly and they present a time-efficient way of collecting the results.|
|Exporting entries||Extracted meta-data may be returned in XML, JSON, RDF, and Microformats rel-tag formats.|
|Automatic extraction and verification||Four trainees worked on the term-mining project between July and September 2011. In total, approximately 472 terms (40%) were extracted manually and 716 (60%) automatically before verification.|
Approximately 416 of the automatically extracted terms were discarded by the users because they were not real terms. This is 58% of all terms extracted automatically before verification. A further 124 terms were identical or almost identical to terms from the manual extraction.
Thus, the final number of terms extracted automatically (i.e. not including discarded terms and terms identical to those extracted manually) was 176 (out of 716 terms, i.e. approximately 25% of the total of terms extracted automatically).
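The figures quoted above can be checked with a few lines of Python. The variable names are chosen for this illustration only; the numbers are those given in the text.

```python
# Counts reported for the July-September 2011 term-mining project.
manual = 472                 # terms extracted manually
automatic = 716              # terms extracted automatically, before verification
discarded = 416              # automatic terms rejected as not being real terms
duplicates_of_manual = 124   # automatic terms (almost) identical to manual ones

total = manual + automatic
kept = automatic - discarded - duplicates_of_manual

print(total)                               # 1188
print(round(100 * manual / total))         # 40
print(round(100 * automatic / total))      # 60
print(round(100 * discarded / automatic))  # 58
print(kept)                                # 176
print(round(100 * kept / automatic))       # 25
```

The rounded percentages match the 40%, 60%, 58% and 25% stated in the text.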
|Accuracy||The disadvantage of the tool is that the results are limited to a list of 80 terms. Thus, there is a problem when comparing a short and a long document: the short document analysis results in some irrelevant terms, but the analysis of the long one results in a list of terms that is too short.|
The trainees working with AlchemyAPI generally shared the view that, although highly efficient and user-friendly, the automatic extraction tool is not yet reliable enough to be useful. It cannot, for instance, determine what a term is, and it does not seem to apply any criteria for extraction.
The trainees also indicated that manual extraction is more relevant and reliable, especially when the human extractor is a specialist in the domain concerned. Another concern was that in some cases the automatic extractor tends to invent new, and very often false, terms by combining words.
TaaS – Terminology as a Service
|Languages supported||The 24 official languages of the European Union plus Russian|
|File formats supported||PDF, DOC, DOCX, XLS, XLSX, PPTX, RTF, TXT, XLIFF, XLF, XML, HTML, HTM, MIF. The open Beta version has certain limitations in terms of file and project size.|
|Extraction options/settings||To extract source terms, there are many options to choose from:|
– TWSC (Tilde wrapper system for CollTerm): based on linguistic analysis enriched by statistical features
– Kilgray Terminology extractor: based on language independent statistical analysis
– Normalise terms: Canonical or dictionary forms
– Keep existing terms
It also allows you to choose sources for target translation lookup:
– TaaS public collections
– My collections
– TAUS Data
– Web Data
Extraction takes a somewhat long time, but the results are quite accurate.
It is possible to filter by non-validated terms, terms without translation, as well as by the source used for target translation lookup.
The tool is not case sensitive (this is not a parameter that the user can choose).
It automatically uses a stop-word list containing the 200 most common words from an Inverse Document Frequency (IDF) list built from a reference corpus.
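How TaaS builds its IDF list is not described here, so the following Python sketch is only one plausible reading: compute the IDF of every word in a reference corpus and take the words with the lowest values, i.e. the most common ones, as stop words. The function `build_stop_words`, the tiny corpus and the list size of 3 (instead of 200) are all assumptions made for the example.

```python
import math


def build_stop_words(reference_docs, size=200):
    """Return the `size` words with the lowest IDF in the reference corpus."""
    n_docs = len(reference_docs)
    doc_freq = {}
    for doc in reference_docs:
        # Count each word once per document (document frequency).
        for word in set(doc.lower().split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    # Plain IDF: log(N / df). Words found in every document get 0.
    idf = {word: math.log(n_docs / df) for word, df in doc_freq.items()}
    # Lowest IDF first; ties are broken alphabetically for a stable result.
    ranked = sorted(idf, key=lambda word: (idf[word], word))
    return ranked[:size]


# Hypothetical three-document reference corpus.
reference_docs = [
    "the committee adopted the report",
    "the report of the council",
    "a proposal of the commission",
]
stop_words = build_stop_words(reference_docs, size=3)
print(stop_words)  # ['the', 'of', 'report']
```

Note that on such a small corpus a content word like "report" ends up in the list; a large reference corpus is needed for the lowest-IDF words to be genuine function words.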
|Validation lists||It is very easy to validate and you can also easily edit any approved entry. The tool integrates automated data extraction and user-supported clean-up of raw terminological data and one can share user-validated terminology.|
|Creating terminology entries||You can edit terms as well as add new source terms and then edit them. You can add definition, notes, usage (source identifier, register, term type, etc.).|
|Exporting entries||It exports terms in TBX, CSV, TSV, Moses and custom formats.|
TermoStat Web 3.0
|Languages supported||French, English, Spanish, Italian and Portuguese.|
|File formats supported||Only TXT and RTF.|
|Extraction options/settings||You can select to either extract simple terms (here you can also choose between adjectives, adverbs, verbs or nouns) and/or complex terms.|
Extracted terms can be seen in context.
|Validation lists||Editing is not possible within the tool. To edit the results, export them and work on the exported file.|
|Creating terminology entries||Not possible.|
|Exporting entries||It exports terms only in TXT, which is not well suited to a term table. The TXT file can, however, be opened in Excel with the appropriate import settings (Data > From text: delimited, UTF-8, Tab).|
|Accuracy||The tool might seem very basic, but the results are really good. You get a table listing terms, frequency, score and orthographical variants, which is really useful because plural forms are listed here as variants instead of being counted as separate “terms”.|
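The same import settings (delimited, UTF-8, Tab) can be applied outside Excel. The Python sketch below reads a tab-separated text file with the standard `csv` module. The column names and sample rows are hypothetical and are written to a temporary file only to keep the example self-contained; a real TermoStat export may use different columns.

```python
import csv
import os
import tempfile


def read_term_table(path):
    """Read a tab-delimited, UTF-8 encoded text file into a list of rows."""
    with open(path, encoding="utf-8", newline="") as handle:
        return list(csv.reader(handle, delimiter="\t"))


# Hypothetical export content standing in for a real TXT export.
sample = "candidate\tfrequency\tscore\nterm base\t12\t8.5\nextraction\t7\t5.1\n"
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False, newline=""
) as tmp:
    tmp.write(sample)
    path = tmp.name

rows = read_term_table(path)
os.remove(path)  # clean up the temporary file

print(rows[0])  # ['candidate', 'frequency', 'score']
print(rows[1])  # ['term base', '12', '8.5']
```

All values are read as strings; numeric columns would need to be converted before sorting by frequency or score.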
|File formats supported||Text needs to be copied and pasted into the box. Text can be up to 200,000 characters long.|
|General characteristics||Results can be sorted alphabetically, by relevance, according to their frequency and familiarity.|
You can create word lists, but a subscription is necessary.
Terms are shown in context.
A really interesting feature is that by selecting any word on the list it is possible to see a snapshot of the Visual Thesaurus map and definitions for that word, along with examples of the word in the text.
It is not really intended for extraction purposes, but it can come in quite handy for a quick search.
|Exporting entries||Only possible if a small subscription is paid.|