Summarization, Translation and Entity Extraction
Contact: Chiori Hori, Stephan Vogel
Started in 2005
The STEEM project aims to improve the reliability and usefulness of machine translation (MT) by shifting emphasis from a word-by-word translation of the input to the proper handling of key information that is to be understood and processed by an English-speaking human analyst. To achieve this goal, the content has to become more readable (reduced to its key information), and the reliability of and confidence in this key information have to be improved.
Today’s MT systems have particular difficulty with key information such as names, because names are infrequent, specialized, and may not be included in a translator's dictionary. The resulting MT output may appear fluent but could contain grave errors that distort the underlying meaning. The problem is even more pronounced if the input to an MT system is the result of an automatic recognition process applied to speech recordings, scanned images, etc. In this case, the textual transcript will also contain errors caused by misrecognition or out-of-vocabulary items and cannot be relied upon. Moreover, even if an accurate transcript including all named entities could be achieved, machine translation is further complicated by the lack of parallel corpora for conversational spoken language, and the output from translating disfluent input would hardly be readable or useful. Rather, translation must be anchored on the information-bearing items and selectively ignore irrelevant or distracting material.
In the STEEM project we explore solutions to these problems in two separate ways:
the first, by reliably detecting and searching for named entities and their relations, even if they are buried in the source by the out-of-vocabulary problems discussed above;
the second, by summarizing the input in a manner that is optimized for translation and for retention of useful content.
Such a dramatic reduction requires summarization to go beyond selecting phrases from the source, to rewording (translating!) the input into a new, more concise paraphrase of the content. If the summarization is guided by the key entities and their relationships, we can perform consistency checks and present the translator with a rendition of the input that is more suitable for proper translation by a text translation system (since most parallel corpora will be text-based). We plan to process Chinese and Arabic as input languages and to produce English output.





