TIDES MegaRADD-MT: Large-Data Rapidly-Adaptable, Data-Driven MT
Contact: Stephan Vogel
Language Technologies Institute - Carnegie Mellon University
Alex Waibel, PI
Objective: To develop new and improved machine translation engines, in particular using Example-Based MT (EBMT) and Statistical MT (SMT), and Multi-Engine integration techniques, to support rapid deployment of MT, both to new languages with large amounts of available data, and to new tasks. In support of this, to investigate the mining of world-wide web and other public language data sources.
Approach: The SMT approach at CMU extends word-to-word translation models to phrase-to-phrase translation. Bilingual corpora are used to find these translation correspondences, which the statistical translation engine then uses, together with a standard n-gram language model, to generate translations for new sentences. For Chinese-English translation, the statistical and example-based translation systems will also be improved by using automatically generated word classes and named entity detection. For Arabic-English translation, baseline statistical and example-based translation systems have been developed using the UN Arabic-English bilingual corpus. Preprocessing included correction of corrupted numbers in the training corpus. The baseline SMT system uses the statistical lexicon trained on the bilingual corpus and phrase-to-phrase alignments extracted from the Viterbi alignment.
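To illustrate how phrase-to-phrase correspondences can be read off a word alignment, here is a minimal sketch. It assumes the widely used consistency criterion (a phrase pair is kept only if no word inside it is linked to a word outside it); the summary above does not state which criterion the CMU system applies, so the function name, the `max_len` limit, and the criterion itself are illustrative assumptions.

```python
def extract_phrase_pairs(src_len, alignment, max_len=4):
    """Return phrase pairs consistent with a word alignment.

    alignment: set of (source_index, target_index) links, e.g. taken from
    a Viterbi alignment. Each result is ((s_start, s_end), (t_start, t_end))
    with inclusive word indices. Hypothetical helper, not the CMU code.
    """
    pairs = []
    for s_start in range(src_len):
        for s_end in range(s_start, min(s_start + max_len, src_len)):
            # Target positions linked to any source word in the span.
            linked = [t for (s, t) in alignment if s_start <= s <= s_end]
            if not linked:
                continue
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Consistency check: every link that touches the target span
            # must originate inside the source span.
            consistent = all(
                s_start <= s <= s_end
                for (s, t) in alignment
                if t_start <= t <= t_end
            )
            if consistent:
                pairs.append(((s_start, s_end), (t_start, t_end)))
    return pairs
```

This sketch returns only the tightest target span for each source span and does not extend spans over unaligned target words, which fuller extraction procedures often do.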
For Multi-Engine MT (MEMT), different translation approaches (including at least one statistical engine, an example-based MT engine, and a glossary engine) are being integrated into a multi-engine translation system, in which each translation engine contributes translation hypotheses for all or part of the input sentence. An improved selection module will be developed to extract the overall best translation on the basis of translation scores and language model scores. To do this, the statistical engines must be adapted to work within the MEMT framework, and MEMT decoding performance should be raised through better language modeling and the ability to incorporate output from engines using different source/target word alignments in a single decoder search. Innovative techniques from machine learning are also being investigated to see whether they improve the selection process.
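The selection step can be pictured as a search over partial hypotheses that together cover the input. The following is a simplified sketch, not the MEMT decoder itself: it assumes monotone combination (no reordering between segments), scores each segment's language model contribution independently of its neighbours, and takes `lm_score` as a hypothetical caller-supplied function. The tuple layout and the `lm_weight` parameter are likewise assumptions.

```python
import math

def select_best(n, hypotheses, lm_score, lm_weight=1.0):
    """Pick the best-scoring sequence of hypotheses covering n source words.

    hypotheses: list of (start, end, text, score) tuples, where the
    hypothesis translates source words [start, end) and score is the
    engine's translation score (higher is better, e.g. a log probability).
    Returns (total_score, [texts]) or None if no full cover exists.
    """
    # best[i] = (score, hypothesis) for the best cover of words [0, i).
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for hyp in hypotheses:
            start, h_end, text, score = hyp
            if h_end != end or best[start][0] == -math.inf:
                continue
            total = best[start][0] + score + lm_weight * lm_score(text)
            if total > best[end][0]:
                best[end] = (total, hyp)
    if best[n][0] == -math.inf:
        return None
    # Follow the stored hypotheses back to the start of the sentence.
    texts, pos = [], n
    while pos > 0:
        hyp = best[pos][1]
        texts.append(hyp[2])
        pos = hyp[0]
    return best[n][0], list(reversed(texts))
```

A real decoder would score the language model across segment boundaries and would need to reconcile engines whose outputs segment the source differently, as the paragraph above notes.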
To provide larger amounts of parallel language data in support of the above work, methods for extracting Chinese-English and Arabic-English parallel text are being investigated. As a first step, methods for finding similar stories in Chinese and English newswire data streams are being explored, based on automatically trained sentence alignment models. Parallel sentences must then be found within these stories; this is being done using sentence-level and word-level alignment methods. Human evaluations will then be carried out to determine the quality of the generated bilingual corpus. The Chinese Xinhua news and the world-wide web are currently being used as data sources. (Xinhua work done in collaboration with LDC.)
Recent Accomplishments: (as of June 2002)
<typolist>
Produced the best-scoring U.S. Chinese-English system in the large-common-resources track of the June 2002 DARPA MT Evaluation.
Improved all MT systems through careful, evaluation-driven selection of bilingual data sources and of the data included in large language models.
Improved preprocessing for Chinese, especially the treatment of number, time and date expressions, on the basis of regular grammars.
Improved Chinese-English SMT by re-segmenting untranslated Chinese and translating it in a second pass.
Developed dynamic generation of phrase-to-phrase translation lexicons for SMT. Two different approaches to finding good phrase-to-phrase translations were developed, implemented and tested. Both approaches gave significant improvements in translation quality.
Improved preprocessing for Arabic by including minimal Arabic morphological processing: prefixes and infixes are heuristically removed from a word until the remaining form is known from the training corpus or no further reduction is possible. This cut the number of unknown words in the test sentences by more than 50%.
Both Arabic MT systems were also improved by adding human knowledge in the form of handwritten rules to translate numbers, dates, and names of people and places, and by training the alignment model from English-to-Arabic instead of Arabic-to-English and reversing the resulting phrase-to-phrase transducer.
Improved EBMT by importing SMT-produced cross-language word alignments into EBMT.
Improved MEMT by doubling the size of the language model, thus reducing the out-of-vocabulary rate.
</typolist>
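The Arabic preprocessing item above describes a stop-when-known reduction loop. A minimal sketch of that idea follows, restricted to prefix stripping for brevity. The prefix list (transliterated forms of common proclitics), the greedy first-match order, and the choice to fall back to the original word when no known form is reached are all assumptions for illustration; the summary does not specify these details.

```python
def reduce_to_known(word, vocab, prefixes=("w", "Al", "b", "l")):
    """Strip prefixes until the remainder is in vocab.

    vocab: set of word forms seen in the training corpus.
    prefixes: illustrative transliterated prefixes (an assumption).
    Returns the first known reduced form, or the original word if no
    further reduction is possible before a known form is found.
    """
    current = word
    while current not in vocab:
        for prefix in prefixes:
            # Never strip a prefix that would leave an empty string.
            if current.startswith(prefix) and len(current) > len(prefix):
                current = current[len(prefix):]
                break
        else:
            # No prefix applied: give up and keep the original form.
            return word
    return current
```

For example, with a vocabulary containing `ktAb`, the form `wAlktAb` is reduced in two steps (`w`, then `Al`) to the known word, while a word with no matching prefix is returned unchanged.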





