Volume 2 No.6, November 1999

TECHNOLOGY

Machine Aided Translation Systems: The Indian Scenario

Machine Aided Translation (MAT) systems aim at achieving partially automated translation between natural languages. In India, the immediate target of research in this field has been to achieve a credible level of success in translating from English into Indian languages. Inter-translatability between Hindi and English, and among the other major Indian languages, is one of the long-term goals of the MAT enterprise in the country. At IIT Kanpur, we have achieved some success towards the first goal. While efforts are on to further consolidate these results, we present here an overview of the work being done at the Institute.

ANGLABHARTI and ANUBHARTI are two of the machine translation systems developed at the Department of Computer Science and Engineering, IIT Kanpur. The ANGLABHARTI system uses a pattern-directed approach to translate from English to Indian languages. Different patterns of the source language have been captured by examining a corpus of the source language. Since many of the Indian languages are similar in nature, an intermediate code, or pseudo-target, catering to a class of target languages is generated. The translation in a given target language is then produced by a text generator for that language. The ANUBHARTI system uses the example-based approach. This approach assumes that past events are stored in memory and that a new event is handled by first recalling the past events that are similar to the new input. The developer of the system does not have to build handcrafted rules; instead, rules are acquired from the examples in implicit form. We have implemented a Hindi-to-English translation system using this approach.

The ANGLABHARTI System

ANGLABHARTI is an English to Indian languages machine aided translation system. As English continues to be the major link language, the need for translation systems from English to other languages has been growing steadily in both the government and commercial sectors. However, perfect translation for an open domain is not feasible with the current level of knowledge and resources available.

ANGLABHARTI aims at providing a practical approach to machine translation, wherein an attempt is made to get most of the job done by the machine and only about 10% of the task is left for human post-editing.

The ANGLABHARTI system uses a pattern-directed approach, with context-free-grammar-like structures for analyzing source language sentences. A pseudo-target is generated which is applicable to a group of Indian languages. A set of rules is acquired through corpus analysis. In these rules, the plausible constituents are identified through partial pattern matching, and the movement rules for the pseudo-target are constructed with respect to these constituents.

Pattern-directed parsing is performed on the source language, English. The parsing makes use of the syntactic and semantic information of the words of the sentence obtained through morphological analysis. The resulting patterns are matched against the left-hand sides of the rules stored in the rule-base. When a match is found, the corresponding rule is invoked and its right-hand side is applied to convert the input sentence into the pseudo-target. Multiple rules may be invoked when multiple pattern groupings are possible at the source level. In such a case, more than one pseudo-target is generated, leading to multiple translations.
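As a rough illustration only (the article reports that the actual rules were written in Prolog), the matching of a pattern against rule left-hand sides can be sketched in Python. The tag names, the two rules, the `apply_rules` function and the sample sentence are all assumptions made for this sketch and are not taken from the ANGLABHARTI rule-base.

```python
from typing import List, Tuple

# Hypothetical rule-base. Each rule pairs a left-hand side (a sequence of
# constituent tags) with a right-hand side (the order in which the matched
# constituents appear in the pseudo-target).
RULES: List[Tuple[Tuple[str, ...], Tuple[int, ...]]] = [
    (("NP", "V", "NP"), (0, 2, 1)),  # subject verb object -> subject object verb
    (("NP", "V"), (0, 1)),           # subject verb -> order unchanged
]

Constituent = Tuple[str, str]  # (tag, text)


def apply_rules(constituents: List[Constituent]) -> List[List[str]]:
    """Return one pseudo-target for every rule whose left-hand side matches."""
    tags = tuple(tag for tag, _ in constituents)
    pseudo_targets = []
    for lhs, order in RULES:
        if lhs == tags:
            # Apply the right-hand side: reorder the matched constituents.
            pseudo_targets.append([constituents[i][1] for i in order])
    return pseudo_targets


sentence = [("NP", "Ram"), ("V", "eats"), ("NP", "a mango")]
print(apply_rules(sentence))  # [['Ram', 'a mango', 'eats']]
```

The function returns a list because several rules may match the same input, in which case each match contributes its own pseudo-target. Exact tag-sequence matching is a simplification; the partial pattern matching described in the article is not modelled here.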

A number of semantic tags have been used to resolve sense ambiguity in the source language. Alternative meanings for the unresolved ambiguities are retained in the pseudo-target language code. A text generator module for each of the target languages transforms the pseudo-target code into the target language. A corrector for ill-formed sentences is used for each of the target languages. Finally, a human-engineered post-editing package is used to make the final corrections. The human post-editor needs to know only the target language.

Potentialities of the System

The currently implemented English-to-Hindi translation system is able to translate a variety of sentences. Very good domain-specific results have been obtained. We have tested the system on health campaign material, and several pamphlets and medical booklets have been successfully translated.

It is observed that the ANGLABHARTI approach offers quick prototyping of the system. To start with, some source language patterns were taken from a book and converted into rules, and these rules were easily implemented in the programming language Prolog. Expectation-driven rules were formulated so that they can identify the correct role, i.e., the right syntactic category, of each word in a particular sentence. It is observed that a small set of about 50 rules takes care of more than 70% of common-usage English sentences, excluding interrogative sentences. The developer need not worry about developing a formal parser: the rules based on observed patterns from the corpus, together with the Prolog interpreter, perform the task of parsing. The rules are added incrementally based on observation of, and experimentation with, uncovered source patterns. At no stage does the developer have to worry about complete coverage. The system grows with time as more and more experimentation is done.

Problems in System Development

Some of the difficulties in building the lexical database are as follows:

· Right selection of target language meaning

· Right selection of semantic tag corresponding to a meaning

· Correct judgement regarding the verb's selectional properties

· Right decision about the amount of knowledge to be fed into the lexical database, such as the number of meanings, entry of semantic tags, etc.

Some of the difficulties in constructing the rule-base are as follows:

· Difficulty in determining the effect on the rule-base when a new rule is added: owing to the interacting nature of rules, the addition of a new rule may trigger changes in the rule-base as a whole. Such changes need to be kept track of.

· Multiple parsings: when a sentence is input for translation, we often get multiple translations. These are generated because the system retains all the meanings that it is unable to disambiguate while resolving the fired rule in the rule-base. At times, however, multiple translations are generated because the system fires more than one rule in the rule-base.
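The first source of multiple translations, retained meanings, can be pictured with a small Python sketch. It assumes, purely for illustration, that the pseudo-target is a list of slots, each holding every meaning still retained for one word; the tokens below are placeholders, not real lexicon entries.

```python
from itertools import product
from typing import List


def expand_alternatives(pseudo_target: List[List[str]]) -> List[str]:
    """Enumerate every combination of the meanings retained per word."""
    return [" ".join(choice) for choice in product(*pseudo_target)]


# Placeholder tokens: the second word kept two unresolved senses.
pseudo_target = [["w1"], ["w2_sense_a", "w2_sense_b"], ["w3"]]
for candidate in expand_alternatives(pseudo_target):
    print(candidate)
```

Under this assumption the number of candidate translations is the product of the slot sizes, so each additional unresolved word multiplies the choices left to the human post-editor.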

The ANUBHARTI System

The other approach used by several MAT systems is the example-based approach, which uses already translated examples to produce the translation of a given input through an analogical process. The developer does not build handcrafted rules for translation; instead, the rules of the source and target languages are acquired from the translated example sets in implicit form. However, a suitable distance function needs to be designed which can correctly predict the degree of closeness between an example source sentence and the input sentence.

The example-based approach needs a database of examples (an example-base) for translation. This example-base contains a corpus of source language to target language translation pairs. Along with these pairs, the mapping between the words of the source language sentence and those of the target language sentence is also stored in the example-base.

When an input sentence is fed for translation, it is first analyzed morphologically and then matched against the source language sentences of the existing example-base; for each of them, the distance from the input sentence is evaluated using some similarity measure. The target language sentence corresponding to the minimum distance is then retrieved and, using the stored mapping, the target language translation is generated.
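A minimal Python sketch of this retrieve-and-map step follows. It is not the system's implementation: the `Example` record, the position-wise `distance` measure, the word-for-word `lexicon` and the placeholder tokens are all assumptions chosen to keep the example small.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Example:
    source: List[str]   # source language words
    target: List[str]   # stored translation of the source sentence
    mapping: List[int]  # mapping[i] = source position behind target word i


def distance(a: List[str], b: List[str]) -> int:
    """Toy similarity measure: differing positions plus length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def translate(words: List[str], example_base: List[Example],
              lexicon: Dict[str, str]) -> Optional[List[str]]:
    """Reuse the word order of the closest example to build a translation."""
    if not example_base:
        return None
    best = min(example_base, key=lambda ex: distance(words, ex.source))
    if len(best.source) != len(words):
        return None  # this sketch only reuses examples of the same length
    # Walk the target positions and translate the aligned input word.
    return [lexicon.get(words[j], words[j]) for j in best.mapping]


# Placeholder data: lowercase tokens are "source", uppercase are "target".
example_base = [
    Example(["x", "y", "z"], ["X", "Z", "Y"], [0, 2, 1]),
    Example(["p", "q"], ["Q", "P"], [1, 0]),
]
lexicon = {"x": "X", "y": "Y", "z": "Z", "w": "W", "p": "P", "q": "Q"}
print(translate(["x", "w", "z"], example_base, lexicon))  # ['X', 'Z', 'W']
```

The input differs from the first example in one word, so that example is the nearest one, and its mapping supplies the target word order while the lexicon supplies the new word.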

The Approach Used in ANUBHARTI

The ANUBHARTI system uses a hybrid approach called HEBMT, which combines the essentials of the pattern-based approach with those of the example-based approach. It makes use of an abstracted example-base instead of a raw example-base. When an input sentence is fed for translation, it is first analyzed morphologically and then passed on to a finite state machine, which identifies the syntactic units (verb phrase, noun phrase, etc.) of the sentence. Then, an appropriate partition of the abstracted example-base is searched for, based on the number and the type of the syntactic units. If such a partition exists in the example-base, a distance matrix is computed for the input sentence and the example sentences of that partition, and the example with the minimum distance is invoked for target language translation. If such a partition does not exist, the input sentence is entered as an example in the example-base and the user/developer is asked to enter the target language pattern. This leads to the growth of the example-base for future translations. It is also observed that the example-based approach followed in ANUBHARTI provides a generic model for translation among Indian languages, and that it is possible to deal with any language pair with minimal addition of modules.
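The partition lookup and the growth of the example-base can be sketched in Python as follows. This is an illustration under stated assumptions rather than the HEBMT code: a syntactic unit is reduced to a (type, semantic tag) pair, the partition key is the tuple of unit types (which fixes both their number and their types), target patterns are opaque strings, and the class and function names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple

Unit = Tuple[str, str]           # (syntactic type, semantic tag)
Entry = Tuple[List[Unit], str]   # (abstracted example, target pattern)


class AbstractedExampleBase:
    """Example-base partitioned by the number and type of syntactic units."""

    def __init__(self) -> None:
        self.partitions: Dict[Tuple[str, ...], List[Entry]] = defaultdict(list)

    @staticmethod
    def key(units: List[Unit]) -> Tuple[str, ...]:
        return tuple(unit_type for unit_type, _ in units)

    def add(self, units: List[Unit], target_pattern: str) -> None:
        self.partitions[self.key(units)].append((units, target_pattern))

    def best_example(self, units: List[Unit],
                     distance: Callable[[List[Unit], List[Unit]], int]) -> Optional[Entry]:
        candidates = self.partitions.get(self.key(units))
        if not candidates:
            return None  # no partition for this combination of units
        return min(candidates, key=lambda entry: distance(units, entry[0]))


def tag_mismatches(a: List[Unit], b: List[Unit]) -> int:
    """Toy distance: number of positions whose semantic tags differ."""
    return sum(x[1] != y[1] for x, y in zip(a, b))


def find_or_learn(base: AbstractedExampleBase, units: List[Unit],
                  distance: Callable[[List[Unit], List[Unit]], int],
                  ask_user: Callable[[List[Unit]], str]) -> str:
    """Return the target pattern of the closest example, or learn a new one."""
    match = base.best_example(units, distance)
    if match is not None:
        return match[1]
    pattern = ask_user(units)  # the user/developer supplies the target pattern
    base.add(units, pattern)   # the example-base grows for future inputs
    return pattern
```

Only the examples in one partition are compared with the input, which is where the efficiency of the search comes from in this sketch; the distance function itself is deliberately trivial here.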

Some of the distinguishing features of the HEBMT system are the following:

· Utilization of a filtered and abstracted example-base derived from a corpus of examples

· Efficient search using partitioning of the example-base

· Simple matching function to invoke the best example

· Use of syntactic groups in the input sentence for matching and transfer to the target

Problems in System Development

· Difficulties in morphological analysis:

In the prototype ANUBHARTI system, the source language, Hindi, is a verb-final language in which most verb phrases are combinations of a noun and a simple verb, or of an adjective and a simple verb. Therefore, the verb phrase is identified from the end of the sentence, and morphological analysis is then performed on the rest of the sentence. This requires a different strategy to be developed for morphological analysis.

· Difficulties in building the finite state machine:

· The developer needs to examine a large corpus of the source and target languages to build the finite state machine.

· The developer has to feed source and target language dependent knowledge into the finite state machine, and this knowledge may sometimes be incomplete or incorrect. Each language has some peculiar types of sentences for which word-level matching is needed for good translation, but in the HEBMT system syntactic units are matched, which restricts the translation pattern. The HEBMT system therefore needs an additional layer: a small raw example-base that contains irregular sentences as source and target language pairs.

· Design of an efficient distance function: The distance function retrieves the best example for a given input from the abstracted example-base, and the translation is then generated according to the target pattern of that example. We have defined a simple distance function which depends upon the type and semantics of each syntactic unit in the sentence. Semantic information is retrieved from the lexical database, which is a dictionary plus a knowledge base with grammatical, semantic and syntactic information. A list of semantic tags and their hierarchy has been designed for the HEBMT system. However, no list of semantic tags and hierarchy can be claimed to be both necessary and sufficient. Therefore, the distance function may sometimes fail to pick up the best example. We need to use dynamic programming so that the weights are adjusted dynamically and the distance function keeps improving as the example-base grows.
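One way to picture a distance function that depends on the type and semantics of each syntactic unit is the Python sketch below. The tag hierarchy, the tag names, the path-length measure and the fixed `type_penalty` weight are all assumptions made for illustration; the article does not give the actual HEBMT tags or weights, and the dynamic adjustment of weights is not modelled.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical semantic tag hierarchy: each tag points to its parent.
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "animate": "entity",
    "inanimate": "entity",
    "human": "animate",
    "animal": "animate",
    "tool": "inanimate",
    "food": "inanimate",
}

Unit = Tuple[str, str]  # (syntactic type, semantic tag)


def ancestors(tag: str) -> List[str]:
    """Return the tag followed by its ancestors up to the root."""
    chain = []
    current: Optional[str] = tag
    while current is not None:
        chain.append(current)
        current = PARENT[current]
    return chain


def tag_distance(a: str, b: str) -> int:
    """Number of hierarchy edges between two tags via their common ancestor."""
    up_a, up_b = ancestors(a), ancestors(b)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    return len(up_a) + len(up_b)  # no common ancestor


def sentence_distance(a: List[Unit], b: List[Unit], type_penalty: int = 10) -> int:
    """Sum per-unit distances; a type mismatch costs a fixed penalty."""
    if len(a) != len(b):
        raise ValueError("sketch compares sentences with the same number of units")
    total = 0
    for (type_a, tag_a), (type_b, tag_b) in zip(a, b):
        total += type_penalty if type_a != type_b else tag_distance(tag_a, tag_b)
    return total


print(tag_distance("human", "animal"))  # 2
print(tag_distance("human", "tool"))    # 4
```

In this sketch two tags that share a close ancestor are treated as near each other, so an example whose noun phrase is tagged "animal" would be preferred over one tagged "tool" when the input noun phrase is tagged "human".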

Integrated Approach to MAT

Both the ANGLABHARTI and ANUBHARTI systems have been implemented and show good results. However, neither approach can be expected to give the best results in all cases.

From our experience, we find that while the bulk of the patterns can be easily handled with a rule-base as used in ANGLABHARTI, phrases and peculiar patterns can be easily handled using ANUBHARTI.

We are proposing an integrated approach to machine translation which will include modules from both approaches and will heuristically select the approach on a case-by-case basis. The system may first invoke examples and, on failure, fall back to invoking the rule-base. Obviously, the system cost in such a case will be high. However, a parallel VLSI implementation with a specific architecture will provide both speed and cost effectiveness when produced in bulk. We are also working on the design of Application Specific Integrated Circuits (ASICs) for implementing part of these schemes in hardware.
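The examples-first, rules-second order mentioned above can be sketched in a few lines of Python. The two translator callables are stand-ins assumed for this sketch, with the example-based one returning None to signal failure; the heuristic selection that the proposal envisages is not modelled.

```python
from typing import Callable, Optional, Tuple


def integrated_translate(
    sentence: str,
    example_translate: Callable[[str], Optional[str]],
    rule_translate: Callable[[str], str],
) -> Tuple[str, str]:
    """Try the example-base first and fall back to the rule-base on failure."""
    result = example_translate(sentence)
    if result is not None:
        return result, "example-base"
    return rule_translate(sentence), "rule-base"


# Stand-in translators, used only to show the control flow.
known_examples = {"good morning": "<stored translation>"}
print(integrated_translate("good morning", known_examples.get,
                           lambda s: "<rule-based translation>"))
print(integrated_translate("an unseen sentence", known_examples.get,
                           lambda s: "<rule-based translation>"))
```

Returning the name of the module that produced the output alongside the translation is a design choice of this sketch; it would let a post-editor see which path handled a given sentence.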

Ajai Jain

Department of Computer Science and Engineering

Indian Institute of Technology

Kanpur - 208016

E-mail: ajain@iitk.ac.in

