Tuesday, February 26, 2013

Dispelling MT Misconceptions

MT in 2013 is still a complex affair, requiring skills, expertise and understanding that are not commonplace, to deploy it successfully as a productivity-enhancing technology for business translation needs. While it has become much easier to build a basic custom engine using one of the various instant Moses solutions, or by creating a dictionary for an RbMT system, there are still very few who know how to coax MT system output to consistently productivity-enhancing levels. Getting some kind of basic engine up and running is NOT the same thing as having a production-ready, post-editor-friendly system. There are even fewer who know what to do if the first MT attempt does not work, or is lackluster. Most of these basic/instant MT systems are inferior to the essentially free online MT from Microsoft and Google. Building long-term productivity and strategic production advantage requires much more skill, expertise and experimentation than most LSPs or users have access to, or care to invest in. While it is sometimes possible for a user to get usable MT output after throwing some data into an instant MT/Moses engine, it is not common, even for “easy” languages like Spanish, as several TAUS case studies reveal.

It is my sense that MT is still complex enough that meaningful expertise can only be built around one methodology, i.e. RbMT or SMT, and that anybody who tells you they can do both should be viewed with some skepticism. It is almost certain that they cannot do both well, and quite likely that they cannot do either well if they claim expertise in both, since very different kinds of skills are required. Specialization and long-term experience are necessary to build real competence with either approach.

We have reached a point today where many more MT systems are successful, but we also have many mediocre systems that provide no long-term production/productivity leverage and can easily be duplicated by any competitor with minimal investment. Today it is quite easy to find many (usually bad) examples of free/instant MT, but the best custom systems are still not widely known or commonplace. Good MT system development takes work and ongoing investment, and requires overall process modifications, communication and expectation management, not only technology investments.

Recently we have seen articles in the blogosphere, and even in the mainstream professional translation press, that continue to provide what I believe is a lop-sided and even somewhat disingenuous view of the verifiable use and known best practices of various MT technologies. (This link gets you to the full article.) In this particular case it is fairly clear that the author has a preference for, and a bias favoring, an RbMT approach where the value-add is generally limited to building dictionaries.

The misinformation is typically around the following concepts:
  • Rules-Based vs. Statistical MT Comparisons
  • The scope and extent of possibilities with instant MT customization
  • The degree of expertise and experience required to develop skills in any of these approaches

Firstly let me state my own biases:
  1. I think the Rules-based MT vs. Statistical MT argument is largely irrelevant, even though it is increasingly evident that SMT is becoming the preferred approach, especially as more linguistic information is added to the data-driven approach. To a great extent, most systems out there, except for raw Moses systems, are hybrids of some sort. Recently MT technology has evolved to a point where SMT and RbMT concepts are being merged into a single “hybrid” approach. While there is some overlap in these approaches, there are two primary hybrid models in use today:
    a) RbMT with SMT smoothing tacked on after the RbMT translation is completed, as with Systran, to help improve the fluency and quality of the often clumsy raw RbMT output, and
    b) Linguistically informed rules that modify the source text before SMT processes it and guide the SMT processing, plus additional rules applied after SMT processing to normalize and adjust the translation output where required. There are also the newer syntax-based and morpho-syntactic SMT approaches, which have shown limited success and are still emerging.
    Finally, what really matters is how much productivity an MT system offers; the RbMT vs. SMT issue is largely irrelevant. The objective is to get translation work done faster and more cost-effectively.

  2. In the right hands, both approaches (RbMT or SMT) can work for projects where MT is suitable. However, there are many more user controls and much simpler tuning options available in the SMT world.
  3. In general I would say that it makes sense to specialize in one MT approach (SMT or RbMT) and go deep to understand what you can control and how it works, rather than take shallow, instant approaches. It takes work and extensive experimentation to develop real expertise in either approach, and there is nobody I know in the industry who can do both well. So choose RbMT or SMT and figure out what it takes to make it REALLY work, rather than running the kind of shallow tests that Lexcelera does and drawing definitive conclusions from the results, as described in the article. Many of the conclusions drawn in the article are more a reflection of the quality of their effort than of the actual possibilities of the technology in more skillful hands.

Some of the specific claims and disinformation in the Multilingual article referenced above that I would challenge and dispute are as follows:

“In our experience, languages such as Japanese and German perform best with an RBMT approach.” This was actually true in the early SMT days (~2005-2007) but is simply no longer accurate. I have seen custom SMT (if done right) outperform customized RbMT systems in both these languages, even when large amounts of data are not available.

“If you do not have enough data — we're talking millions of segments of in-domain bilingual and monolingual segments — you may not have enough corpora to train an SMT engine.” This seems to me to be a statement often made by people who have little or only shallow experience with SMT. In the large majority of SMT systems I have been involved with, this volume of training data was simply not available. However, it is possible to get productivity-enhancing SMT engines with as few as 50,000 segments if you know what you are doing. This is possible even for languages like Japanese and Russian, as Scott Bass of Advanced Language Translation points out in this webinar, where it was done with a fraction of the data mentioned in this misleading statement. A large majority of Moses MT engines, especially those of the instant kind, are inferior to the free MT provided by Google and Microsoft. This is more likely related to a lack of understanding of the technology than to any fundamental deficiency in the basic technology or the data, as the Multilingual article suggests. If data privacy or copyright is not an issue, most LSPs would probably be better off using the Microsoft Hub option than some generic instant MT option or an LSP-managed Moses effort.

“If the terminology is fixed in a narrow domain such as automotive or software documentation, RBMT or a hybrid is generally the best choice. This is because the rules component protects terminology better.” While this may be true for systems developed by naïve Moses users, many SMT experts like Asia Online have figured out that terminology really matters and know how to use it. Most of the corporate SMT systems out there focus exactly on automotive and IT product user documentation of various kinds, in addition to unstructured content. It is in fact possible to build a single automotive engine (at Asia Online) and then tune it for different clients (Toyota, Honda etc.) so that each client's preferred terminology dominates, IF you know what you are doing. See the diagram below for an example.

“Wild West content where the terminology runs all over the map and would be impossible to train for, such as patents, works better with SMT.” This again suggests the author's lack of experience with the patent domain and a basic unfamiliarity with SMT technology. The largest terminology effort I have seen was with a patent engine, where tens of thousands of scientific and technical terms were carefully translated to ensure accurate and useful translation of patent material. SMT benefits greatly from good, consistent terminology work, and we have several customers (e.g. Sajan) who have gone on record to say that terminology consistency was one of the major benefits of an Asia Online engine. In fact, the strategy deployed by Asia Online in data-scarce situations usually begins with a tightly focused terminological foundation.

“However, if there are metadata tags, you should be aware that SMT doesn't preserve tags well, so RBMT or hybrid technology will save you some headaches.” While this may be true for many Moses efforts made by technically naïve and unskilled users, any SMT developer worth his or her salt knows how to resolve this problem easily. Asia Online handles all the formatting tags in XLIFF and TMX automatically, and also provides a variety of tools that allow power users to do sophisticated handling of different kinds of formatting.
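To make the tag problem concrete: a standard way to keep an SMT engine from mangling inline markup is to mask tags with opaque placeholders before translation and restore them afterwards. This is a minimal sketch of that general technique, not Asia Online's actual implementation; the function names and placeholder format are my own.

```python
import re

# Matches XLIFF/TMX-style inline tags such as <b>, </b>, <ph id="1"/>
TAG_RE = re.compile(r"<[^>]+>")

def mask_tags(segment):
    """Replace each inline tag with a stable placeholder token that the
    MT engine is unlikely to alter, and remember the original tags."""
    tags = []
    def repl(match):
        tags.append(match.group(0))
        return f"__TAG{len(tags) - 1}__"
    return TAG_RE.sub(repl, segment), tags

def unmask_tags(translation, tags):
    """Restore the original tags from the placeholders in the MT output."""
    for i, tag in enumerate(tags):
        translation = translation.replace(f"__TAG{i}__", tag)
    return translation

masked, tags = mask_tags("Press <b>Start</b> to begin.")
# masked == "Press __TAG0__Start__TAG1__ to begin."
```

In a real pipeline the masked segment goes to the engine and the placeholders are re-expanded in the translated output; production systems also handle tag reordering when word order changes, which this sketch does not attempt.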

“Today's SMT systems are still hampered by a lack of predictability, which means that translators waste a lot of time verifying terminology that already ought to be automatically verified.” Asia Online ran an experiment a few years ago using TDA data from multiple sources. It was discovered that combining data, or using noisy data of any kind, produces much lower quality MT systems. Understanding how to get the data clean and build a quality foundation makes ongoing maintenance and updates of the engine much easier, and largely eliminates this unpredictability. We also discovered that consistent terminology in the TM ensures much higher quality results, and thus at Asia Online we now have tools to ensure this. Again, if you know what you are doing this is a manageable issue, and after you have built a few thousand engines you realize that unpredictability can be managed by data cleaning and by ensuring terminological consistency. Kevin Nelson, Managing Director of Omnilingua, stated in a webinar that the terminology and writing style produced by his Asia Online MT system was even more consistent than a human-only approach. This was specifically noticed by his end client, who contacted Omnilingua directly, without prompting, to discuss how they had accomplished the recent improvements in quality.
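
The kind of data cleaning and terminology checking described above can be approximated with very simple filters. The sketch below is illustrative only, under my own assumptions about what "noisy" means (empty segments, duplicates, wild length ratios) and with a made-up termbase format; real cleaning pipelines are considerably more involved.

```python
def clean_tm(pairs, max_ratio=3.0):
    """Drop obviously noisy bitext pairs: empties, exact duplicates,
    and segments whose source/target length ratio is implausible."""
    seen, kept = set(), []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # empty side: useless for training
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_ratio:
            continue  # likely a misalignment
        if (src, tgt) in seen:
            continue  # exact duplicate
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept

def term_inconsistencies(pairs, termbase):
    """Flag TM segments where an approved source term appears but its
    approved target translation does not (a crude consistency check)."""
    flagged = []
    for src, tgt in pairs:
        for s_term, t_term in termbase.items():
            if s_term.lower() in src.lower() and t_term.lower() not in tgt.lower():
                flagged.append((src, tgt, s_term))
    return flagged
```

Running filters like these over a TM before training is one cheap way to chase the "quality foundation" the paragraph above describes, even though production tools go far beyond string matching.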
“When post-editing SMT, that next training cycle may be six months or a year away because you usually want a fair bit of new data accumulated before you begin the process of retraining. In this case, the post-editors are not empowered to make lasting changes and it typically takes until the next training cycle to see any progress at all.” This may actually be true for many Moses systems and for most naïve users of instant MT solutions, but for higher value-add systems like the ones produced by Asia Online it is not. There are two ways that SMT-based systems can incorporate corrective feedback:
  1. Real-time corrections that are applied on each job and can easily be made by translators every single time they run a translation. Since there is no additional cost for retranslating the same content at Asia Online, users are encouraged to resubmit the translation until it is in better shape to hand over to a post-editor. Many dumb, high-frequency error patterns can be corrected instantly with some simple analysis and corrections based on small test translation runs.
  2. Periodic retraining, which is done when sufficient corrective feedback is available. Incremental trainings with Asia Online can be performed in just a few days, with just a few thousand segments, and can show meaningful improvements, especially with terminology and high-frequency phrases.
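The first of these two feedback paths, instant correction of high-frequency error patterns, is often implemented as an ordered find/replace rule table applied to raw MT output before it reaches the post-editor. This is a generic sketch of that idea, not any vendor's product; the class and rules here are invented for illustration.

```python
import re

class CorrectionRules:
    """An ordered list of regex find/replace rules applied to raw MT
    output, capturing recurring error patterns spotted in test runs."""

    def __init__(self):
        self.rules = []

    def add(self, pattern, replacement):
        self.rules.append((re.compile(pattern), replacement))

    def apply(self, text):
        for pat, repl in self.rules:
            text = pat.sub(repl, text)
        return text

rules = CorrectionRules()
rules.add(r"\bcolor\b", "colour")  # hypothetical client style preference
rules.add(r" ,", ",")              # fix a recurring spacing artifact
```

Because the rules run on every job, a fix made once keeps applying until the next retraining folds the correction into the engine itself, which is exactly the "lasting change" the quoted article claims post-editors cannot make.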

Perhaps the biggest misconception of all is that More Data is Always Better. We now have much more evidence that this is frequently not true. Even Google, the high priest of big data, admitted this some time ago: "We are now at this limit where there isn't that much more data in the world that we can use."

So be careful not to believe everything you read (including on this blog). If you take more than a glancing look at MT technology today, you will probably understand that while it is becoming much simpler to play and experiment with MT, it is still a long way from being easy to produce production-quality systems that provide long-term business leverage. Do not underestimate the expertise required to be successful with MT, and realize that even after jumping in with Asia Online or others, it will take ongoing changes in process and human-factor management to really achieve long-term cost advantages and build sustainable business leverage. The reward for those who figure this out will be clear differentiation and long-term production cost advantages that others with instant MT or home-brewed Moses systems will never be able to match.

MT is messy, and not yet as predictable as most want it to be. You need a stomach for uncertainty, and you are probably better off with "real experts" than with people who say they can do it all and are "technology agnostic". And the next time you see an article claiming to have all the answers, promising that for a nominal service charge you could reach nirvana tonight, just tell them: "Don't you jive me with that cosmic debris!"

Watch this video and feel your face melt at 4:50 when the guitar solo happens.