Welocalize Presents

Advancements in the World of MT

Welocalize Season 1 Episode 5

In this fifth episode of the podcast, Welocalize AI and MT expert Lena Marg chats to host Louise Law.  

Lena talks about the adoption levels of machine translation (MT) across industries and some of the crucial topics surrounding MT, including toxicity in MT and data privacy. Listeners can expect to gain a better understanding of the MT landscape and learn expert tips on how to get started. 

“There’s a huge variation of MT adoption across brands – for many large technology clients, MT is commonplace and they’re looking at how to add more natural language processing and AI solutions into the mix. For some industry areas, particularly highly regulated areas like life sciences and legal, clients are more cautious and need more customization and help with their MT deployments.” Lena Marg, Director, AI & MT Operations, Welocalize 

In this podcast episode, Lena and Louise talk about… 

  • Who is using MT? Adoption rates across industries 
  • The complexity of the MT landscape 
  • Is toxicity in MT something we can solve? 
  • The impact of MT data training and privacy issues on how brands approach MT 

About the Welocalize podcast 

The Welocalize podcast is dedicated to exploring the world of multilingual communication and the technologies that enable brands to reach global audiences. Guests share their expertise and stories on topics related to language, localization, and translation to help brands create the best customer experience.

Louise Law:

Hello and welcome to the Welocalize podcast, where we feed our curiosity and talk about the most important and challenging topics in language, multilingual content, localization, and much more, helped along by a wide variety of guests. I'm Louise Law, your Welocalize host, and in this podcast episode I'm joined by Lena Marg, who is director of Machine Translation Operations at Welocalize. She's a well-known figure in the language industry, often speaking at industry and academic events about the intersection of language and technology. Lena, welcome to the podcast.

 

Lena Marg:

Hi, Louise. Thanks for having me today.

 

Louise Law:

So Lena, we always talk about machine translation and we know MT is not a new topic. It's been talked about for decades. But obviously the advancement of MT and how it is used in language programs is still an important area, especially now, as many brands need to translate more with less and sometimes they can be unsure as to how best to deploy and manage an effective MT program.

 

Lena Marg:

I've been in this for so many years now, but it's interesting to see that the interest is just not fading at all and we keep getting requests and interest in this space. Absolutely.

 

Louise Law:

Let's just have a quick potted history of MT for people that might be interested. It came into use in the mid-50’s in research and then started to really take off in the mid-90’s when IBM leveraged statistical MT models. And then around 2017 there was another leap towards neural MT, which uses the power of AI and neural networks. NMT is pretty much the norm now and is what most programs operate with today.

 

Louise Law:

That's a quick history of MT, but you've been working in it for many, many years. Could you tell our listeners how you got into the field of MT?

 

Lena Marg:

Yeah, I don't think it's necessarily the most obvious way, because I had actually just graduated in conference interpreting. But probably there was a bit of a memory of my days at university, where they had machine translation and computational linguistics as well. And then I got my first job as a computational linguist, which effectively was a combination of post-editor and linguist for German, in that case managing the then rule-based engines.

 

You know, you just talked about that. So that was kind of where I started looking after the engines for German, for the different customers that were effectively using them. I really like that combination of finding patterns in language. What are the exceptions? What are the rules? How can we inform those dictionaries and grammars of those engines to give me better output?

 

It helped me as a post-editor. Sometimes I just wondered whether I'm just naturally lazy and it suited me to not translate from scratch those hundreds of thousands of automotive manuals.

 

Louise Law:

So obviously, you took all this curiosity and you’ve worked for many years and you now work with many, many of our global brands to integrate MT into their localization and language programs. As things stand today, how confident would you say brands are in adopting MT and other AI and natural language processing tools into their programs?

 

Lena Marg:

I see a huge sort of variation. So I would say that if we look at the localization industry, the more classical localization industry and specifically the larger tech companies, the enterprise clients as we call them, I'd say for them machine translation is very commonplace now. You know, they've been doing it for a while. They have quite mature programs, and in their case it's more about how to add long-tail languages or, exactly as you said, add other NLP features, other AI solutions into the mix.

 

But there are other industries like life sciences, possibly because of, you know, the more regulated nature of the content, where I think this is really just catching on now, and we see a lot of activity right now. And again, of course, in automotive and manufacturing, some of them have been in there for a very long time, having their manuals done. So it really varies.

 

Louise Law:

So it almost depends on the industry and what the prominent content types are, and how it's deployed. You know, the actual MT landscape is full of providers and partners and how we can use it and everything like that. It can be quite complex to navigate that landscape. Very briefly, because we don't have long today, what would you say the current landscape looks like in MT?

 

Lena Marg:

My team and I, having been in this together for a while, are often most surprised at just how complex it feels these days. You know, you always sort of think it's going to be simpler, but in reality, for us it feels like it's just an increase in complexity: of providers, like you said, and also available connections, integrations, quality levels, other features, how these can be combined, and so on.

 

I guess the interesting aspect, of course, is on the flip side: if you are a first-time user of machine translation, maybe all you see is one provider, right? So if we think about DeepL and their huge breakthrough, I think in some markets, you know, that's almost all people talk about. And so that's probably the only solution they turn to right now.

 

Equally, Google have recently released their Translation Hub, or obviously now the talk about ChatGPT is going to get a lot of people interested in very specific solutions that they'll turn to. But I think the more you start looking into it, that's when the complexity becomes clear, especially when you want to deploy it in a workflow with a human-in-the-loop aspect, content and editing and so on.

 

Louise Law:

There are a lot of names referenced there that we've seen, and they're all over the place. To take you back to basics, how would somebody get started with MT? Would they go for a simple plug and play or a more customized solution suited to their different types of content and industry? And then maybe also, as you mentioned before, integrate a human element whereby the MT output goes through a post-editing process? How do people get started with MT?

 

Lena Marg:

I think, like you said there, the use case is super important: understanding what you want to use it for. And then maybe a plug and play is where you can start. Maybe that's all you need initially and you can build from there. I think generally I would recommend, as you sometimes call it, a pilot, right?

 

You start with a narrowly defined set of languages and content types and you start exploring there: what is it I can do here? What do I have implemented? And you gradually grow and build your program from there. And that doesn't have to take very long. It's just a matter of establishing your baseline of what it is, you know, I can achieve with my core languages, core markets, and then seeing from there how much further I want to invest and go, and what is the best fit for my program.

 

Also, of course, you know, no bias here, but it's worthwhile understanding clearly what your role is in that. What is your knowledge, what are your limitations, and reaching out to partners where you feel you're stuck or you don't have that knowledge to take you further. There are a lot of, you know, ways now out there as well to find partners to support you with your MT deployments.

 

Louise Law:

Start small, but also work with people that you trust.

 

Lena Marg:

And yeah, I would say so.

 

Louise Law:

Let's just move on to a couple of important areas surrounding MT. We're seeing a lot of media coverage around what's called toxic MT and also data privacy surrounding MT. So let's talk a little bit about toxicity in MT, which is where some of the translations can, say, incite hate, abuse or even violence against groups because errors have been introduced in the source or it has been mistranslated.

 

Can you tell us a little bit more about it? Because obviously, you know, it's a topic that you're aware of, and what people like Welocalize can do about it. How can we prevent that bias in some of the data that MT is working with?

 

Lena Marg:

You're touching on a good point there, because I think, to clarify upfront, with toxicity in itself we're almost talking about source content. It can be produced in the target text, but then it would be described as so-called added toxicity. Typically, you know, we find it more, I guess, in the space of social media and user-generated content.

 

Fundamentally, I think there are two things to consider. One, which Alissa Rubin, speaking from college, also talk about in their paper, is to have a quality assurance methodology that has a category that flags toxicity and considers that as a very critical error. But also, exactly like you said, it's about the data, right? So the data might already be biased.

 

So if you have less good quality data or the data are too sparse, then maybe there is an added risk of leading to this added toxicity. Right. So those are, I think, the two things I would probably consider here. I think for companies like ours, the importance is sort of how do we capture it, how do we track it, if we think that the content type or the workflow lends itself to this problem.

 

Louise Law:

Certain content, as you say, can be more prone to toxicity than others. So in a way, it's kind of working as a team to make sure that we're identifying those potential risk areas where toxicity could appear.

 

Lena Marg:

Most typically right now, it's not one of our primary concerns, and that's probably because we mostly work with professionally authored content, right? So that's in our industry. And also, of course, we have very rigid QA mechanisms using, for example, the MQM framework, which also comes with the flexibility to add categories. And we have post-editing, right?

 

So in our workflows, I think we're fairly well set up. But of course, if we do raw machine translation and the content lends itself, we should absolutely consider this aspect and make sure we have mechanisms to catch toxicity.

 

Louise Law:

Moving on to privacy, it's a really important part of the MT discussion, and this is especially important in some of these highly regulated areas that you mentioned before, like life sciences and financial services. They've got highly confidential content that needs to be protected. What words of wisdom would you give to an organization to make sure that they're confident that their MT systems are protected and that all the data privacy is as good as it possibly could be?

Lena Marg:

Privacy actually comes up for two reasons within our conversations. So the first one is that translation buyers, or maybe their InfoSec departments, may have concerns around using certain MT providers as they're unclear what would happen with their data. There are different setup scenarios here with regards to server locations, dedicated workspaces, data vaults and so on. And of course these providers have disclaimers in terms of how data is handled that we would share with them. Referring back to what we discussed earlier as well, having this complexity in the landscape can help here, of course, because I'm sure there's a solution for everyone out there.

The other aspect around privacy, which we increasingly are contacted about, actually, is when people within organizations use, for instance, Google Translate, Microsoft Bing or DeepL in their day-to-day work without a secure setup. And that's where customers then get concerned and think, actually, could we offer this to employees but in a secure manner, sort of a secure translation tool, as it were?

 

Louise Law:

I suppose that's kind of educating the teams and making sure people are aware of the latest regulations and what sort of secure application can help with those InfoSec concerns as well. We could really talk for much longer, but we're kind of running out of time. As we wrap up, if you could share one piece of advice for anyone listening who wants to take those first steps towards MT, or who wants to really elevate their MT program, what would that be?

What would be that one piece of advice?

 

Lena Marg:

I think it's probably going back to what we said earlier a little bit. Start defining your use case and start clarifying what your expertise is in that conversation. Like, what can you bring to the table to make that happen? Don't underestimate the effort that goes into a good MT program, as it were. And by that I simply mean that if you want there to be post-editing, for example, you want to send decent MT output to your translators for best quality and for best efficiency and fairness, quite frankly.

But also if it's for raw consumption, you will be concerned about your brand terminology, your brand style and so on. So, like I said, there are those sort of challenges that don't come with the out-of-the-box solutions. So it's kind of good being aware of your own sort of, you know, where do I need help, how far can I go on my own?

And just kind of being clear around that, maybe. And it's okay to dabble a little bit in the beginning, to play around with it a little bit, if you have a technical person on the team that wants to play around, but then realize that maybe you need to ask for help. We see recently, of course, you know, an increase in people trying to do it on their own, which is super.

But ultimately, like I said, if you want to have that good quality output, with your terminology respected and your style reflected, there's no harm in reaching out to others for support.

 

Louise Law:

Thank you so much for joining us today. You've been a super guest and we appreciate your time today and it's been great to chat. Thanks so much.

 

Lena Marg:

Thank you, Louise!