How can applications like Siri or Google Assistant communicate with us in different languages?
One of the essential AI applications in today's smartphones is the smart personal assistant, also known as the voice assistant. Experts in the field expected the number of such applications in use to reach 21.4 million by the end of 2020, a figure projected to keep growing in the coming years.
The reliance on smart personal assistants for Google searches has become so common that major companies have started allocating resources to tools specialized in optimizing content for voice search, a branch of search engine optimization (SEO). Moreover, the language produced by these applications has reached such a high level of quality that many companies now rely on them for critical aspects of their marketing and sales operations.
In August 2018, Google Assistant began supporting bilingual functionality. Before that, users could only use one language at a time with the application, and to switch languages, they had to go back into settings to change the language first.
Now, Google Assistant can easily understand two different languages simultaneously without the need to adjust the settings. The AI team at Google is also working on developing an application that can understand three different languages at the same time.
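The details of Google's bilingual models aren't public, but the core problem of deciding which of two languages an utterance is in can be sketched with a toy stop-word counter. The word lists and logic below are illustrative assumptions, not Google's actual method:

```python
# Tiny stop-word sets for two languages (illustrative, not exhaustive).
EN_WORDS = {"the", "is", "what", "weather", "tomorrow"}
ES_WORDS = {"el", "es", "qué", "tiempo", "mañana"}

def guess_language(utterance: str) -> str:
    """Guess 'en' or 'es' by counting overlap with each stop-word set."""
    words = set(utterance.lower().rstrip("?¿").split())
    return "en" if len(words & EN_WORDS) >= len(words & ES_WORDS) else "es"

print(guess_language("What is the weather tomorrow?"))  # 'en'
print(guess_language("¿Qué tiempo hará mañana?"))       # 'es'
```

A real bilingual assistant runs statistical language identification on the audio itself, but the principle is the same: classify the language first, then route the utterance to the matching recognition model.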
In today’s article, we will delve deeper into how such applications work, as well as the technologies used in their development and updates.
How can a machine learn human languages?
Behind every smart personal assistant lies highly complex technology. The manufacturers of these applications and devices must teach them how to recognize speech and also produce it. In other words, the assistants must learn to speak, listen, understand the commands they are given, and then provide appropriate responses. This becomes even more challenging in multilingual devices and applications.

Natural Language Processing (NLP):

Before diving into how personal assistant applications like Siri or Google Assistant work, we first need to understand the key technology behind them: natural language processing. Natural Language Processing, or NLP for short, is a branch of artificial intelligence that aims to develop devices and programs capable of processing linguistic data. Training computers to understand speech is by no means an easy task: while any computer can absorb vast amounts of data, its ability to handle unstructured data remains limited.
This is the case with language, as linguistic information is merely unstructured data for machines. Its nature, spontaneity, and the multiple contexts and aesthetic dimensions it carries add further complexity to the process.
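To see what "unstructured" means in practice: to a machine, an utterance is just a flat string of characters. A first, minimal structuring step is tokenization, breaking the string into word units. The sketch below is an illustrative simplification, not how production assistants actually work:

```python
import re

def tokenize(utterance: str) -> list[str]:
    # Lowercase the raw string and pull out word-like tokens,
    # imposing the first bit of structure on otherwise flat text.
    return re.findall(r"[a-z']+", utterance.lower())

print(tokenize("What's the weather like tomorrow?"))
# ["what's", 'the', 'weather', 'like', 'tomorrow']
```

Even after this step, the machine has only a list of symbols; the contexts, ambiguities, and aesthetic dimensions the article mentions are exactly what the deeper layers of an NLP system must still untangle.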
When it comes to training computers to process language, we face three main challenges:
- The foreignness of the concept of "human language" to machines.
- The nature of human language, with its diversity and reliance on countless variables.
- The limited understanding we have of how the brain works and processes language, despite advancements in this area.
So, how do applications like Siri or Google Assistant work?
Let's assume you ask Siri, the personal assistant used in iPhones, about tomorrow's weather. Here's what happens:

- First, your phone will capture your voice and convert it into text for processing.
- Through a program specialized in natural language processing (NLP), your phone will attempt to interpret the meaning of your question.
- If the sentence you said is a question, and you used the appropriate questioning tone, the program (with the help of AI) will identify semantic markers indicating that you asked a question, and these markers will be added to the text before processing.
- Words like "weather" and "tomorrow" will form the content of the question, essentially becoming keywords for the search.
- Siri will then search on your behalf online and present the results to you in the form of a voice response.
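The steps above can be sketched end to end in a toy form. Everything here is an illustrative assumption: the question words, stop words, and the stand-in search function are made up, and real assistants use trained statistical models rather than hand-written word lists:

```python
QUESTION_WORDS = {"what", "what's", "how", "will", "is"}
STOP_WORDS = {"the", "like", "a", "be", "in"}

def parse(utterance: str) -> tuple[bool, list[str]]:
    """Mark whether the utterance is a question and extract keywords."""
    tokens = utterance.lower().rstrip("?").split()
    # Semantic marker for a question: a '?' or a leading question word.
    is_question = utterance.endswith("?") or (tokens and tokens[0] in QUESTION_WORDS)
    # Content words like 'weather' and 'tomorrow' become search keywords.
    keywords = [t for t in tokens if t not in QUESTION_WORDS | STOP_WORDS]
    return is_question, keywords

def answer(utterance: str, search) -> str:
    """Run the parsed keywords through a (stand-in) search function."""
    is_question, keywords = parse(utterance)
    return search(keywords) if is_question else "Okay."

# Hypothetical stand-in for the online search step.
fake_search = lambda kw: "Results for: " + " ".join(kw)
print(answer("What's the weather like tomorrow?", fake_search))
# Results for: weather tomorrow
```

In a real assistant, the speech-to-text and text-to-speech ends of this pipeline are neural models of their own; the sketch only covers the middle, text-level portion.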
What role does voice-over play in the development of such applications?
When Siri was first launched in 2011, the application faced much criticism. Some found the overall experience poor, while others complained that the personal assistant couldn't understand their accents. This was due to a lack of diversity in the linguistic material used to train the neural networks powering the application.

In other words, these applications learn how to handle different human languages by being fed specific voice and text data. If the voice samples all come from individuals in one geographic region or with certain accents (or with a deliberately neutral accent), the application will fail in real-world settings: it won't be able to understand rare speech patterns or regional dialects derived from a particular language. This is precisely why many NLP companies now seek global voice-over services that can provide diverse voice samples with varied accents.
However, the job of voice-over artists isn't limited to recording voice commands to train these applications to understand speech. They also provide the means for these applications to respond to users. This is done through what's known as a phoneme: the smallest unit of sound that distinguishes one word from another in a language. We speak by combining these sound units.
Thus, when Siri is asked to convert written text into speech, it first searches for a voice pronunciation that has already been entered and stored in its database by voice-over artists. If it doesn't find one, the application will then try to understand the linguistic structure of the sentence or input text to determine the appropriate tone for each word.
The application will then break down the text into a mixture of sound units and search for the most suitable sounds in its database, allowing it to deliver an appropriate response.
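That lookup-then-fallback behavior can be sketched as follows. The recordings, the phoneme entries (written in ARPAbet-style symbols), and the function names are all hypothetical; real unit-selection synthesis is far more involved:

```python
# Hypothetical database of whole-word recordings made by voice-over artists.
RECORDINGS = {"hello": "hello_artist.wav"}

# Hypothetical phoneme dictionary (ARPAbet-style symbols) for the fallback path.
PHONEMES = {
    "weather": ["W", "EH", "DH", "ER"],
    "tomorrow": ["T", "AH", "M", "AA", "R", "OW"],
}

def synthesize_word(word: str) -> list[str]:
    word = word.lower()
    # 1) Prefer a stored pronunciation recorded by a voice artist.
    if word in RECORDINGS:
        return [f"<clip:{RECORDINGS[word]}>"]
    # 2) Otherwise, stitch the word together from phoneme-level sound units.
    return PHONEMES.get(word, [])

print(synthesize_word("hello"))    # uses the stored recording
print(synthesize_word("weather"))  # falls back to phoneme units
```

The design choice mirrored here is a common one in speech synthesis: whole recorded units sound most natural, so they are preferred, while phoneme-level assembly guarantees the system can still say words nobody recorded.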
In conclusion, the process by which applications like Siri or Google Assistant communicate with us in natural human language involves numerous complex operations and intricate algorithms. From recognizing speech and converting it into text, to understanding its meaning, searching for a suitable response, and converting that response back into speech for us to hear, all of this happens in a matter of seconds!