How did Siri learn English?

Let alone Arabic, Cantonese, Mandarin, Danish, Dutch, Finnish, French, German, Hebrew, Italian, Japanese, Korean, Malay, Norwegian, Portuguese, Russian, Spanish, Swedish, Thai, and Turkish?

To train speech recognition, text-to-speech, or other natural language processing capabilities, computational linguists essentially have to take language apart. Only then can they put it back together again. To do this, they use a process called “tokenization”.


Segmenting Words, Sentences, and More

First, they need to teach the computer to segment words. Segmenting words is also called tokenization, lexicalizing, or lexing. (Because why use one word where three will do?)

Sound easy?

Well, when you first heard a foreign language, could you tell where one word ended and another began? It probably helped when you learned to read, especially if the language you were learning conveniently separated words with white spaces.

Unfortunately, not all languages do that!

Mandarin, for example, does not use spaces in its writing. Neither does Thai or Japanese.

To segment the words of any language, linguists work together with computer programs to look for statistical patterns in the language. Those patterns can help the computer learn where word boundaries are.
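To make that concrete, here is a minimal sketch using jieba, a popular open-source Chinese segmenter that combines a dictionary with statistical models to guess word boundaries in unspaced text (the example sentence and output follow jieba’s own documentation):

```python
# A minimal sketch of statistical word segmentation for Chinese,
# using the open-source "jieba" library (pip install jieba).
import jieba

sentence = "我来到北京清华大学"  # "I came to Tsinghua University in Beijing"

# jieba combines a dictionary with statistical models to infer
# where word boundaries fall in text written without spaces.
words = list(jieba.cut(sentence))
print(words)  # e.g. ['我', '来到', '北京', '清华大学']
```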

They then need to do this tokenization at several levels: the word level, the sentence level, and more (depending on the purpose of the technology).

After tokenization, you are left with “tokens”: a collection of words, punctuation, links, possessive markers or other grammatical information, and more.
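As a rough illustration of two of those levels, here is a sketch using the open-source NLTK toolkit, assuming its “punkt” sentence models have been downloaded (the sample text is invented):

```python
# A minimal sketch of tokenization at the sentence and word levels,
# using NLTK (pip install nltk; the "punkt" models are downloaded once).
import nltk

nltk.download("punkt", quiet=True)

text = "Siri learned English. Then came Arabic, Thai, and more!"

# Sentence-level tokenization: split the text into sentences.
sentences = nltk.sent_tokenize(text)

# Word-level tokenization: split each sentence into tokens,
# which include punctuation marks as tokens of their own.
tokens = [nltk.word_tokenize(s) for s in sentences]

print(sentences)  # e.g. ['Siri learned English.', 'Then came Arabic, Thai, and more!']
print(tokens)     # e.g. [['Siri', 'learned', 'English', '.'], ['Then', 'came', ...]]
```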


Parts of Speech (POS) Tagging

Next, linguists need to identify and tag words based on their grammatical characteristics, or parts of speech.

Why is this important?

Depending on the purpose of the technology, certain kinds of words may be particularly important. For most technologies, proper nouns are likely to need extra checking, as they often involve names of people or brands.

Adjectives are more likely to be indicative of sentiment or feeling. These might be particularly important if the end customer wants to use this technology to perform a sentiment analysis, for example.
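Here is a sketch of what POS tagging can look like in practice, again using NLTK (the tagger model must be downloaded once, and the sample sentence is invented). Note how proper nouns and adjectives can be pulled out by tag:

```python
# A minimal sketch of POS tagging with NLTK; the tagger model
# ("averaged_perceptron_tagger") must be downloaded once.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("Siri loves the wonderful new hospital in Boston.")
tagged = nltk.pos_tag(tokens)  # list of (token, tag) pairs, e.g. ('Siri', 'NNP')

# Proper nouns (NNP/NNPS) often need extra checking; adjectives (JJ...)
# are useful signals for sentiment.
proper_nouns = [w for w, t in tagged if t.startswith("NNP")]
adjectives = [w for w, t in tagged if t.startswith("JJ")]
print(proper_nouns)  # e.g. ['Siri', 'Boston']
print(adjectives)    # e.g. ['wonderful', 'new']
```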

Interested in working with our linguists? Learn more about our services here.

What Comes Next?

That depends on the technology you want to build, or the data you want to analyze.

But once a text is properly segmented and tagged, there is a lot that a computational linguist can do with it. We can perform complex analyses of sentiment, identify trends, or build language technologies like text-to-speech engines.
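For example, here is a minimal sentiment sketch using NLTK’s VADER analyzer, one common off-the-shelf option (the example sentence is invented, and the exact scores depend on the model version):

```python
# A minimal sketch of sentiment analysis with NLTK's VADER model
# (the "vader_lexicon" data must be downloaded once).
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("The new voice assistant is wonderful!")
print(scores)  # e.g. {'neg': 0.0, 'neu': ..., 'pos': ..., 'compound': ...}
```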


Can Computers Do This Yet?

Computers can certainly help. However, natural language processing still needs a lot of help from humans. When developing any new language technology, human linguists are usually needed to help collect, analyze, and prepare customized language data.

For example, if you want to build a robot that will help with patient intake at a hospital, you will want that robot to have extensive knowledge of medical terms.

If it doesn’t know that the words “dry cough” often appear together (because they describe a common symptom), it might mishear the phrase as something like “try coffee”.
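To see why domain data helps, here is a toy sketch that counts word pairs (bigrams) in a tiny, invented medical corpus; a recognizer trained on data like this would learn that “dry cough” is far likelier than “try coffee”:

```python
# A toy sketch of bigram counting over a small, invented medical corpus.
from collections import Counter
from itertools import pairwise  # Python 3.10+

corpus = [
    "patient reports a dry cough and mild fever",
    "dry cough persisting for three days",
    "no dry cough or shortness of breath",
]

# Count every adjacent word pair in the corpus.
bigrams = Counter()
for line in corpus:
    bigrams.update(pairwise(line.split()))

print(bigrams[("dry", "cough")])   # 3 -- common in this domain
print(bigrams[("try", "coffee")])  # 0 -- never attested
```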

The more customized the technology is, the better it will perform.


How Do You Hire Tokenizers, or Computational Linguists Who Can Perform Tokenization?

To carry out tokenization, POS tagging, or other linguistic analyses, you should reach out to computational linguists. They will need to be 1) specialized in the language needed and 2) trained in advanced linguistic methods.

At Meridian Linguistics, we count on the specialized linguistic skills of hundreds of phoneticians, annotators, tokenizers, POS taggers, and more. We have carefully recruited and cultivated this network of linguists over the course of several years of language technology projects. They work in dozens of languages.

Don’t hesitate to contact us for more information.

FIND SPECIALIZED LINGUISTS NOW