Captioning Live Video

The following information is adapted from Speech Recognition Software: Past, Present & Future – GlobalMe.Net, and Speech Recognition Technology Overview – GlobalMe.Net.

Speech to Text


Speech to text (StT) software is a type of assistive technology that converts spoken words into electronic written text, supporting increased demonstration of learning and independence. StT is also referred to as dictation or speech recognition software.

One of the fastest growing areas of “everyday technology,” StT enables phones, computers, tablets, and other machines to receive, recognize, and understand human utterances. It combines legacy translation technologies with new Artificial Intelligence (AI) systems that “learn as they listen,” so the quality of the transcription improves the longer you use the service.

Modern StT uses natural language as input to trigger an action, enabling our devices to respond to our spoken commands. The technology is being used to replace other, more ‘tired’ methods of input like typing, texting, and clicking. This is a slightly ironic development, seeing as texting and typing had become the preferred methods of communication over voice calls just a few short years ago.

Today, speech recognition technology takes many forms, from dictating text messages to your smartphone while driving to asking your car to make dinner reservations at the Chinese restaurant down the road. In the latter case, StT uses your automobile’s GPS capabilities, as well as “data in the cloud,” to dynamically adapt and respond to each unique situation.

In the home, your “smart” speaker system (e.g., Amazon Alexa, Google Home) responds to “please put on that new Beyoncé song” and can talk to the other smart devices that you use in your daily life. Soon we can expect technologies with a consciousness…AI on steroids.

Speech-To-Text Applications

MS-PowerPoint – Presentation Translator

This information is adapted from Microsoft Translator’s website

Though designed not as assistive technology but as a way of providing instant language translation when presenting to a group of people who speak different languages, Presentation Translator subtitles your live presentation straight from PowerPoint and lets your audience join from their own devices using the Translator app or a browser.

As you speak, Presentation Translator displays subtitles directly on your PowerPoint presentation in any one of more than 60 supported text languages. This feature can also be used for audiences who are deaf or hard of hearing.

Up to 100 audience members in the room can follow along with the presentation in their own language by downloading the Translator app or joining directly from their browser.

The Microsoft Presentation Translator live feature is built on Microsoft Translator’s core speech translation technology, the Microsoft Translator API, which is an Azure Cognitive Service.

Presentation Translator integrates the speech recognition customization capabilities of Custom Speech Service (CSS) from Azure’s Cognitive Services to adapt speech recognition to the vocabulary used in the presentation.

What does custom speech recognition do?

  • Improves the accuracy of your subtitles by learning from the content in your slides and slide notes. In some cases, you will see up to 30% improvement in accuracy.
  • Customizes speech recognition for industry-specific vocabulary, technical terms, acronyms, and product or place names. Customization will reduce errors on these terms in your subtitles, as long as the words are present in your slides or slide notes.

How to set up custom speech in your presentation

  • The first time you customize speech recognition for your presentation, it can take up to 5 minutes for Presentation Translator to finish learning.
  • After the first time, the subtitles will start instantaneously unless you update the content of your slides.
  • Tip: start the custom speech recognition during a practice run so that you don’t experience delays when you present to your audience.

How does the custom speech recognition feature work?

The custom speech recognition feature works by training unique language models with the content of your slides. The language models behind Microsoft’s speech recognition engine have been optimized for common usage scenarios.

The language model is a probability distribution over sequences of words and helps the system decide among sequences of words that sound similar, based on the likelihood of the word sequences themselves. For example, “recognize speech” and “wreck a nice beach” sound alike but the first sentence is far more likely to occur, and therefore will be assigned a higher score by the language model.
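To make the idea of scoring word sequences concrete, here is a minimal sketch of a toy bigram language model in Python. It is not how Microsoft’s engine works internally; the tiny corpus, the add-one smoothing, and the function names (tokenize, log_prob, pick_best) are all assumptions chosen for illustration.

```python
import math
from collections import Counter

# Tiny made-up training corpus (an assumption for illustration only).
CORPUS = [
    "we recognize speech with software",
    "computers recognize speech",
    "please recognize speech quickly",
    "a nice beach is relaxing",
    "they wreck a car",
]

def tokenize(sentence):
    """Lowercase, split on spaces, and add start/end markers."""
    return ["<s>"] + sentence.lower().split() + ["</s>"]

# Count how often each word appears as a context, and each adjacent pair.
context_counts = Counter()
bigram_counts = Counter()
for line in CORPUS:
    tokens = tokenize(line)
    context_counts.update(tokens[:-1])
    bigram_counts.update(zip(tokens, tokens[1:]))

VOCAB_SIZE = len({tok for line in CORPUS for tok in tokenize(line)})

def log_prob(sentence):
    """Log-probability of a sentence under the add-one smoothed bigram model."""
    tokens = tokenize(sentence)
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        numerator = bigram_counts[(prev, word)] + 1
        denominator = context_counts[prev] + VOCAB_SIZE
        total += math.log(numerator / denominator)
    return total

def pick_best(candidates):
    """Return the candidate transcription the model scores highest."""
    return max(candidates, key=log_prob)

# Two candidate transcriptions that sound alike.
print(pick_best(["wreck a nice beach", "recognize speech"]))
```

With this corpus the model prefers “recognize speech,” because the pair “recognize speech” occurs several times in the training text while most of the word pairs in “wreck a nice beach” are rare or unseen. A production recognizer would combine a score like this with an acoustic score rather than use it alone.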

If your presentation uses particular vocabulary items, such as product names or jargon that rarely occur in typical speech, it is likely that you can obtain improved performance by customizing the language model.

For example, if your presentation is about automotive topics, it might contain terms like “powertrain,” “catalytic converter,” or “limited slip differential.” Customizing the language model will enable the system to learn these terms.
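As a rough illustration of why customization helps, the sketch below adapts a toy word-frequency model by adding counts from slide text. This is only an assumed, simplified scheme; the source does not describe the actual adaptation method, and the sample texts, the slide_weight parameter, and the function names are hypothetical.

```python
from collections import Counter

def word_counts(text):
    """Count lowercase, space-separated words."""
    return Counter(text.lower().split())

def probability(word, counts, vocab_size):
    """Add-one smoothed probability of a single word."""
    return (counts[word] + 1) / (sum(counts.values()) + vocab_size)

def adapt(base_counts, slide_counts, slide_weight=3):
    """Return new counts with slide words added, weighted more heavily.

    slide_weight is a hypothetical knob, not a documented setting.
    """
    adapted = Counter(base_counts)
    for word, count in slide_counts.items():
        adapted[word] += slide_weight * count
    return adapted

# Made-up sample texts standing in for general speech and slide content.
general_text = "the power of the train is great and the train left the station on time"
slide_text = "the powertrain sends torque through a limited slip differential powertrain layout"

base = word_counts(general_text)
slides = word_counts(slide_text)
adapted = adapt(base, slides)
vocab_size = len(set(base) | set(slides))

before = probability("powertrain", base, vocab_size)
after = probability("powertrain", adapted, vocab_size)
print(f"before adaptation: {before:.3f}, after adaptation: {after:.3f}")
```

Before adaptation, “powertrain” has never been seen and gets only the small smoothed probability; after the slide text is added, its probability rises, so a recognizer using the adapted model would be less inclined to write “power train” instead.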

When you use the Customize speech recognition feature in Presentation Translator, your presentation content – including notes from the slides – is securely transmitted to the Microsoft Translator transcription service to create an adapted language model based on this data. Data used for customization is not de-identified and is retained in full, along with the adapted model, by the service for thirty (30) days from last use to support your future presentations and use of the language modeling.

The Presentation Translator service now requires installation of a plugin and a robust internet connection. It is currently available as a plugin for PowerPoint and OneNote and it is likely the service will be added to all versions of MS-Office 365 in the near future.

FMI – Please see the Microsoft Presentation Translator website

Google Slides Translator

This information is adapted from…

Last fall, Google’s G Suite added automated closed captions to Google Slides. The feature works by accessing your computer’s microphone to pick up what you’re saying during a presentation. It then transcribes your speech as captions, which appear in real time on the slides you’re presenting.

Google said the closed-captions feature in Slides can be helpful not only for people who are hearing impaired, but also for audience members in a noisy room. It can also be beneficial when a presenter isn’t speaking loudly enough, the company said.

To activate the feature, click the “CC” button in the navigation box when you start presenting. You can also use keyboard shortcuts, which are command + Shift + c on Mac and Ctrl + Shift + c in Chrome OS/Windows.

For more information:

This information adapted from

Otter aims to capture your conversations and make them easily accessible to you. The company believes that this is useful for many situations, including capturing meeting minutes, sharing conversations with others, or improving customer service. Otter can be used for free for up to 600 minutes of transcription per month with unlimited storage. You may upgrade to Otter Premium to get 6,000 minutes of transcription per month for a monthly or annual subscription fee. Otter can run in the Chrome browser on any computer or via a free app from the iTunes Store or Google Play.

On mobile, Otter will try to show you a live transcript as you’re recording. The final transcript and audio processing can take some time, depending on the length of the recording and how busy the service is. The mobile app will show you if your recording is still being processed. You can swipe down to refresh and see if the processing is completed.

Both iOS and Android apps allow you to edit your transcript by long-pressing on the speech bubbles. On the web, simply mouse over a speech bubble and click on the floating pencil button in the upper right.

On Android, you can use the “Integration with Call Recorder” to record your phone calls and upload to Otter to get transcripts. Look for tighter integration with phone calls in Otter coming soon.

One of the advantages of Otter is the ability to “tag” speakers so the transcript shows who said what, and when. By tagging a section of text with the speaker’s name, Otter is able to begin to recognize the qualitative differences in each speaker’s voice (voiceprint) and then continue to track who is speaking later in the conversation.

Pricing: up to 600 minutes per month is free for an individual; up to 6,000 minutes per month is $100/year for an individual; $150/year for teams of up to three users for 6,000 minutes per month.

For more information please see