Reservations on the Usability of Automatic Captions

Lately, the mainstream press has been running articles trumpeting that Google is adding automatic captions to YouTube videos.

Captioning For YouTube Is Not New

For at least a year now, Google has offered the public the ability to add captions and subtitles to videos uploaded to YouTube, so that human-generated transcriptions can serve as the captioning.  What’s new, at least for YouTube, is that Google is now using speech-recognition technology to convert speech to text automatically; no transcription files have to be uploaded manually.

Automatic Captioning Is Not New

Unheralded by the mainstream press, last year the National Institute on Disability and Rehabilitation Research (NIDRR) funded a project that used IBM Research technology to automate the closed captioning of video-based instruction.  It was intended to improve accessible distance learning for people with cognitive disabilities, the Deaf, and the hard of hearing.


This feature, developed in parallel by both companies, appears to be a boon for the affected populations.  Unfortunately, speech-recognition technology still has far to go before it produces language understandable by most people, especially people with cognitive disabilities.

The YouTube automatic captioning is based upon Google Voice technology.  I am a user of Google Voice.  Its conversion of voice-mail messages to transcripts is so bad that I keep using it only because it consistently gives me a good laugh.  Its speech recognition is quite poor.

In fairness, Google “… promises that the technology will improve over time,” and IBM advertises “… over 90 percent accuracy.”  That seemingly small remaining margin actually accounts for a significant number of errors: at 90 percent word accuracy, roughly one word in ten is wrong, which is enough to make the resulting transcriptions difficult to comprehend.
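To make that arithmetic concrete, here is a back-of-the-envelope sketch.  The speaking rate and accuracy figure are my assumptions, not IBM’s published numbers: I take 90 percent word-level accuracy at face value and assume a typical speaking pace of about 150 words per minute.

```python
# Back-of-the-envelope estimate of caption errors at "90 percent accuracy".
# Assumptions (mine, for illustration): 90% word-level accuracy and an
# average speaking rate of roughly 150 words per minute.

words_per_minute = 150
word_accuracy = 0.90

# Expected number of misrecognized words in each minute of video.
errors_per_minute = round(words_per_minute * (1 - word_accuracy))

print(errors_per_minute)  # 15 wrong words per minute of speech
```

Fifteen wrong words in every minute of dialogue is not a rounding error; it is several garbled words in nearly every sentence, which is why the remaining “ten percent” matters so much.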

The other feature Google announced is that it is giving people the option of using its automatic translation system to read the captions in any of 51 languages.

It was almost twenty years ago that I first researched and began following the effort to computerize the translation of written text from one language to another.  Despite great strides since then, the technology’s Achilles’ heel has always been parsing context.  Here is a simple example: if employees are described as green, Americans understand that to mean they are inexperienced.  Computers have always had difficulty determining context, so a common machine-translation mistake is to render that as “the employees are the color green.”

Users of the current Google Translator service, which translates Web-site text from one language to another, often tell me it gives them a rough idea of the content, but it has far to go to match human-translated content.


The promise of these automatic tools is that they will make it much easier to caption videos, thus promoting the widespread use of captioning.  Given the shortcomings of speech recognition and the difficulty of parsing context, I predict that, for years to come, automatically generated captions on YouTube videos will require human revision to be understandable.  Once people realize this, I expect adoption to be low.

The good news is that Google has a strong financial incentive to get this right.  Its empire relies upon the association of advertisements with textual content.  The more accurate Google can make automatic captioning and translation, the more it will be able to monetize other content, such as video and audio, through its caption text.