Adding captions and subtitles

In this tutorial we will guide you through the process of adding subtitles and captions to HTML5 video, looking at some of the problems that currently exist, and solutions to those problems.

Below we will build up a simple demo. You can see the source code, and also view it live. You'll notice that the source code has different directories — these correspond to the different stages of the tutorial, allowing you to both check what your code should look like after each stage, and start the tutorial at any stage if you don't wish to go right from the beginning.

At this point, download the content kit so you have the demo code available when working through the tutorial — see the demo directory.

This demo concentrates purely on text tracks, so we haven't added much CSS or extra HTML/JavaScript to get in the way. If you want to style your video page after completing the tutorial, please go ahead!

Note: Thanks to Ian Devlin for letting us use some of his code as the basis for the demo in this tutorial.

Start: A basic HTML5 video

Note: This section relates to Slide 5 of the slideshow.

Let's begin by inspecting the start state of the demo. At this point we have a simple HTML5 video in our page, and not much else:

<video controls preload="metadata">
  <source src="../video/sintel-short.mp4" type="video/mp4">
  <source src="../video/sintel-short.webm" type="video/webm">

	<p>It appears that your browser doesn't support HTML5 video. Here's a
  <a href="../video/sintel-short.mp4">direct link to the video instead</a>.</p>
</video>

There is not much to see here. We have set the preload attribute to metadata, so the browser fetches only the video's metadata up front (not much data is downloaded immediately, but we still have access to useful information such as the video's length). We have also included the default browser controls using the controls attribute, and added a fallback paragraph that is displayed if the browser doesn't support HTML5 video.

The two <source> elements provide a choice of different video formats for cross browser support.

Step 2: Adding subtitles

Note: This section relates to Slide 8 of the slideshow.

Now let's move on to adding some text tracks to our video. Open up your start index.html file, and add the following lines below the <source> elements:

<track label="English" kind="subtitles" srclang="en" src="vtt/sintel-subtitles-en.vtt" default>
<track label="Deutsch" kind="subtitles" srclang="de" src="vtt/sintel-subtitles-de.vtt">
<track label="Español" kind="subtitles" srclang="es" src="vtt/sintel-subtitles-es.vtt">

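The <track> elements associate text tracks with the video. The attributes are as follows:

label: the human-readable title of the track, which is what a track menu displays.
kind: the type of text track, subtitles in this case.
srclang: the language of the track's text, given as a language tag such as en.
src: the URL of the .vtt file that contains the track's cues.
default: marks the track that should be enabled by default, unless the user's preferences indicate that another track is more appropriate.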

This sounds like it all makes sense, and it does, in terms of the spec definition. Unfortunately, browsers don't currently do a very good job with their default text track UX.

It is therefore a good idea to implement your own custom menu using JavaScript. You'll see how in the next section.

Try testing your code now by double-clicking your index.html file. Be warned that if you are using Chrome or Opera for testing, you'll need to run your code through a local web server (such as Python's SimpleHTTPServer); otherwise you may get an error message in the console saying that the text tracks were blocked from loading via file://.

WebVTT syntax basics

Let's look at the contents of one of our .vtt (video text track) files, before we move on. HTML5 video originally used .srt (SubRip Text) files to provide text tracks, but these were replaced by .vtt because .srt is only really for subtitles, whereas there are lots of different types of text track you might want to use.

Open one of the files inside your code's vtt directory in a text editor. You'll see entries like this:

WEBVTT

0
00:00:00.000 --> 00:00:12.000
[Test]

NOTE This is a comment and must be preceded by a blank line

1
00:00:18.700 --> 00:00:21.500
This blade has a dark past.

2
00:00:22.800 --> 00:00:26.800
It has shed much innocent blood.

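The file must start with WEBVTT. We then include separate text track blocks (cues), separated by blank lines. In our files each cue starts with a number. This first line is an optional cue identifier; it doesn't have to be a number, but each identifier should be unique, and numbering the cues in ascending order is a helpful convention.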

The second line of each block is a timestamp range, indicating what time the text track should start being shown, and what time it should disappear again. The start and end timestamps are in the format hh:mm:ss.ttt (hours, minutes and seconds, then a period followed by milliseconds), allowing for very precise times. The hours part may be omitted, but all the remaining digits must be filled in, so for example you can't write 50 milliseconds as 50; you need to include a leading zero: 050.
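To make the timestamp format concrete, here is a minimal sketch that converts a timestamp string into a number of seconds. The function name parseTimestamp is our own invention for illustration (it is not a browser API), and the sketch only accepts well-formed timestamps.

```javascript
// Hypothetical helper (not a browser API): converts a WebVTT timestamp
// such as "00:00:18.700" or "00:18.700" into a number of seconds.
function parseTimestamp(timestamp) {
  // Hours are optional; minutes and seconds take two digits (00-59),
  // and milliseconds always take three digits.
  const match = /^(?:(\d{2,}):)?([0-5]\d):([0-5]\d)\.(\d{3})$/.exec(timestamp);
  if (!match) {
    throw new Error('Invalid WebVTT timestamp: ' + timestamp);
  }
  const hours = Number(match[1] || 0);
  const minutes = Number(match[2]);
  const seconds = Number(match[3]);
  const milliseconds = Number(match[4]);
  return hours * 3600 + minutes * 60 + seconds + milliseconds / 1000;
}

console.log(parseTimestamp('00:00:18.700')); // 18.7
```

A timestamp with missing digits, such as 00:00:00.50, is rejected with an error here, which mirrors the rule that every digit must be filled in.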

The third line onwards (you can include multiple lines in each block) is the text that you actually want to display.
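The block structure described above can be sketched in a few lines of JavaScript. This is a simplified illustration under our own assumptions, not a spec-compliant parser (the browser does the real parsing for you when it loads a <track>); the name parseCues and the shape of the returned objects are hypothetical.

```javascript
// Simplified sketch, not a spec-compliant parser: splits WebVTT text into cues.
function parseCues(vttText) {
  // Normalize line endings, then split into blocks separated by blank lines.
  const blocks = vttText.replace(/\r\n?/g, '\n').trim().split(/\n{2,}/);
  if (!/^WEBVTT(\s|$)/.test(blocks[0])) {
    throw new Error('A WebVTT file must start with WEBVTT');
  }
  const cues = [];
  for (const block of blocks.slice(1)) {
    const lines = block.split('\n');
    if (/^NOTE(\s|$)/.test(lines[0])) continue; // skip comment blocks
    // The timing line is the one containing "-->"; a line before it is the identifier.
    const timingIndex = lines.findIndex((line) => line.includes('-->'));
    if (timingIndex === -1) continue; // not a cue block
    const [start, rest] = lines[timingIndex].split('-->');
    const [end, ...settings] = rest.trim().split(/\s+/);
    cues.push({
      id: timingIndex > 0 ? lines[0] : '',
      start: start.trim(),
      end: end,
      settings: settings.join(' '),
      text: lines.slice(timingIndex + 1).join('\n'), // may span several lines
    });
  }
  return cues;
}

const exampleVtt = 'WEBVTT\n\n1\n00:00:18.700 --> 00:00:21.500\nThis blade has a dark past.';
console.log(parseCues(exampleVtt));
```

Each returned object keeps the timestamps as strings, so the sketch stays focused on how a file divides into blocks.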

Note: There is a lot more than this available in WebVTT syntax. See Step 4 below, which covers cue settings and the tags available inside cues.

Step 3: Building a custom menu with JavaScript

Note: This section relates to Slide 16 of the slideshow.

Let's add some proper interactivity to our text tracks that works across browsers. You can see the finished version of this code in the step3 directory in the source code, if you need to check it out.

Note: In a real project you'd probably hide the browser's default controls and create a complete custom control set, as shown in Video player styling basics. Here however we just wanted to focus on the basics of text tracks.

First of all, add the following HTML below your </video> closing tag:

<form>
  <select name="select">
    
  </select>
</form>

This will act as our simple menu for selecting the different text tracks we want to display. Next, insert a <script></script> element just above the closing </body> tag to put your JavaScript in (or link to a separate script file if you wish).

Now add the following inside your script element:

var video = document.querySelector('video');
var select = document.querySelector('select');

This simply grabs a reference to the elements we want to manipulate using JavaScript.

Next, add the following below your first two lines of JavaScript.

function hideTracks() {
  for (var i = 0; i < video.textTracks.length; i++) {
    video.textTracks[i].mode = 'hidden';
  }
}

hideTracks();

Here we are creating a function that loops through all the text tracks available on our video (video.textTracks gives you an array-like list of all available text tracks), and sets their mode properties to hidden, meaning that any currently showing text tracks will be hidden (to show a text track you'd set its mode to showing). We then run the function to make sure we start the video in a clean state.
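A track's mode is always one of three strings: disabled, hidden or showing. As a small sketch of how you might inspect the current state while debugging, the helper below returns the labels of the tracks that are showing. The name showingTrackLabels is hypothetical, and the plain objects are stand-ins so the snippet can run outside a page; in the demo you would pass video.textTracks instead.

```javascript
// Hypothetical helper: returns the labels of the tracks currently being shown.
// `textTracks` can be video.textTracks or any array-like list of objects
// that have `mode` and `label` properties.
function showingTrackLabels(textTracks) {
  return Array.from(textTracks)
    .filter((track) => track.mode === 'showing')
    .map((track) => track.label);
}

// Stand-in objects for illustration only; they are not real TextTrack objects.
const exampleTracks = [
  { label: 'English', mode: 'showing' },
  { label: 'Deutsch', mode: 'hidden' },
  { label: 'Español', mode: 'disabled' },
];

console.log(showingTrackLabels(exampleTracks)); // [ 'English' ]
```

After hideTracks() has run, calling this helper on video.textTracks should give an empty array.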

Next, add the following block at the bottom of your other JavaScript:

var tracksOff = document.createElement('option');
tracksOff.setAttribute('value','off');
tracksOff.textContent = 'Tracks off';
select.appendChild(tracksOff);

for (var i = 0; i < video.textTracks.length; i++) {
  var curTrack = video.textTracks[i];
  var addTrackOpt = document.createElement('option');
  addTrackOpt.setAttribute('value',curTrack.kind + '-' + curTrack.language);
  addTrackOpt.textContent = curTrack.label + ' ' + curTrack.kind;
  select.appendChild(addTrackOpt);
}

select.addEventListener('change',function() {
  trackChange(select.value);
});

First of all, we create an <option> element called tracksOff, give it a value of off and text content of Tracks off, and then append it to our HTML as a child of our <select> element. This creates our 'off' option, to turn off any text tracks that are currently showing.

Then we loop through our text tracks again. This time, in each loop we store a reference to the current text track in curTrack (to make writing subsequent code shorter), create a new <option> element, and give it a value and text content based on the current track's kind, language and label properties. We then add each <option> element to the <select> element.

The final part of this code adds an event listener to our <select> element so that every time its value changes, it runs a function called trackChange(), passing it the current select value. We'll look at this function shortly. For now, save and refresh, and have a look at the generated code in your browser dev tools, to help you understand what the last section of code did. It will look something like this:

<select name="select">
    <option value="off">
        Tracks off
    </option>
    <option value="subtitles-en">
        English subtitles
    </option>
    <option value="subtitles-de">
        Deutsch subtitles
    </option>
    <option value="subtitles-es">
        Español subtitles
    </option>
</select>

Now we'll add that trackChange() function to the code and look at what it does. Add the following, just below the hideTracks() function:

function trackChange(value) {
  if(value === 'off') {
    hideTracks();
  } else {
    hideTracks();
    var splitValue = value.split('-');
    
    for (var i = 0; i < video.textTracks.length; i++) {
      if(video.textTracks[i].kind === splitValue[0]) {
        if(video.textTracks[i].language === splitValue[1]) {
          video.textTracks[i].mode = 'showing';
        }
      }
    }
  }
}

The argument the function takes is the value of the <select> element after a new option has been selected in it. The first if block checks whether that value is off. If so, we just run the hideTracks() function to hide any active subtitles.

If the value isn't off, the else block is run. First, the hideTracks() function is run, because we don't want to have multiple tracks shown at the same time.

Next, we split the value at the "-" character, to get an array of two values — the first is the track kind, and the second is the track language. Remember how when we generated the select menu in the first place, we generated the values from the kind and language and put a "-" in the middle of them, e.g. subtitles-en? Here we are going in the opposite direction.

Next we have a for loop with two nested ifs. In each loop iteration, if the current text track's kind is equal to the kind from the select value, we then test to see if the current text track's language is equal to the language from the select value. If that's also true, then we've found the correct text track and we display it by setting its mode to showing.
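One thing to watch for: splitting the value on every "-" works for the languages in this demo (en, de and es), but a hyphenated language tag such as pt-BR would be cut in two. The sketch below is one possible variation, not the demo's code: it splits on the first "-" only and combines the two tests into one condition. The name findTrack is hypothetical.

```javascript
// Hypothetical helper: finds the track matching a menu value such as "subtitles-en".
// Splitting on the first "-" only keeps hyphenated language tags like "pt-BR" intact.
function findTrack(textTracks, value) {
  const separator = value.indexOf('-');
  if (separator === -1) {
    return null; // e.g. "off": there is no kind/language pair to look up
  }
  const kind = value.slice(0, separator);
  const language = value.slice(separator + 1);
  const match = Array.from(textTracks).find(
    (track) => track.kind === kind && track.language === language
  );
  return match || null;
}
```

Inside trackChange(), you could then call findTrack(video.textTracks, value) after hideTracks() and, if it returns a track, set that track's mode to showing.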

Save your code and try it out again.

Step 4: Captions versus subtitles, and default selection

Note: This section relates to Slide 26 of the slideshow.

So far we have only added subtitles to our video, but we should keep in mind that there are other types of text tracks to consider. Subtitles are generally for the use of people who can hear the audio dialog, but can't understand the language it is spoken in. They only include the words that are being spoken and are not positioned.

Captions

Captions, on the other hand, are generally for the use of people who are deaf or hard of hearing. They tend to include information on who is speaking each line of dialog, and the lines are often positioned near the character to further aid recognition of this. In addition, captions tend to include information to describe any music that plays, sound effects that occur, etc.

Open your index.html file and add the following line below the first <track> element:

<track label="English" kind="captions" srclang="en" src="vtt/sintel-captions-en.vtt">

Now try refreshing your example. You should now have an extra option in your select menu: English captions. Choose this one and observe how it differs from the English subtitles.

Adding a default selection

Instead of having no subtitles selected by default, it might be nice to have the English subtitles selected by default; this is pretty easy to achieve. In the for loop that populates the select menu, add the if block shown at the end of the loop body below:

for (var i = 0; i < video.textTracks.length; i++) {
  var curTrack = video.textTracks[i];
  var addTrackOpt = document.createElement('option');
  addTrackOpt.setAttribute('value',curTrack.kind + '-' + curTrack.language);
  addTrackOpt.textContent = curTrack.label + ' ' + curTrack.kind;
  select.appendChild(addTrackOpt);

  if(curTrack.language === 'en' && curTrack.kind === 'subtitles') {
    addTrackOpt.setAttribute('selected','selected');
    trackChange(select.value);
  }
}

Here, we check whether the language is English and the track kind is "subtitles". If both conditions are true, we set the selected attribute on that particular <option> element, and run the trackChange() function so that the English subtitles are shown by default.

Cue Settings

Open up your sintel-captions-en.vtt file — you'll see that the cues in this file have some extra information available, for example:

1
00:00:18.700 --> 00:00:21.500 line:20% align:end
<c.man><b>Man</b>: This blade has a dark past.  </c>

On the second line, after the cue timings, we can see cue settings: optional instructions used to position where the cue text will be displayed over the video.

A setting's name and value are separated by a colon. The settings are case-sensitive, so use lower case as shown. There are five cue settings, discussed in the sections below.
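To illustrate the name:value format, here is a minimal sketch that turns a settings string into an object. The helper name parseCueSettings is hypothetical, and the sketch does not check whether a setting name or value is actually valid.

```javascript
// Hypothetical helper: turns a cue settings string into name/value pairs.
function parseCueSettings(settingsText) {
  const settings = {};
  // Settings are separated from each other by whitespace.
  for (const part of settingsText.trim().split(/\s+/)) {
    const colon = part.indexOf(':');
    // Skip anything without both a name and a value around the colon.
    if (colon <= 0 || colon === part.length - 1) continue;
    settings[part.slice(0, colon)] = part.slice(colon + 1);
  }
  return settings;
}

console.log(parseCueSettings('line:20% align:end'));
// { line: '20%', align: 'end' }
```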

vertical

vertical indicates that the text will be displayed vertically rather than horizontally, such as in some Asian languages.

vertical:rl writing direction is right to left
vertical:lr writing direction is left to right

line

line specifies where text appears vertically. If vertical is set, line specifies where text appears horizontally.

The value can be a line number (0 is the first line, and negative numbers count back from the last line) or a percentage:

             vertical omitted   vertical:rl   vertical:lr
line:0       top                right         left
line:-1      bottom             left          right
line:0%      top                right         left
line:100%    bottom             left          right

position

position specifies where the text will appear horizontally. If vertical is set, position specifies where the text will appear vertically.

                vertical omitted   vertical:rl   vertical:lr
position:0%     left               top           top
position:100%   right              bottom        bottom

size

size specifies the width of the text area. If vertical is set, size specifies the height of the text area.

            vertical omitted   vertical:rl   vertical:lr
size:100%   full width         full height   full height
size:50%    half width         half height   half height

align

align specifies the alignment of the text. Text is aligned within the space given by the size cue setting if it is set.

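               vertical omitted       vertical:rl          vertical:lr
align:start    left                   top                  top
align:center   centred horizontally   centred vertically   centred vertically
align:end      right                  bottom               bottom

Older drafts of the specification, and some older browsers, used align:middle instead of align:center.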

Styling the displayed subtitles

Let's have a look at our caption example again:

1
00:00:18.700 --> 00:00:21.500 line:20% align:end
<c.man><b>Man</b>: This blade has a dark past.  </c>

You can see that the text cue has some special tags marking it up. Text cues can be styled via CSS Extensions.

The ::cue pseudo-element is the key to targeting individual text track cues for styling, as it matches any defined cue. Have a look in the style/style.css file and you'll see some simple rulesets like this:

video::cue { white-space: pre; }

video::cue(.man) { color: yellow; }

The first rule preserves whitespace in all cues, whereas subsequent rules apply different colours to the text spoken by specific characters.

What properties can be applied to text cues?

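There are only a handful of CSS properties that can be applied to a text cue, including color, opacity, visibility, text-decoration, text-shadow, white-space, outline, and the background and font shorthands.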

What tags are available to cues?

The different available tags are as follows:

Timestamp tag

The timestamp tag specifies a timestamp to show a subset of the cue text at a slightly later time than the start timestamp specifies. The timestamp must be greater than the cue's start timestamp, greater than any previous timestamp in the cue payload, and less than the cue's end timestamp.

The active text is the text between the timestamp and the next timestamp or to the end of the payload if there is not another timestamp in the payload. Any text before the active text in the payload is previous text. Any text beyond the active text is future text. This enables karaoke style captions.

1
00:16.500 --> 00:18.500
When the moon <00:17.500>hits your eye

2
00:00:18.500 --> 00:00:20.500
Like a <00:19.000>big-a <00:19.500>pizza <00:20.000>pie

3
00:00:20.500 --> 00:00:21.500
That's <00:00:21.000>amore
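To see how a payload divides into previous, active and future text, the following sketch splits a cue's text on its timestamp tags. It is only an illustration under our own assumptions: the browser handles this for you when rendering, and the name splitKaraokeText and the returned object shape are hypothetical.

```javascript
// Hypothetical helper: splits a cue payload on its inline timestamp tags.
// Each segment pairs a piece of text with the timestamp at which it becomes
// active; the first segment becomes active at the cue's own start time.
function splitKaraokeText(payload, cueStart) {
  const segments = [];
  // Matches tags such as <00:19.000> or <00:00:21.000>.
  const tagPattern = /<((?:\d{2,}:)?\d{2}:\d{2}\.\d{3})>/g;
  let activeFrom = cueStart;
  let lastIndex = 0;
  let match;
  while ((match = tagPattern.exec(payload)) !== null) {
    const text = payload.slice(lastIndex, match.index);
    if (text) segments.push({ activeFrom: activeFrom, text: text });
    activeFrom = match[1];
    lastIndex = tagPattern.lastIndex;
  }
  const rest = payload.slice(lastIndex);
  if (rest) segments.push({ activeFrom: activeFrom, text: rest });
  return segments;
}

console.log(splitKaraokeText("That's <00:00:21.000>amore", '00:00:20.500'));
```

At any given moment, the segment with the latest activeFrom that has already passed is the active text, the segments before it are previous text, and those after it are future text.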

The following tags require opening and closing tags (e.g. <b>text</b>).

Class tag

The class tag (<c></c>) styles the contained text using a CSS class.

<c.classname>text</c>

Italics tag

The italics tag (<i></i>) italicizes the contained text.

<i>text</i>

Bold tag

The bold tag (<b></b>) bolds the contained text.

<b>text</b>

Underline tag

The underline tag (<u></u>) underlines the contained text.

<u>text</u>

Ruby tag

The ruby tag (<ruby></ruby>) is used with ruby text tags to display ruby characters (i.e. small annotative characters above other characters).

<ruby>WWW<rt>World Wide Web</rt>oui<rt>yes</rt></ruby>

Ruby text tag

The ruby text tag (<rt></rt>) is used with ruby tags to display ruby characters (i.e. small annotative characters above other characters).

<ruby>WWW<rt>World Wide Web</rt>oui<rt>yes</rt></ruby>

Voice tag

Similar to the class tag, the voice tag (<v></v>) is also used to style the contained text using CSS. The text after the v names the speaker.

<v Bob>text</v>

Conclusion

That rounds off our tutorial on text tracks. If you look in the final directory, you'll find a slightly improved version that contains a simple JavaScript library, captionator, which adds in a JavaScripted version of the necessary APIs to allow the example to work in older browsers.

Note: There are a few services available for speeding up the process of developing subtitles/captions, but we'd recommend checking out Amara — UniversalSubtitles.