Aspects of Editing – AI


I’m sure that many of you remember those days back in the mid to late ‘90s when Avid, Lightworks, Ediflex and a number of other electronic editing machines were starting to raise the possibility of a non-celluloid editing process. I remember several editors talking about how there was nothing that could convince them that there was anything that could replace the immediacy of touching the film, of stretching footage out along their arms to measure timing.

They would not, or could not, change to 'pushing buttons on a keyboard.' Today, those editors are called ex-editors. That is, of course, ancient history now, and all of us actively edit on digital nonlinear editors. I don't look back with nostalgia to those days – to hunting for two-frame 'trimmettes' in the bottom of a trim bin or on the bottom of my shoe. So, as we move forward toward 2020, it is with just a little hyperbole that I suggest that we are in similar circumstances now – except it is not electronic editing that is giving us new opportunities but the technologies of artificial intelligence and deep learning. First off, let me say that no editor is in danger of losing their job because of AI. Unless what you do can be replicated easily by a slightly trained human, no amount of computing power is going to give computer code the ability to make the judgments that we make every day. But there is no doubt that our assistants' jobs will change, as AI-enabled software takes over much of their editing prep work.

First, a definition of AI. Briefly (and with no nuance at all), artificial intelligence is the ability of code to rewrite itself to better achieve its goal – whether that goal is identifying audio and transcribing it, analyzing MRIs or scans for patterns that are too complex for a doctor to diagnose, or separating a mass of footage into separate bins (or tagging that footage) sorted by character or shot size. There are many ways that AI can teach itself to better achieve its goal. You can feed the program a lot of shots (this collection of data is called a 'corpus,' and the software learns from it by training a 'neural network') and tell it which shots are close-ups and which are medium shots.

The program can then extrapolate that knowledge to your own footage – shots that are not part of its teaching corpus. There are also forms of learning that do not require humans to say what something is or how the program can learn from it; this unsupervised approach is part of what is called deep learning. The computer really is teaching itself. As an example, anyone who presently uses the 'select subject' or 'select and mask' buttons in Adobe Photoshop (where the program is intuiting what the main subject or subjects of an image are) is using AI and image and pattern recognition. Anyone who is using SpeedScriber, Transcriptive or most other transcription tools is using AI and speech recognition. Anyone who is using Siri or Alexa to ask their phones to perform a task is using AI. Here's another use of AI that many people don't know about.
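To make the teach-then-extrapolate idea concrete, here is a deliberately toy sketch in Python – not any real product's code. The single feature (the fraction of the frame a face fills) and the labels are invented stand-ins for the image features a real neural network would learn on its own; the point is only that the program averages what it is shown, then tags shots it has never seen.

```python
def train(corpus):
    """'Teach' the classifier: average the feature value for each label."""
    sums, counts = {}, {}
    for face_fraction, label in corpus:
        sums[label] = sums.get(label, 0.0) + face_fraction
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def classify(model, face_fraction):
    """Tag a new shot with the label whose learned average is closest."""
    return min(model, key=lambda label: abs(model[label] - face_fraction))

# Teaching corpus: (face height as a fraction of frame height, shot label)
corpus = [
    (0.70, "close-up"), (0.60, "close-up"),
    (0.30, "medium"),   (0.35, "medium"),
    (0.08, "wide"),     (0.12, "wide"),
]
model = train(corpus)
# A shot that was never in the corpus still gets a sensible tag:
print(classify(model, 0.65))  # -> "close-up"
```

A real system learns thousands of features instead of one, but the workflow – labeled corpus in, extrapolation out – is the same.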

Go into your iPhone (this may also apply to Android) and click on the search function in the top menu bar of the Photos app. Type in a random word, like 'dog' or 'flower.' In addition to photos shot on 'Flower Street,' you'll also see a pretty good set of results of pictures of your dog, or of flowers, without you ever tagging them with those terms. That is AI using image recognition. I constantly use Google's reverse image search (click on the camera icon).

I recently uploaded a photo of all of the participants in a workshop I did in Brazil. The search results found the Wikipedia and dictionary entries for 'seminar' and provided hundreds of photos very much like mine. It also provided me with some videos of seminars. If I were looking for a piece of temp (or even final) stock footage, that might be very useful for me. That is AI, and I've used it. Other companies besides Google use this technology for images. Search at the stock footage site Pond5 by dragging a video clip into their search criteria. You'll get images and videos that bear some relationship to your clip – old cars, a similar color palette (your DP will love you for that), movement and more. It's exciting and an indicator of how we will be able to search our own footage in the near future. Documentary editors with too much B-roll can jump for joy soon.

Presently, transcription and search are the two most visible uses of AI in our lives. But there's more. Look at the chunk of this YouTube video from Stanford University and Adobe Research on 'Dialogue Driven Editing,' from 2:00 to 2:26. You'll see a slightly improved version of Avid's ScriptSync, where clips are created from camera files (even if there are multiple takes within a single file) and then matched to a script. It will also use image recognition to determine who is in a shot and what its size is. It can add that metadata, normally created by our assistants, to all of the other metadata that we get from set.

This is already an improvement that can save our assistants a ton of time and get you your footage much faster – valuable if you are editing on set. But it is the section of the video that starts at 2:26 where the real time saving (and, honestly, fear) is shown. Using a list of editing styles that the program has learned – such as 'begin on a wide shot,' 'show the character who is talking,' 'favor the character, Stacy,' or 'cut to her closeup at this line' – the editor can assemble a timeline of guidelines, rather than images, and the program will assemble a first cut which obeys those styles.
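A rule-driven first assembly like that can be sketched very simply. The sketch below is hypothetical – the rule names, data layout and shot metadata are invented for illustration and are not the Stanford/Adobe tool's actual interface – but it shows how style rules can pick one shot per line of dialogue.

```python
def assemble(lines, shots, styles):
    """Return a first-cut shot list by applying editing-style rules per line."""
    cut = []
    for i, line in enumerate(lines):
        # 'Show the character who is talking': keep shots with the speaker in frame.
        if styles.get("show-speaker"):
            candidates = [s for s in shots if line["speaker"] in s["in_frame"]]
        else:
            candidates = list(shots)
        # 'Begin on a wide shot': prefer wides for the first line.
        if i == 0 and styles.get("begin-wide"):
            candidates = [s for s in candidates if s["size"] == "wide"] or candidates
        # 'Favor the character, Stacy': prefer her close-up when she speaks.
        if styles.get("favor") == line["speaker"]:
            candidates = [s for s in candidates if s["size"] == "close-up"] or candidates
        cut.append(candidates[0]["name"])
    return cut

shots = [
    {"name": "1A wide",     "size": "wide",     "in_frame": {"Stacy", "Tom"}},
    {"name": "2B CU Stacy", "size": "close-up", "in_frame": {"Stacy"}},
    {"name": "3C CU Tom",   "size": "close-up", "in_frame": {"Tom"}},
]
lines = [{"speaker": "Tom"}, {"speaker": "Stacy"}, {"speaker": "Tom"}]
styles = {"begin-wide": True, "show-speaker": True, "favor": "Stacy"}
print(assemble(lines, shots, styles))  # -> ['1A wide', '2B CU Stacy', '1A wide']
```

Notice what is missing from the rules: nothing here knows whether a take's performance is any good – which is exactly the gap the next paragraph describes.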

Right now, the results are pretty bad, since the program doesn't recognize good or bad performances or subtle nuances (and it's only dialogue-based), but if it can help us wade through the four to five hours of dailies we get every day, it provides great value – from both an assistant editor's and an editor's point of view. Editors, of course, edit using storytelling and a great grasp of emotional performances and nuanced behavior. I doubt that AI will ever fully replace either of those. But it will certainly get closer to helping us make choices once it integrates emotion analysis (scientists call that sentiment analysis). There is already a great body of research on how words convey emotions. Natural language processing, or semantic analysis, has long been an effective tool in assessing the emotional value that the construction of sentences and word choice creates in cultures.
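At its simplest, that word-level research boils down to a scored lexicon. The toy Python sketch below invents a tiny lexicon and weights purely for illustration; real sentiment systems use large, validated lexicons and also model sentence structure, negation and context.

```python
# Invented, illustrative word weights: positive words score above zero,
# negative words below. Real lexicons contain thousands of entries.
LEXICON = {"love": 2, "great": 2, "happy": 1, "sad": -1, "hate": -2, "terrible": -2}

def sentiment(text):
    """Sum the emotional weight of known words; >0 reads positive, <0 negative."""
    words = text.lower().replace(",", " ").replace(".", " ").split()
    return sum(LEXICON.get(w, 0) for w in words)

print(sentiment("I love this scene, it makes me happy"))  # -> 3 (positive)
print(sentiment("A terrible, sad ending"))                # -> -3 (negative)
```

Crude as it is, a score like this already lets software rank lines of dialogue by emotional charge – the seed of the audience-response tools described next.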

Do a web search on ‘emotion analysis’ and you’ll find a large number of articles and images which analyze simple and complex types of emotions. We spend a lot of time in our editing process trying to elicit specific emotional responses in our audiences, don’t we? We want them to feel sad at the right moments or laugh when we want them to. Audience research testing (preview cards, turning a dial between ‘like’ and ‘dislike’) has long proved what a blunt and ineffective set of tools we have. Emotional analysis through AI has begun to make this easier.

Using simple, cheap webcams, we can record individual audience responses to a scene, a commercial or anything visual. There are many technologies that can, in real time, read these faces and figure out the viewer's emotions (often with better accuracy than a human can). An image from Microsoft Azure's AI technology shows how it translates a still image into eight different emotions (in that image, the toddler's main emotions are 'contempt,' followed by 'disgust,' which will surely be a familiar one to parents of young children). Companies such as Affectiva have made a business of interpreting individual viewers' faces into their emotional responses, so filmmakers can see if they are really affecting their audience in the way they had hoped. You can try out a demo or see results at their website.
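Services of this kind typically return a confidence score per emotion for each detected face; reading off the dominant emotion is then trivial. The scores below are invented to mirror the toddler example, not real API output.

```python
# Hypothetical per-face emotion scores, in the style of the eight-emotion
# breakdown described above. The numbers are made up for illustration.
scores = {
    "anger": 0.01, "contempt": 0.55, "disgust": 0.30, "fear": 0.00,
    "happiness": 0.02, "neutral": 0.08, "sadness": 0.03, "surprise": 0.01,
}

def dominant(emotions):
    """Pick the emotion with the highest confidence score."""
    return max(emotions, key=emotions.get)

print(dominant(scores))  # -> "contempt"
```

Run one of these per audience face, per frame, and you have a second-by-second emotional trace of a preview screening.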

There is, of course, a downside to this, and it is clear that the age-old struggle between art and commerce will continue to deepen. But I, for one, welcome what the new tools can bring us. Are they smiling at the right time? Are they surprised when I want them to be? This will give me better answers to those questions than looking around in the dark at a preview screening. There is one more area where AI is changing our world, and it will only get more pervasive – the arena of 'deliverables' or 'localization.' In a world where our projects are released simultaneously around the world, it is more important than ever to speed up the work between our picture 'lock' and the release of the many versions of that project. Translations, subtitling, foreign dubbing, control of multiple release-format workflows, etc., are all getting increasingly complicated. Studios see AI as an effective way to streamline some of this workflow.

Natural language processing is getting sophisticated enough to recognize idioms in one language and provide analogous phrases in others. This will provide better first drafts for subtitlers and international deliveries. On the technical side, a German-language version of a Netflix show needs different deliverables than a theatrical Japanese-language version. AI can learn these things – even as re-edits happen – and help create efficient workflows to deliver them on time. And then there is what I call 'hyper-localization.'

It is well known that streaming services like Netflix and Amazon collect massive amounts of data. My Amazon Prime videos can already recognize which actor is in a scene and present me with a clickable link to learn more about them. Netflix knows who is watching any given program, who is binge-watching it, what episode they stopped watching, and even the frame where they bailed out. It is one small step to matching that data up with what is actually happening in the film right before they stop watching. Can we create different versions of our works that could be automatically changed to fit the differing tastes of my neighbors and me? If they like action more than I do, or less blood, can Netflix stream different versions to them than to me? The technology for this has already been created.
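Mechanically, streaming a different version per viewer is just a lookup keyed on a preference profile. This is a speculative sketch – the profile fields, thresholds and cut names are all invented for illustration, not any streamer's actual system.

```python
def pick_version(profile):
    """Choose which edit of a scene to stream from a viewer's (hypothetical) profile."""
    if profile.get("gore_tolerance", 1.0) < 0.3:
        return "toned-down cut"
    if profile.get("action_affinity", 0.0) > 0.7:
        return "extended action cut"
    return "standard cut"

# My action-loving neighbor and I could be served different edits:
print(pick_version({"action_affinity": 0.9}))  # -> "extended action cut"
print(pick_version({}))                        # -> "standard cut"
```

The hard part, of course, is not the lookup but cutting and delivering every variant – which is exactly why editors stay in the loop.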

The research data for this is informed by AI, and it will change how we create our deliverable elements. Many of the young filmmakers whom I teach at USC's School of Cinematic Arts are not offended by creating multiple experiences within one story. Those who came up as gamers, or who are delving deeply into immersive cinema, have far fewer problems with allowing audiences to select some of the pathways through the stories (we say that those audiences have more 'agency') than the filmmakers I grew up with do. The ability to think of multiple choices in storylines may become very important in our futures. And it will affect how we, as editors, tell the stories we are helping to create.

The exciting thing is that there will be so many different forms of storytelling media and they will all need editors as storytellers. Those of us who embrace this change will be very busy working in our medium’s future. Or we will become ex-editors.