VEED.IO, our online video editor, is completely self-funded, and our team is fewer than 10 engineers in total. Yet, despite our limitations, over the past 2 years we have managed to build something we previously doubted was even possible... a cloud-based video editor.
We have hit many roadblocks along the way and had to rewrite our codebase entirely multiple times until we found the solution that would eventually work. And so today we would like to share with you the story of what it was like building VEED and give you a rough idea of what tech it took to make an online video editor possible.
A bit of history
After spending thousands of hours editing videos with complicated and clunky programs, we were, frankly, tired and frustrated with how most of the existing editing tools on the market were so unnecessarily complex to do simple things. So 2.5 years ago, we started working on building a simple online video editor.
Most heavy-duty video editing tools were designed for making Hollywood-grade films, not social media content. We therefore thought there was a need for a simple yet powerful online video editor: a tool that anyone could learn to use in a few minutes. We couldn't find any tools that could help us do that, so, foolishly, we decided it was time to build one ourselves.
You must have heard us say that we have made every possible mistake in the book, and we really did. These mistakes cost us hundreds, maybe thousands, of hours of wasted time after realising certain approaches weren't good enough and starting over and over again. That's why we decided to write this post, so others can learn from our mistakes.
First steps: so how did we do it?
Before we began working on our new idea, we needed to make sure we understood the core fundamentals of video creation itself and the techniques that make it possible. So that’s something we want to share with you now.
So let’s start with the basics. What is video? Real simple: a video is a collection of images (known as ‘frames’) that are played back to you in order at a certain speed (known as the ‘frame rate’) to create the illusion of a moving image.
A good way to test this yourself is to take any video and slow it down dramatically. Once the frame rate drops low enough, it stops looking like a video and starts to look closer to a jumpy slideshow.
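To make the arithmetic concrete, here is a minimal Python sketch (purely illustrative, not VEED's actual code) of the relationship between frame count, frame rate and duration:

```python
def duration_seconds(frame_count: int, fps: float) -> float:
    """A clip's length is simply its frame count divided by its frame rate."""
    return frame_count / fps

def playback_fps(frame_count: int, target_duration: float) -> float:
    """Slowing a clip down means playing the same frames at a lower rate."""
    return frame_count / target_duration

# 300 frames at 30 fps is 10 seconds of video...
print(duration_seconds(300, 30))   # 10.0
# ...and stretching those same 300 frames over 60 seconds drops playback
# to 5 fps, low enough to look like a jumpy slideshow instead of motion.
print(playback_fps(300, 60))       # 5.0
```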
Now let’s talk a bit about various features for video editing:
One of the most popular features amongst our users is ‘Trim’. If you would like to remove a specific section of a video, all we need to do is drop the frames you don’t want and stick the rest back together. Voila, your video is now trimmed. Conversely, you can stitch two different videos together by combining their frames, making your video longer.
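If you treat a clip as a list of frames, trimming and stitching reduce to simple list operations. A hedged Python sketch of the idea (frame indices derived from timestamps at a known frame rate; not our production code):

```python
def trim(frames, fps, start_s, end_s):
    """Keep only the frames between start_s and end_s (in seconds)."""
    return frames[int(start_s * fps):int(end_s * fps)]

def cut_out(frames, fps, start_s, end_s):
    """Remove a middle section and stick the remainder back together."""
    return frames[:int(start_s * fps)] + frames[int(end_s * fps):]

def concat(clip_a, clip_b):
    """Stitch two clips (assumed to share a frame rate) into one longer clip."""
    return clip_a + clip_b

frames = list(range(300))   # pretend each int is a frame: 10 s at 30 fps
assert len(trim(frames, 30, 2, 4)) == 60       # a 2-second excerpt
assert len(cut_out(frames, 30, 2, 4)) == 240   # 10 s minus the 2 s removed
assert len(concat(frames, frames)) == 600      # twice as long
```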
Now let’s talk about adding content inside the video. Let’s say you would like to add an image, text or subtitles to it. Since every video is merely a collection of frames, all we need to do is layer our object, be it an image or text, on top of the frames to produce (aka ‘render’) a new set of frames.
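Layering an object over a frame is, at its core, per-pixel alpha blending. Here is a simplified Python sketch of the standard ‘over’ operation on a single RGB pixel; a real renderer would do this for every pixel of every frame, usually on the GPU:

```python
def blend_over(base, overlay, alpha):
    """Composite an overlay pixel over a base pixel.

    base, overlay: (r, g, b) tuples with 0-255 channels.
    alpha: overlay opacity, from 0.0 (invisible) to 1.0 (fully opaque).
    """
    return tuple(round(o * alpha + b * (1 - alpha))
                 for b, o in zip(base, overlay))

# White text at 50% opacity over a black frame pixel gives mid grey.
assert blend_over((0, 0, 0), (255, 255, 255), 0.5) == (128, 128, 128)
```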
And that’s basically it, these are the basics of video editing. But is it really that simple? Not quite. Videos and video editing have come quite far, and the process is a little more complex than what we have described. In fact, it is a lot more complex than you might think! There are many dimensions along which videos can differ from one another. For example, videos can have different widths and heights (aka ‘aspect ratios’), colour representation can vary drastically from one video to another, videos won’t necessarily work everywhere because of incompatible codecs or formats, and much, much more.
And that is exactly why working with video editing code can often feel like an endless game of whack-a-mole: something that works for one video or device might not work for many others, so it was important that we understood how to cater to all of the edge cases.
Both Sabba and I genuinely believed then, and still believe now, that opening up a simple app in your browser and editing videos that way is a much easier and more user-friendly experience than downloading 10GB+ video editing software that you then have to spend hours learning. Luckily, we are web developers by trade, so we started experimenting with our ideas right away.
Once we had a good grasp of video editing fundamentals, we set about our task. We had two big challenges ahead of us. First, we needed a backend that could create high-quality custom videos. Second, we needed a frontend GUI layer that would accurately emulate the final result that comes out the other end. To make things simpler, we decided to focus on the backend first. The big question we wanted to answer was: “Is it actually possible to create high-quality videos programmatically using open-source technologies?”
Having read a few startup articles, we were wary of overengineering our tech stack before knowing if anyone would even use it. So we decided to build our first MVP using Processing, the simplest possible solution we could think of.
Processing Framework, Video Editor:
Sabba was the first to suggest we use this framework, as he was already very familiar with it from his work as a product & interaction designer. Processing is an open-source, Java-based creative coding toolkit that is extremely easy to get started with. It looks something like this…
Code goes on the left and the visual output appears in a separate window. The point of Processing is to help more people get into coding through visual arts. Another great thing about Processing is that you can run it headlessly from a script, which was ideal for us. We were immediately drawn to Processing due to its vast out-of-the-box toolkit and its incredible simplicity. It even had a record() function, so we could record our animations without building our own custom encoding kit. We were able to create a super simplistic video editor in a matter of minutes.
Unfortunately, though, we really struggled to push this technology far. Something that was immediately obvious to us about Processing is how unbelievably slow it was: rendering a few seconds of video could take 5 minutes or more.
Here is an example of one of our early prototypes. It’ll give you an idea of what working with Processing could look like.
Aside from this, even though we had a way to embed videos into Processing, we couldn’t do it while preserving the audio, so we’d have had to create another tool to handle the audio as well. But the biggest issue was that we couldn’t integrate Processing seamlessly with our GUI. So we put our rose-tinted glasses to the side and decided it’d be best to explore technologies that allowed us more freedom to do what we needed.
Building a video editor with Phantom.js
Having failed miserably trying to get the Processing framework to work for us we came up with another hacky idea. Since our biggest problem at this point was to build a system that would seamlessly work with the frontend, we thought “Why not just run our frontend on the backend?”.
Sounds confusing? It really isn’t; in fact, the idea behind this is incredibly simple. All we needed to do was build a GUI preview in the browser and then run that same browser preview headlessly on our server and record it. What we wanted was for users to upload their video and, once it was displayed in the editor, let them add extra layers to it with CSS and HTML. For instance, we could use absolute positioning to place text elements on top of the video. We’d then replay the browser preview on the backend and record it to get the final result.
The immediate benefit of having Phantom.js is that we didn’t have to write 2 systems at once (one for frontend and another for backend), thus avoiding double work. And it worked brilliantly at first, we were able to build plenty of features with Phantom’s incredibly simple API, such as adding images or text, in a short period of time. To give you an idea of what is possible with Phantom here is a snippet of code that loads Google homepage and captures it as an image:
While the API itself was incredibly simple to work with, it was only a matter of time until we started to hit Phantom.js’s limits as well.
The biggest problem we found with Phantom is that it doesn’t give developers who want to record video any fine-grained control over individual frames. Specifically, Phantom.js can only capture browser contents at 1-second intervals, giving us, at best, a decent GIF.
Another glaring issue was that, since we were recording the frontend, we’d have to wait until Phantom had finished replaying an entire video on the backend, which isn’t ideal if the video is long. Naively, we thought these issues could easily be fixed later; however, we clearly underestimated the effort it would take. Initially we were inspired by Giphy’s GIF maker and were OK with the limitation, but after speaking to lots of potential users we realised this was simply not good enough.
Adobe After Effects Render
As we were scratching our heads looking for another solution, we noticed that there were already quite a lot of video editing websites that let you create high-quality videos easily, much better than what, frankly, we can even achieve with VEED today. The ONLY issue is that these videos are part of a rigid template that you can’t modify or change. In addition, the cost of running a setup like this is just not scalable.
The way those websites operate is that they use Adobe After Effects or Adobe Premiere Pro SDKs behind the scenes, with high-quality predefined templates that leave a few slots for the user to fill in. Namely, they’d take your video and inject it into the template, sometimes letting you modify text in certain areas, but you’d have no control over styling and pretty much zero creative input. You’d either love the template or move on.
While the idea of having high-quality video was intriguing to us, Adobe’s SDKs left much to be desired, and we really wanted to build something that would exceed content creators’ expectations. After all, we were content creators ourselves (terrible ones, admittedly, but content creators nonetheless) and were, first and foremost, building VEED for ourselves, and what we wanted was an editor that gives you the freedom to express yourself in whatever way you want. Quality is something that comes with time, so we weren’t worried about that either.
FFmpeg & C++ Video Editor
If you are a developer who has been even remotely interested in the video editing space, or, better yet, has tried building some video-related tools, chances are you have used or at least heard of the FFmpeg command-line tools. FFmpeg is a complete, cross-platform solution to record, convert and stream audio and video. Without it, most, if not all, of the browser-based video editing tools wouldn’t exist today. FFmpeg is truly a godsend piece of software, a swiss army knife that helped create a multibillion-dollar industry on the back of its incredibly versatile API.
We are about to tell you how we utilised FFmpeg and its toolkit to create most of the video editing features you see today on VEED, but we just don’t feel right doing so without giving proper credit to the makers. Fabrice Bellard and Michael Niedermayer, if you happen to be reading this, thanks for everything you have done; you really deserve all of the credit.
So, as we were wrapping things up with Phantom.js and knew Adobe’s SDKs weren’t truly a good option, we were craving a perfect solution, something that could help us push the boundaries of web-based video editing. Frankly, we were tired of wasting our time rebuilding our tech.
At this point we really had two options. We could either try one more third-party video editing toolkit like moviepy, or build our own custom rendering code. Moviepy was definitely an intriguing option, but after so many failed attempts at building the editor, we just weren’t sure we could take a gamble once again. In hindsight, we are extremely happy that we didn’t choose moviepy, as in the long run it would have ended up bottlenecking us.
Instead, we came to a pretty tough conclusion: if we wanted to build high-quality video editing software that could challenge many of today’s video editing giants, it had to be done in a way that gives us the most freedom to modify and expand our code. That also meant we’d need to properly roll up our sleeves this time round and prepare for a long build cycle, every startup’s worst nightmare. And that is exactly what we did: we chose C++ as the programming language for our rendering logic. It gave us the opportunity to dig into many of the lower-level details of video editing that previously were not available to us. For one, using C++ we could now work with libavcodec and FFmpeg’s C libraries directly, without the limitations of the CLI toolkit.
At this point we were really hoping that by slowing down now, we’d be able to reap the fruits of our labour for many years to come.
One thing that definitely needs mentioning first is that we were initially still looking for ways to make our lives a little easier, so instead of doing much of the encoding and editing directly, we opted to use OpenCV to save some time when we first started the transition.
OpenCV is one of the most, if not the most, popular open-source computer vision libraries out there. So why would we even want to use computer vision software for video editing, when we had no use for computer vision at this point at all? The reason is pretty simple, actually: it was the quickest way for us to start. Using OpenCV we were able to generate videos with as little code as possible.
OpenCV itself uses FFmpeg and libavcodec under the hood, which in turn means that, even though OpenCV is not designed with video editing in mind, it comes with a lot of out-of-the-box features that are perfect for it.
We used OpenCV for a good 4-5 months, right up to our trip to Y Combinator, before finally deciding to part ways with it. There are several reasons for that, of course. As you can imagine, editing videos in the cloud (and just generally) is quite a resource-heavy task, and we really have to keep track of our resource usage on the servers to make sure what we are doing is sustainable in the long run.
As our user base and server costs kept increasing, we figured the best solution to our growing pains was to ditch OpenCV in favor of working with FFmpeg and libavcodec directly. We have mentioned libavcodec a few times at this point; despite the confusing name, all it is is an open-source library of codecs for encoding and decoding video and audio data. libavcodec is an integral part of many of today’s open-source multimedia applications and frameworks.
The biggest challenge of no longer relying on OpenCV’s toolkit is that it increased the complexity of our code considerably. But sometimes you gotta do what you gotta do, friends: we really needed to directly process all of the slightly different formats of video that come out of libavcodec, so that’s what we ended up doing in the end.
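One concrete example of that format wrangling: decoders typically hand frames back in a YUV colour space, while compositing is easiest in RGB, so frames get normalised first. Here is a single-pixel Python sketch using the common full-range BT.601 coefficients (chosen here purely for illustration; the correct conversion depends on each file's colour metadata):

```python
def clamp(x):
    """Round and clip a channel value into the valid 0-255 range."""
    return max(0, min(255, round(x)))

def yuv_to_rgb(y, cb, cr):
    """Convert one full-range BT.601 YCbCr pixel to RGB."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return (clamp(r), clamp(g), clamp(b))

# Neutral chroma leaves a grey pixel unchanged...
assert yuv_to_rgb(128, 128, 128) == (128, 128, 128)
# ...and pure white stays white.
assert yuv_to_rgb(255, 128, 128) == (255, 255, 255)
```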
And this is really why there aren’t many startups in our market space taking a similar approach to us; at least we have not heard of any. While there is a massive upside to having complete control of your entire system, we gotta say, a year ago we weren’t sure if what we were doing was even possible. The initial setup for libavcodec needed a lot of boilerplate code, and documentation is scarce (most of the time we just ended up reading the source code). Besides video, audio can be really complicated to handle as well (especially when visualizing it at the same time) and needs to be converted to fit the requirements of the AAC codec we use in our standard output format. Thankfully, FFmpeg helps us here too with libavfilter, which can mix, cut and convert audio. AAC’s tight requirements probably caused more than 50% of the bugs and crashes while rendering videos over the last year, but by slowly tuning the libavfilter parameters for processing the audio, we got to a point where almost any file can be rendered, sometimes even if it has some corrupted content, which happens surprisingly often.
Building out the GUI (OpenGL & WebGL)
Finally, all we had to do was build the graphical user interface for our video editor. Something that might not be immediately obvious is that this actually requires two video renderers working at the same time: one that renders previews on the frontend and another that renders the actual videos on the backend. The most difficult part here is making these two renderers agree as seamlessly as possible, i.e. what you see is what you get at the end.
The general idea behind the design is as follows:
1. User does all the necessary editing on the frontend.
2. If they added images or other assets we collect them in our storage.
3. Following the user’s actions, we also create a ‘recipe’ of instructions describing all the edits the user made. This is a set of instructions for the second renderer, with references to the original assets (e.g. this recipe might contain the x and y coordinates of where to place text and images, and much more).
4. Our C++ renderer creates a user's video based on this information.
5. Finally, a user gets the result on the frontend.
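To make step 3 concrete, here is roughly what such a recipe might look like serialised as JSON. Every field name below is made up for illustration; VEED's real schema is not public:

```python
import json

# A hypothetical edit recipe: global output settings, plus one entry per
# layer, each referencing an uploaded asset and carrying placement/timing.
recipe = {
    "output": {"width": 1280, "height": 720, "fps": 30},
    "layers": [
        {"type": "video", "src": "assets/upload.mp4", "trim": [2.0, 12.0]},
        {"type": "image", "src": "assets/title.png",
         "x": 100, "y": 50, "from": 0.0, "to": 3.0},
    ],
}

# The frontend serialises the recipe and hands it to the backend renderer,
# which replays the same instructions against the original assets.
payload = json.dumps(recipe)
assert json.loads(payload)["layers"][0]["type"] == "video"
```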
In reality, again, this is much harder to achieve than it might sound. Since these are two completely separate renderers, it would be super hard to ensure that what you see is what you actually get. So we cheated. Instead of trying to render text on the backend, we just upload images of the text rendered on the frontend as separate assets. Another way we make it easier is to rely on OpenGL shaders for moving things and rendering effects. There is a small subset of OpenGL and WebGL shaders that works exactly the same way in both, and we basically built our renderers around that subset. Of course, we’ve seen small discrepancies between the frontend and backend, but very rarely. Even then, it usually boiled down to differences in the decoding of the video rather than the renderer itself.
OpenGL (aka the Open Graphics Library) is a cross-language, cross-platform API for rendering 2D and 3D vector graphics. WebGL is the same thing, but for web applications running in browsers. We use these two technologies to render 2D graphics, or more specifically, the frames of your videos. If you are not familiar with shaders, they are small programs, typically run on the GPU for every pixel, that can add that extra flavour to your videos. Graphics is a science in itself, and if you’d like to learn more about shaders and graphics, Shader Toy might be the most creative and fun way to start.
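As a tiny taste of what a per-pixel shader computes, here is the classic grayscale effect sketched in Python. The luma weights below are the standard Rec. 601 ones that many grayscale shaders use; an actual GLSL/WebGL fragment shader would run this same arithmetic on the GPU for every pixel of every frame:

```python
def grayscale(r, g, b):
    """Mix the colour channels into one luma value (Rec. 601 weights)."""
    y = round(0.299 * r + 0.587 * g + 0.114 * b)
    return (y, y, y)

# Pure green is perceptually brighter than pure blue; the weights reflect it.
assert grayscale(0, 255, 0) == (150, 150, 150)
assert grayscale(0, 0, 255) == (29, 29, 29)
```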
Something that was particularly hard for us was making heavy-duty WebGL logic work with the way React works. In fact, our frontend renderer is by far our most complex piece of software today, even compared to our C++ renderer. This is because of how many changing elements there are on the frontend, as well as us trying to make React do what it wasn’t originally designed to do.
We might cover how our frontend works in deeper detail in the future, as it is a whole other blog post in itself.
The ideal stack
While we think we have done a pretty good job so far of building a rather versatile and powerful video editing experience in the browser, we have also taken a lot of shortcuts, and the system as it is right now is far from what it could potentially become.
We are constantly making improvements to increase the performance of your renders, minimising the time you have to wait for your videos as much as possible. Some of the ways we do this: running our C++ renderers as individual nodes and scaling their number in tune with the minute-by-minute demand on our website; decreasing load times on the site using smarter caching and CDN techniques; and, of course, talking to our users, who help us along the way by telling us if anything needs fixing :)
We are far from done here at VEED, and there are a million things we are constantly thinking about that could improve our code even more. Perhaps, instead of rendering text on the frontend, we could use headless browsers again; or maybe we could make rendering even faster using GPU acceleration or a lower-level graphics API like Vulkan. We don’t know yet.
What we do know though is that the future of video editing is in the cloud. After 2 years of hard work our incredibly talented team has managed to build a powerful video API that enabled millions of users to create and publish their video content with ease.
Something we were passionate about from the very start is sharing our findings and knowledge with the world. As part of this, we are really excited to announce that from now on, if you are a developer, you can use our technology and architecture to create your own video editing tools without the pain of going through all of the hurdles that we have. If you are curious, check out our video editing API.
If you made it this far, thanks for sticking with us through the little story of how we have built our technology and how we are trying to improve it in the years to come. We hope this gave you some visibility into how and why decisions are made and provided a little bit of entertainment.