From the “this might not be a good idea” department comes the announcement by Microsoft of VASA-1. Here’s the TL;DR on this:
We introduce VASA, a framework for generating lifelike talking faces of virtual characters with appealing visual affective skills (VAS), given a single static image and a speech audio clip. Our premiere model, VASA-1, is capable of not only producing lip movements that are exquisitely synchronized with the audio, but also capturing a large spectrum of facial nuances and natural head motions that contribute to the perception of authenticity and liveliness. The core innovations include a holistic facial dynamics and head movement generation model that works in a face latent space, and the development of such an expressive and disentangled face latent space using videos. Through extensive experiments including evaluation on a set of new metrics, we show that our method significantly outperforms previous methods along various dimensions comprehensively. Our method not only delivers high video quality with realistic facial and head dynamics but also supports the online generation of 512×512 videos at up to 40 FPS with negligible starting latency. It paves the way for real-time engagements with lifelike avatars that emulate human conversational behaviors.
I’ll get to why I am lukewarm at best on this. But first, let’s see what Kevin Surace, Chair, Token, has to say on this:
Before Microsoft there have already been several other demonstrations of animating single face images and cloning voices, so we have been able to experience this for many months. Microsoft’s entry here is excellent and state of the art across all models I have seen. The implications for personalizing emails and other business mass communication are fabulous, as is animating older pictures. To some extent this is just fun, and to another it has solid business applications we will all use in the coming months and years.
Of course one can replace a live webcam with a virtual version of yourself, especially when you have a bad hair day. But the images we see today are already a digitally reproduced image of you. Meaning the webcam is gathering pixels, processing them, compressing them, sending them across the country, and recomposing them on someone’s screen. This is arguably the next extension of that, manipulating the pixels in real time so that you can truly look your best. And it’s still your voice and your words.
All synthetic media is democratizing what Hollywood has been able to do with CGI for many years. All of this will lead to low-cost content creation at a scale we have never seen. And that’s great for creators, even if overwhelming for viewers.
Of course we continue down a road of being able to produce more convincing deepfakes at many levels. Arguably that train left the station when Photoshop was introduced. This continues to take us closer to perfect video and audio representations of ourselves, with and without our permission. The major models will include a watermark stating the content is AI generated, but in time open-source models will emerge that don’t.
We have been photoshopping ourselves for decades, improving our looks and erasing blemishes. Is that ethical? Where does it become unethical? We all want to be and look our best, and to multiply ourselves. When used properly by us, this tech does that amazingly well.
Customer service and entertainment are obvious applications, as are marketing and mass communications. It’s basically a digital twin of ourselves, or perhaps of a relative or a coworker (all with permission). How about birthday cards fully customized for you from a celebrity? Or, when you are sick, sending a video of you looking your best? It’s all becoming possible and will be right in our pockets in the coming year.
Here’s my $0.02 worth. I can see scenarios where the following can happen:
- This could allow people to fake video chats
- This could make real people appear to say things they never actually said
- This could allow harassment from a single social media photo
I think that Microsoft needs to demonstrate and speak to how they will gatekeep this so that it’s used with the best of intentions rather than the worst of intentions. That would take me from being lukewarm to something more positive.
Microsoft Introduces VASA-1…. Which Might Not Be The Best Thing For Us Humans Just Yet
Posted in Commentary with tags Microsoft on April 19, 2024 by itnerd