Model/Pipeline/Scheduler description
Hello!
There is a new 1.7B-parameter diffusion-based text-to-video synthesis model by ModelScope, as noted by AK (@_akhaliq): https://twitter.com/_akhaliq/status/1637321077553606657?s=20. Both the model implementation and the weights (downloaded through their pipeline) are openly available, and it can already be run via a Hugging Face Space. However, the model lacks many possible optimizations, especially a low-VRAM mode and other accessibility options, and I believe it would benefit greatly from the help of the Diffusers community.
Example: a monkey playing drums (attached video: tmp2tkrr492.mp4)
At the moment the model needs roughly 16 GB of VRAM, but since it is a combination of 4 GB, 6 GB, and 5 GB sub-models, I believe that with half precision and by running the sub-models sequentially (offloading the inactive ones) it should eventually be possible to launch it on modern consumer hardware; see the sketch below.
The license is Apache-2.0, so there should be no problem with using the code as a reference.
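For illustration, here is a minimal sketch of what a memory-friendly setup could look like once the model has a Diffusers port. This is purely hypothetical: the repo id is a placeholder and the pipeline class and outputs are assumptions; only the half-precision loading, sequential CPU offloading, and attention slicing calls are existing Diffusers APIs.

```python
import torch
from diffusers import DiffusionPipeline

# Placeholder repo id for a hypothetical Diffusers-format conversion of the weights.
pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-diffusers",  # hypothetical, does not exist yet
    torch_dtype=torch.float16,             # half precision roughly halves the weight footprint
)

# Keep only the currently active sub-model on the GPU, offload the rest to CPU RAM.
pipe.enable_sequential_cpu_offload()
# Compute attention in slices to lower peak memory at some speed cost.
pipe.enable_attention_slicing()

result = pipe("a monkey playing drums", num_inference_steps=25)
```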
Open source status
The model implementation is available
The model weights are available (Only relevant if addition is not a scheduler).
Provide useful links for the implementation
HuggingFace space:
https://huggingface.co/spaces/damo-vilab/modelscope-text-to-video-synthesis
All the parts of the model at HuggingFace:
https://huggingface.co/damo-vilab/modelscope-damo-text-to-video-synthesis/tree/main
The model PyTorch implementation (see the usage sketch at the end of this issue):
https://github.com/modelscope/modelscope/tree/master/modelscope/models/multi_modal/video_synthesis
Google Colab from the devs:
https://colab.research.google.com/drive/1uW1ZqswkQ9Z9bp5Nbo5z59cAn7I0hE6R?usp=sharing
License: Apache-2.0
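For reference, the current ModelScope implementation can be run roughly as below, based on the usage shown in the links above. The local weights directory and prompt are illustrative, and the exact pipeline API should be double-checked against the ModelScope documentation.

```python
import pathlib

from huggingface_hub import snapshot_download
from modelscope.pipelines import pipeline
from modelscope.outputs import OutputKeys

# Download all sub-model weights from the Hugging Face repo linked above.
model_dir = pathlib.Path("weights")
snapshot_download(
    "damo-vilab/modelscope-damo-text-to-video-synthesis",
    repo_type="model",
    local_dir=model_dir,
)

# Build the ModelScope text-to-video pipeline from the downloaded weights.
pipe = pipeline("text-to-video-synthesis", model_dir.as_posix())

# The pipeline takes a dict with the prompt and returns the path to the rendered video.
output_video_path = pipe({"text": "a monkey playing drums"})[OutputKeys.OUTPUT_VIDEO]
print(output_video_path)
```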