OpenAI cofounder Ilya Sutskever – “The world isn’t just text. Humans don’t just talk: we also see. A lot of important context comes from looking.”

OpenAI has just released two new deep learning models, DALL-E and CLIP, that combine language and images in a way that makes AI better at understanding both words and what they refer to. Much as the team at OpenAI trained a single deep learning model (GPT-3) to understand and produce language (stories, replies, jokes, predictions, and so on) simply by feeding it a vast amount of text, the goal now is to deepen the model's understanding of what it generates by pairing text with images, so it can see what words and sentences actually mean.

This new combination of neural networks, joining text generation with image generation, has produced some pretty amazing results. Reading the newly released blog post published by OpenAI, you will see that the research team has left interactive options available to tinker with, with a set of different descriptive prompts, so you can see the different outputs the network can produce. Some requests work better than others, but overall it is truly amazing how this deep learning network works through such a large variety of topics and prompts, correctly producing the result asked of it.

[ For each text input, only the top 32 samples out of 512 are presented. In other words, DALL-E generated plenty of other images of avocado penguins and doughnut chairs that were less impressive. (This reranking is done by the neural network dubbed ‘CLIP’.) It is easy to get an idea from this of how limitless this technology can be when applied to a specific purpose or topic. ]
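The 32-of-512 step described above is a simple rerank: generate many candidates, score each one against the prompt, and keep only the highest-scoring few. Here is a minimal sketch in Python, with a placeholder `clip_score` function standing in for the real CLIP model (which embeds the text and each image and compares them by similarity); the function names and sample data are illustrative assumptions, not OpenAI's actual code.

```python
import random

def clip_score(prompt, image):
    """Stand-in for CLIP: return a text-image similarity score.
    (The real CLIP embeds both the prompt and the image and takes
    a similarity between the embeddings; here we just return a
    random placeholder number for illustration.)"""
    return random.random()

def rerank(prompt, candidates, top_k=32):
    """Score every generated candidate against the prompt and keep
    only the best top_k, mirroring DALL-E's 32-of-512 presentation."""
    ranked = sorted(candidates,
                    key=lambda img: clip_score(prompt, img),
                    reverse=True)
    return ranked[:top_k]

# 512 hypothetical generated samples, reduced to the 32 shown.
samples = [f"sample_{i}" for i in range(512)]
best = rerank("an armchair in the shape of an avocado", samples)
print(len(best))  # 32
```

The key design point is that generation and scoring are decoupled: DALL-E only has to produce plausible candidates, while CLIP acts as an independent judge of how well each image matches the text.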

The network is able to generate many different visual concepts: color, size, setting (environment), texture, animals, and pretty much any other basic description you can think of.

It has been reported that this new type of integration between different transformer models (in this case, language and image) marks the dawn of a new AI paradigm known as “multimodal AI” [ Multimodal AI: systems that are capable of interpreting, synthesizing, and translating between multiple informational modalities ]. It is easy to understand why this new method of combining multiple pretrained models is so important when you experience firsthand, from the blog post published by OpenAI, how the model can render an almost exact image from any descriptive text input.

[ “Most AI systems in existence today deal with only one type of data. NLP models (e.g., GPT-3) handle only text; computer vision models (e.g., facial recognition systems) handle only images. This is a far less rich form of intelligence than what the human brain achieves effortlessly.” ]

Humans receive information and engage with each other and the world not through one sense but through five, each interpreted in its own unique way. By combining sight, sound, touch, smell, and taste, we are able to understand what things are and what they mean to us. Given this impressive example from OpenAI of how important multimodal networks are for the future of AI, we can only imagine how well many different networks will work when paired together in the months and years ahead. As AI learns to incorporate multiple neural networks in increasingly sophisticated ways, systems will eventually be built that can seamlessly work across audio, video, speech, images, written text (NLP), and beyond.

As I’m sure there are many limitations not mentioned in this post, I can only imagine how fast these models will grow as more computer scientists spend time integrating even more parameters and sensory modalities into new multimodal systems.

I think it is important to continue following this type of work published by OpenAI, and to research the CLIP network to get an understanding of how and why it ranks the outcomes it produces.

[ LNK ]