
You can now prompt ChatGPT with pictures and voice commands

The Verge: https://www.theverge.com/2023/9/25/23886699/chatgpt-pictures-voice-commands-ai-chatbot-openai

The ChatGPT App: Synthetic Voices and Visual Conversations

The new version of the ChatGPT app has a headphones icon in the upper right and photo and camera icons in an expanding menu in the lower left. The voice and visual features work by converting the input to text, using speech or image recognition, so the chatbot can generate a response. The app then replies in either voice or text, depending on which mode the user is in. It can also read its responses aloud if you ask it to, even though your spoken query is actually processed as text behind the scenes. It responds in one of five voices, wholesomely named Juniper, Ember, Sky, Cove, or Breeze.
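For a sense of how such a pipeline fits together, here is a rough sketch built on OpenAI's public Python SDK. It is an approximation, not the app's actual code: the model names ("whisper-1", "gpt-4", "tts-1"), the file names, and the API voice names are the public API's, and the API voices differ from the app's five named ones.

    # Sketch of the voice pipeline: speech -> text -> chatbot -> speech.
    # Assumes the openai Python SDK (openai>=1.0) and OPENAI_API_KEY set.
    from openai import OpenAI

    client = OpenAI()

    # 1. Speech recognition: convert the spoken query to text with Whisper.
    with open("question.mp3", "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=audio_file
        )

    # 2. The chatbot generates a response to the transcribed text.
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": transcript.text}],
    )
    answer = reply.choices[0].message.content

    # 3. Text-to-speech: read the answer back in a synthetic voice.
    #    API voices ("nova" here) have different names than the app's.
    speech = client.audio.speech.create(model="tts-1", voice="nova", input=answer)
    speech.write_to_file("answer.mp3")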

OpenAI’s excellent Whisper model does a lot of the speech-to-text work, and the company is rolling out a new text-to-speech model it says can generate “human-like audio from just text and a few seconds of sample speech.” ChatGPT users can choose from five preset voices, but OpenAI thinks the underlying model has broader potential. The company is working with audio services to translate spoken content into other languages while preserving the sound of the original speaker’s voice. There are lots of interesting uses for synthetic voices, and OpenAI could be a big part of that industry.
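The speech-to-text half is easy to try on its own, since Whisper is open source. A few lines of Python, assuming the openai-whisper package is installed, will transcribe an audio file locally; the file name is a placeholder.

    # Local transcription with open-source Whisper (pip install openai-whisper).
    # "base" is one of several model sizes, trading accuracy for speed.
    import whisper

    model = whisper.load_model("base")
    result = model.transcribe("voice_query.mp3")
    print(result["text"])  # the transcript a chatbot would respond to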

The fact that you can build a capable synthetic voice from just a few seconds of audio opens the door to a variety of problematic use cases. “These capabilities also present new risks, such as the potential for malicious actors to impersonate public figures or commit fraud,” the company says in a blog post announcing the new features. For now, the model isn’t available for broad use; instead, it will be restricted to specific use cases and partnerships.

ChatGPT’s Back-and-Forth: The Search for Multimodal Artificial Intelligence

The image search works a bit like Google Lens. You take a picture of something that interests you, and the app responds with information about it. You can also use the app’s drawing tool to make your query clearer, or speak or type questions to go along with the image. This is where ChatGPT’s back-and-forth nature is helpful: rather than doing a search, getting the wrong answer, and then doing another search, you can prompt the bot and refine the answer as you go. (This is a lot like what Google is doing with multimodal search, too.)
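As a rough illustration of that refinement loop, OpenAI's public API accepts an image alongside text and keeps it in the conversation for follow-up questions. The model name, image URL, and questions below are placeholders, not details from the app itself.

    # Sketch of image-plus-text chat with a follow-up refinement,
    # using the openai Python SDK's vision-capable chat input.
    from openai import OpenAI

    client = OpenAI()
    history = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What kind of plant is this?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ]
    first = client.chat.completions.create(model="gpt-4o", messages=history)
    history.append(
        {"role": "assistant", "content": first.choices[0].message.content}
    )

    # Refine instead of starting a new search: the image stays in context.
    history.append(
        {"role": "user", "content": "How often should I water it indoors?"}
    )
    second = client.chat.completions.create(model="gpt-4o", messages=history)
    print(second.choices[0].message.content)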

The models that power the OpenAI chatbot were trained on large amounts of text gathered from across the web. Many experts believe that, to keep improving, artificial intelligence will need to learn from audio and visual data the way animal and human intelligence does, not just from text.

Google’s next major AI model, Gemini, is widely rumored to be “multimodal,” meaning it will be able to handle more than just text, perhaps taking video, images, and voice as inputs. “From a model performance standpoint, intuitively we would expect multimodal models to outperform models trained on a single modality,” says Trevor Darrell, a professor at UC Berkeley and a cofounder of Prompt AI, a startup working on combining natural language with image generation and manipulation. A model built using only language, by contrast, can only ever learn to use language.
