VisGuide - Context & Part 1
A while ago we created a project to explore whether AI and vision models, running on commodity hardware, could be combined into a cost-effective solution that makes a meaningful difference to someone's quality of life.
The result was VisGuide, a solution that gives the user a verbal description of their surroundings, helping to guide them and keep them safe, supporting their independence and enabling them to take part in experiences they had previously been excluded from.
As you will see below, we managed to create a pretty cool solution and published the code as an open source project. With the recent release of OpenAI's Realtime API for voice, we thought it was time to revisit this project and refresh it around the new capabilities. Part 1 (the original VisGuide) is below, and we are actively working on Part 2, so watch this space.
Motivation & Inspiration

Project Scope
The aim of VisGuide is straightforward: to provide visual assistance to the partially sighted or blind using commodity hardware and GenAI. This initiative explores the potential of Generative AI to serve a practical, impactful purpose, reflecting a broader trend towards technology solutions that enhance quality of life for individuals with disabilities.
VisGuide Foundations
- Generative AI: Utilising OpenAI’s vision models, VisGuide interprets visual data to provide verbal descriptions of the user’s surroundings.
- ElevenLabs: The platform is used for voice synthesis, keeping the verbal feedback clear and natural while making the text-to-speech step as fast as possible (a minimal sketch of this pipeline follows this list).
- ChatGPT & GitHub Copilot: The development process leveraged a combination of ChatGPT and GitHub Copilot. Both worked well, but I mainly used ChatGPT because the user experience was simple, and when it didn't do what I needed I just made my prompt more specific.
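To make the pipeline concrete, here is a minimal sketch of the capture-describe-speak loop, assuming Python, the official OpenAI client and ElevenLabs' REST text-to-speech endpoint. The function names, model name, voice ID and prompt wording are illustrative assumptions, not the exact values used in VisGuide.

```python
# Sketch of the VisGuide loop: take a captured frame, ask an OpenAI vision
# model for a short scene description, then speak it via ElevenLabs TTS.
# Model name, voice ID and prompt text below are illustrative assumptions.
import base64
import requests
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def describe_scene(jpeg_path: str) -> str:
    with open(jpeg_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Briefly describe this scene for a blind pedestrian, "
                         "noting obstacles and hazards."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        max_tokens=120,
    )
    return response.choices[0].message.content


def speak(text: str, api_key: str, voice_id: str) -> bytes:
    # ElevenLabs text-to-speech REST endpoint; returns audio bytes to play back.
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"
    resp = requests.post(url, headers={"xi-api-key": api_key}, json={"text": text})
    resp.raise_for_status()
    return resp.content
```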

Open Source Code
Here's an example of the "guide" narration:
- [Audio clip: narration_example]
Modes & Uses
- Guide Mode
This mode is intended to provide a brief, succinct description of the scene to help the user navigate and avoid obstacles and risks.
- Tourist Mode
This is where the user can take part in "sightseeing" and hear a colourful, artistic narration of the general scene. Why shouldn't partially sighted and/or blind people be able to go sightseeing? This, combined with the user's other senses, helps them appreciate what their sighted peers can experience. (Illustrative prompt templates for both modes follow this list.)
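The two modes differ only in the instruction sent alongside each image. The templates below are assumptions to illustrate the idea; the exact wording lives in the open source code.

```python
# Illustrative prompt templates for the two modes; the exact wording used in
# VisGuide may differ. Only the instruction changes, the rest of the request
# (image, model, voice) stays the same.
MODE_PROMPTS = {
    "guide": (
        "In one or two short sentences, describe the scene ahead, calling out "
        "obstacles, steps, traffic and anything the walker should avoid."
    ),
    "tourist": (
        "Give a vivid, colourful narration of the scene, as a friendly tour "
        "guide would, focusing on atmosphere, architecture and points of interest."
    ),
}


def build_messages(mode: str, image_b64: str) -> list[dict]:
    # Build the chat message for the selected mode around the captured frame.
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": MODE_PROMPTS[mode]},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }]
```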
Impact & Potential
VisGuide is a working example of digital tools and services being applied to a physical problem, and of how they can make a real difference. It offers increased independence for visually impaired users and showcases the potential of Generative AI to address real-world challenges.
What's Next?
- Interactive Uses
It would be ideal to enable the user to "query" the scene. For example, the user might ask "Where is the road crossing?" or "Please guide me to the door". Here, the user's input updates or augments the prompt to influence the narration response (a small sketch of this follows the list).
- Integrated Hardware
Wearing a Raspberry Pi around your neck with a battery pack is hardly ideal, but it is great for development. Ideally, VisGuide would work with something like Meta's smart glasses, which have the camera and audio output built in and would provide an unobtrusive user experience.
- Local Processing
VisGuide currently requires internet access to reach the SaaS LLM services. Mobile phones are incredibly powerful compute devices these days, so it would be ideal to perform all processing locally and remove the dependency on connectivity and the internet; however, this would significantly increase the complexity.
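As a hypothetical sketch of the interactive idea, a transcribed user question could simply be appended to the mode prompt before the next frame is sent. The helper below is an assumption for illustration, not existing VisGuide code.

```python
# Hypothetical sketch: fold a spoken user question into the mode prompt so the
# next narration answers it. How the question is transcribed (speech-to-text)
# is out of scope here and assumed to happen elsewhere.
def build_interactive_prompt(mode_prompt: str, user_question: str | None) -> str:
    if user_question:
        return (
            f"{mode_prompt}\n\n"
            f'The user has asked: "{user_question}". '
            "Answer the question using what is visible in the image, "
            "then continue the normal description."
        )
    return mode_prompt


# Example usage with the guide-mode template from the earlier sketch:
# build_interactive_prompt(MODE_PROMPTS["guide"], "Where is the road crossing?")
```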
Conclusion
