VisGuide - A Generative AI Project for Visual Assistance

01.01.24 08:00 AM By Stuart Hirst

VisGuide - Context & Part 1

A while ago we created a project to explore the capabilities of AI and vision, to see whether we could build a cost-effective solution, using commodity hardware, that would have a meaningful impact and improve someone's quality of life.

We created a solution called VisGuide that provided the user with a verbal description of their surroundings, helping to both guide them and keep them safe, empowering their independence and enabling them to participate in experiences from which they had previously been excluded.


As you will see below, we managed to create a pretty cool solution and provided the code as an open-source project. With the recent release of OpenAI's Realtime API, we thought it was time to revisit this project and refresh it based on the new capabilities. Part 1 (the original VisGuide) is below, and we are actively working on Part 2. Watch this space...



Motivation & Inspiration

GenAI is constantly in the news and the rate of innovation is like nothing we have seen before. There is no doubt that it will change lots of things, but some things are more obvious than others. It's great at building websites, writing blog posts, creating pictures and much more, but what intrigues me is how these capabilities can help to solve real issues rather than just being "cool".

This project was inspired by another project called "Narrator" that has David Attenborough narrate what he sees from your webcam. It's a cool little project that chains OpenAI and ElevenLabs to do something fun. Check out Narrator here: https://github.com/cbh123/narrator

As someone who only dabbles in coding for simple tasks, perhaps once a year, my familiarity with Python could best be described as basic. Every time I use it I effectively have to relearn it. And so the scene is set: can I use GenAI to both simplify creating the code and have it power something useful? This led to "What if it was mobile?" and "Could it help guide someone and avoid danger?"

Project Scope

The aim of VisGuide is straightforward: to provide visual assistance to the partially sighted or blind using commodity hardware and GenAI. This initiative explores the potential of Generative AI to serve a practical, impactful purpose, reflecting a broader trend towards technology solutions that enhance quality of life for individuals with disabilities.



VisGuide Foundations

VisGuide’s hardware setup is simple, utilising a Raspberry Pi Zero with a camera and a power bank for mobility. Connectivity is achieved through a mobile phone hotspot, enabling the device to operate in diverse environments. The choice of hardware underscores the project’s commitment to accessibility and affordability.

The software, written in Python, is the core of VisGuide. It incorporates:
  • Generative AI: Utilising OpenAI’s vision models, VisGuide interprets visual data to provide verbal descriptions of the user’s surroundings.
  • ElevenLabs: The platform is used for voice synthesis, ensuring the verbal feedback is clear and natural and that the text-to-speech process is as fast as possible (a sketch of how these two services chain together follows this list).
  • ChatGPT & GitHub Copilot: The development process leveraged a combination of ChatGPT and GitHub Copilot. I found that both worked well but mainly used ChatGPT, as the user experience was simple and, when it didn’t do what I needed, I just changed my prompt to be more specific.
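
To give a flavour of how these pieces chain together, here is a minimal sketch (not the actual VisGuide code) of sending a captured frame to an OpenAI vision model and turning the description into speech via the ElevenLabs REST API. The model name, prompt handling and voice ID are illustrative assumptions:

import base64
import os

import requests
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def describe_scene(image_path: str, prompt: str) -> str:
    """Send a frame to an OpenAI vision-capable model and return its description."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model; the original project predates gpt-4o
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        max_tokens=150,
    )
    return response.choices[0].message.content

def speak(text: str, voice_id: str = "YOUR_VOICE_ID") -> bytes:
    """Turn text into MP3 bytes via the ElevenLabs text-to-speech endpoint."""
    resp = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
        headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
        json={"text": text, "model_id": "eleven_monolingual_v1"},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.content

The returned MP3 bytes can then be written to a file or piped straight to an audio player on the device.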


Open Source Code

You can run it on a Mac to test it, and it also runs on a Raspberry Pi Zero.
It is not yet optimised for speed: it runs a little slowly on the Raspberry Pi but works much quicker on a Mac.

GitHub

The VisGuide project is hosted on GitHub and is open source, so you can download it and get it working locally to experiment with the combination of AI and vision:



Here’s an example of the “guide” narration:

[audio: narration_example]

Modes & Uses

VisGuide can be run in either continuous or on-demand mode. On-demand mode is where the user presses a button to capture the scene and have it narrated; continuous mode is where it runs in a loop, capturing and narrating every five seconds.
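
A rough sketch of the two modes is below. The capture_and_narrate helper is a placeholder, and using gpiozero for the button is an assumption rather than necessarily what the project does:

import time

def capture_and_narrate() -> None:
    # Placeholder: grab a frame, describe it with the vision model (see the
    # earlier sketch) and speak the result.
    ...

def run_continuous(interval_s: float = 5.0) -> None:
    # Continuous mode: narrate the scene on a fixed cadence.
    while True:
        capture_and_narrate()
        time.sleep(interval_s)

def run_on_demand(button) -> None:
    # On-demand mode: narrate once per button press; `button` is assumed to
    # be a gpiozero.Button wired to the Pi.
    while True:
        button.wait_for_press()
        capture_and_narrate()

On the Pi this might be started with run_on_demand(gpiozero.Button(17)), where the pin number is hypothetical.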

VisGuide has two primary uses: the first is as a visual assistant or “guide”, and the second is “tourist” mode, where it describes the scene with colourful language to create a mental picture for the user (a sketch of the two prompts follows the descriptions below).

  • Guide Mode
    This is intended to provide a brief, succinct description of the scene to help the user navigate and avoid obstacles and risks.

  • Tourist Mode
    This is where the user can participate in “sightseeing” and have a colourful and artistic narration of the general scene. Why shouldn’t partially sighted and/or blind people be able to participate in sightseeing? This, combined with the user’s other senses, helps them to appreciate what their sighted peers can experience.
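
The difference between the two modes comes down to the prompt sent alongside each frame. The wording below is purely illustrative, not the project's actual prompts:

# Illustrative prompt wording only; the real VisGuide prompts may differ.
PROMPTS = {
    "guide": (
        "You are guiding a visually impaired person. In one or two short "
        "sentences, describe what is directly ahead, mentioning obstacles, "
        "steps, vehicles or other hazards first."
    ),
    "tourist": (
        "You are narrating for a visually impaired sightseer. Paint a vivid, "
        "colourful mental picture of the scene in a few sentences."
    ),
}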


Impact & Potential

VisGuide is a working example of digital tools and services helping with a physical problem, and it explores how such tools can make a real difference. It offers increased independence for visually impaired users and showcases the potential of Generative AI to address and solve real-world challenges.


What’s Next?

  • Interactive Uses
    It would be ideal to enable the user to “query” the scene. For example, the user might ask “Where is the road crossing?” or “Please guide me to the door.” This is where user input updates or augments the prompt to influence the narration response (see the sketch after this list).
  • Integrated Hardware
    Wearing a Raspberry Pi around your neck with a battery pack is hardly ideal, but it is great for development. Ideally, VisGuide would work with something like Meta’s smart glasses, which have the camera and audio output built in and would provide an unobtrusive user experience.
  • Local Processing
    VisGuide currently requires internet access to leverage the SaaS LLM services. Mobile phones are incredibly powerful compute devices these days, so it would be ideal to have all processing performed locally and remove the dependency on connectivity and the internet; however, this would significantly increase the complexity.
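
One way the interactive querying might work, sketched purely as an assumption rather than an implemented feature, is to append the user's spoken question to the mode's base prompt before each vision call:

def build_prompt(mode: str, user_query: str | None = None) -> str:
    # Hypothetical helper: combine the mode's base prompt (PROMPTS as in the
    # earlier sketch) with an optional user question before the vision call.
    prompt = PROMPTS[mode]
    if user_query:
        prompt += (
            f" The user asks: '{user_query}'. Answer their question about "
            "the scene before giving any other description."
        )
    return prompt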

Conclusion

VisGuide’s development journey illustrates the power of combining simple tools with cutting-edge AI to create solutions that have a real-world impact.

As the project moves forward, it’s intended to be thought-provoking and demonstrate the profound capabilities of technology when applied with purpose and vision.

This project is more than just a technological achievement; it’s a pathway to greater accessibility and independence for those it aims to serve.

Part 2 is in development and we hope to share it soon.

Stuart Hirst

Innovation & Solutions Principal