Show HN: LLaVaVision: An AI "Be My Eyes"-like web app with a llama.cpp backend (github.com/lxe)
154 points by lxe on Nov 6, 2023 | 19 comments
A simple mobile web app inspired by Fuzzy-Search/realtime-bakllava that uses llama.cpp server backend with multimodal mode to describe and narrate what the phone camera sees.
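
For the curious, here's roughly what a single frame round-trip looks like. This is a minimal sketch, not the app's exact code: it assumes the llama.cpp server was started with BakLLaVA weights and the matching --mmproj projector, the frame path is a placeholder, and field names follow the server's /completion API from around this time:

    # Minimal sketch: send one camera frame to a llama.cpp server running
    # in multimodal mode. "frame.jpg" stands in for a captured frame.
    import base64
    import requests

    with open("frame.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    resp = requests.post(
        "http://localhost:8080/completion",
        json={
            # [img-10] marks where image id 10 is injected into the prompt
            "prompt": "USER: [img-10] Describe what you see.\nASSISTANT:",
            "image_data": [{"data": image_b64, "id": 10}],
            "n_predict": 128,
            "temperature": 0.2,
        },
    )
    print(resp.json()["content"])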

I built this thing in a few hours using a single ChatGPT thread to generate most things for me and iterate on this project. Here's the workflow: https://chat.openai.com/share/ea84ec69-5617-45e8-8772-ac2dcf...




A "Be My Eyes web app"? Well whatever. I guess Be My Eyes is just gonna be known for its Be My AI anyway. As soon as Llava is able to read text, this is gonna be pretty nice. I just wish it wasn't basically using Be My Eyes as a name to jump off of. This kind of local tech will be pretty amazing once it's fast enough to help with video games, or inaccessible apps. Add the ability for the AI to move the mouse in some of the GUI controller AI packages, and I could just say "click the next button and tell me what's on the screen afterwards" kind of thing.


This is cool! One thing that could easily be improved: the narrator still tries to describe everything, while people using Be My Eyes usually have a specific purpose (or "attention") in mind. It would be nice to also implement whisper.cpp and use spoken queries, with some CoT or prompting, to refine the description (rough sketch below).
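
A rough sketch of that flow, assuming whisper.cpp's stock CLI (the binary name, model path, and audio path are placeholders):

    # Sketch: transcribe a spoken query with whisper.cpp, then fold it into
    # the vision prompt instead of a generic "describe everything".
    import subprocess

    def transcribe(wav_path: str) -> str:
        # whisper.cpp's CLI; -otxt writes the transcript next to the wav
        subprocess.run(
            ["./main", "-m", "models/ggml-base.en.bin", "-f", wav_path, "-otxt"],
            check=True,
        )
        with open(wav_path + ".txt") as f:
            return f.read().strip()

    query = transcribe("question.wav")  # e.g. "is there a crosswalk ahead?"
    prompt = f"USER: [img-10] {query}\nASSISTANT:"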


This is an awesome idea! Yeah, refining and asking more questions about the image could make it viable. The BakLLaVA model is pretty impressive, but I do need to tune the prompts and the hyperparams as well, or even do a small finetune. It's a fun space to dive into.


Awesome, love the idea. Maybe a default prompt like:

"these images come from the point of view of the person asking you questions"

so that instead of "a person holding something" you'd get "you are holding xyz".


This called my garage ‘fairly run-down’. Needs tuning.


The model or the garage?


I noticed that it "hallucinates" in the most direct sense of the word as the description goes on.


I've found lowering the temperature and disabling the repetition penalty can help [0]. My explanation is that the repetition penalty penalizes the end of sentences and sort of forces the generation to go on instead of stopping.

[0] https://old.reddit.com/r/LocalLLaMA/comments/17e855d/llamacp...
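
Concretely, the tweak amounts to two sampling parameters on the /completion request (the values here are illustrative, not tuned):

    # Illustrative sampling settings for the llama.cpp server:
    params = {
        "temperature": 0.2,     # less randomness as the description grows
        "repeat_penalty": 1.0,  # 1.0 disables the penalty, so sentence-ending
                                # tokens aren't pushed down and generation can stop
    }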


People's perceptions of how weathered or safe their surroundings are seem to vary extraordinarily.

Posting a (cropped?) picture would make this comment way more interesting.


Or…


Is there any solution that just identifies objects as accurately as possible? For instance, when it sees a t-shirt, can it identify the exact brand and model?


Yes, GPT Vision / ChatGPT is very accurate. CogVLM is also very powerful as a self-hosted solution.


Do you know how the model's accuracy compares to GPT-4 Vision?


Let me see if I can find existing benchmark results. My gut feeling is that GPT-4 is better overall... but the BakLLaVA model is incredibly small for how powerful it is.


Impressive!

Is the ChatGPT transcript from the free version (GPT-3.5) or the paid one (GPT-4)?


The purple icon in the shared transcript indicates GPT-4.


Watching the ChatGPT workflow is cool because it captures your intent along with the code change, without the need for a commit comment.


It’s been nothing short of “Jarvis”. You do have to ask things correctly to get correct answers, but it accelerates pretty much everything I do on a computer by a factor of 100.


Wait, an actual cool piece of tech on the HN front page?

What happened here 0_0

Cool piece of tech!! Keep on doing great work.



