Show HN: LLaVaVision: An AI "Be My Eyes"-like web app with a llama.cpp backend (github.com/lxe)
154 points by lxe on Nov 6, 2023 | 19 comments
A simple mobile web app inspired by Fuzzy-Search/realtime-bakllava that uses llama.cpp server backend with multimodal mode to describe and narrate what the phone camera sees.
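
For the curious, here's roughly what a single frame round-trip looks like. This is a minimal sketch, not the app's exact code: it assumes the llama.cpp server was started with BakLLaVA weights and the matching --mmproj projector, the frame path is a placeholder, and field names follow the server's /completion API from around this time:

    # Minimal sketch: send one camera frame to a llama.cpp server running
    # in multimodal mode. "frame.jpg" stands in for a captured frame.
    import base64
    import requests

    with open("frame.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    resp = requests.post(
        "http://localhost:8080/completion",
        json={
            # [img-10] marks where image id 10 is injected into the prompt
            "prompt": "USER: [img-10] Describe what you see.\nASSISTANT:",
            "image_data": [{"data": image_b64, "id": 10}],
            "n_predict": 128,
            "temperature": 0.2,
        },
    )
    print(resp.json()["content"])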

I built this thing in a few hours using a single ChatGPT thread to generate most things for me and iterate on this project. Here's the workflow: https://chat.openai.com/share/ea84ec69-5617-45e8-8772-ac2dcf...




A "Be My Eyes web app"? Well whatever. I guess Be My Eyes is just gonna be known for its Be My AI anyway. As soon as Llava is able to read text, this is gonna be pretty nice. I just wish it wasn't basically using Be My Eyes as a name to jump off of. This kind of local tech will be pretty amazing once it's fast enough to help with video games, or inaccessible apps. Add the ability for the AI to move the mouse in some of the GUI controller AI packages, and I could just say "click the next button and tell me what's on the screen afterwards" kind of thing.


This is cool! One thing that could easily be improved: the narrator still tries to describe everything, while people using Be My Eyes usually have a specific purpose (or "attention") in mind. It would be nice to also implement whisper.cpp and use spoken queries, with some CoT or prompting, to refine the description (rough sketch below).
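
A rough sketch of that flow, assuming whisper.cpp's stock CLI (the binary name, model path, and audio path are placeholders):

    # Sketch: transcribe a spoken query with whisper.cpp, then fold it into
    # the vision prompt instead of a generic "describe everything".
    import subprocess

    def transcribe(wav_path: str) -> str:
        # whisper.cpp's CLI; -otxt writes the transcript next to the wav
        subprocess.run(
            ["./main", "-m", "models/ggml-base.en.bin", "-f", wav_path, "-otxt"],
            check=True,
        )
        with open(wav_path + ".txt") as f:
            return f.read().strip()

    query = transcribe("question.wav")  # e.g. "is there a crosswalk ahead?"
    prompt = f"USER: [img-10] {query}\nASSISTANT:"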


This is an awesome idea! Yeah, refining and asking more questions about the image could make it viable. The BakLLaVA model is pretty impressive, but I do need to tune the prompts and the hyperparams as well, or even do a small finetune. It's a fun space to dive into.


Awesome, love the idea. Maybe a default prompt like:

"these images come from the point of view of the person asking you questions"

so that instead of "a person holding something" you'd get "you are holding xyz".


This called my garage ‘fairly run-down’. Needs tuning.


The model or the garage?


I noticed that it "hallucinates" in the most direct sense of the word as the description goes on.


I've found lowering the temperature and disabling the repetition penalty can help [0]. My explanation is that the repetition penalty penalizes the end of sentences and sort of forces the generation to go on instead of stopping.

[0] https://old.reddit.com/r/LocalLLaMA/comments/17e855d/llamacp...
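
Concretely, the tweak amounts to two sampling parameters on the /completion request (the values here are illustrative, not tuned):

    # Illustrative sampling settings for the llama.cpp server:
    params = {
        "temperature": 0.2,     # less randomness as the description grows
        "repeat_penalty": 1.0,  # 1.0 disables the penalty, so sentence-ending
                                # tokens aren't pushed down and generation can stop
    }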


People's perceptions of how weathered or safe their surroundings are seem to vary extraordinarily.

Posting a (cropped?) picture would make this comment way more interesting.


Or…


Is there any solution that just identifies objects as accurately as possible? For instance, when it sees a t-shirt, can it identify the exact brand and model?


Yes, GPT Vision / ChatGPT is very accurate. CogVLM is also very powerful as a self-hosted solution.


Do you know how the model's accuracy compares to GPT-4 Vision?


Let me see if I can find existing benchmark results. My gut feeling is that GPT-4 is better overall... but the BakLLaVA model is incredibly small for how powerful it is.


Impressive!

Is the ChatGPT transcript from the free version (GPT-3.5) or the paid one (GPT-4)?


The purple icon in the shared transcript indicates GPT-4.


Watching the ChatGPT workflow is cool because it captures your intent along with the code change, without the need for a commit comment.


It’s been nothing short of “Jarvis”. You do have to ask things correctly to get correct answers, but it accelerates pretty much everything I do on a computer by a factor of 100.


Wait, an actual cool piece of tech on the HN front page?

What happened here 0_0

Cool piece of tech!! Keep on doing great work.



