Cloud services, such as autoscaling EKS or AWS Batch, are mostly limited by GPU availability in a single region. That caps the scale of jobs that can run in a distributed fashion.
AI batch inference is one such example: this post found that by going beyond a single region, it is possible to speed up an important embedding-generation workload by 9x, thanks to the GPUs available in the "forgotten" regions.
This can significantly increase the iteration speed for building applications such as RAG and AI search. We share our experience launching a large number of batch inference jobs across the globe with the OSS project SkyPilot.
TL;DR: it speeds up embedding generation on an Amazon review dataset with 30M items by 9x and reduces the cost by 61%.
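For context, a minimal sketch of what launching one shard of such a batch job could look like with SkyPilot's Python API; the embedding script name, shard variable, and GPU choice here are placeholders, not taken from the post:

```python
import sky

# Hypothetical embedding-generation task: one shard of a large dataset.
task = sky.Task(
    setup='pip install -r requirements.txt',
    run='python generate_embeddings.py --shard $SHARD_ID',
    envs={'SHARD_ID': '0'},
)

# Request a GPU; SkyPilot handles finding and provisioning it.
task.set_resources(sky.Resources(accelerators='A100:1', use_spot=True))

# Launch this shard on its own cluster; repeat per shard to fan out the job.
sky.launch(task, cluster_name='embed-shard-0')
```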
Dealing with all the Kubernetes pod configs / deployments is too much for an AI engineer. Being able to focus on the real model work would be super important.
Finetuning can tailor the model with more customized knowledge, like the knowledge of its own identity shown in the blog post. If you ask the original LLaMA model, it knows nothing about SkyPilot or Vicuna, as it was trained on older data from the internet.
However, finetuning still cannot get rid of the hallucination problem that all chatbots suffer from.
It depends on how accurate you expect the chatbot to be. Retrieval might be considered more accurate, as it will not make up solutions; in the worst case it just returns an irrelevant answer.
Just want to add a point about hosting your own LLM vs. using ChatGPT. Cost is definitely a thing to consider, but it also depends on whether it is OK to share the requests to your product with OpenAI.
Also, something you cannot do with ChatGPT is customize it with your own data, such as internal documents. As shown in the blog, the model we trained ourselves can easily know its own identity.
SkyPilot is actually a tool that helps you find resources on any cloud, including AWS, GCP, Azure, IBM (coming soon), or even Lambda Cloud. It can automatically search for spot instances across all regions and clouds, based on availability and prices.
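As a rough illustration of that search behavior (the GPU type and script name are placeholders, not from this thread): no cloud or region is pinned in the resource request, so SkyPilot's optimizer picks the location with an available and cheapest spot instance.

```python
import sky

# Placeholder workload; the point is the unpinned resource request below.
task = sky.Task(run='python train.py')

# No cloud or region specified, and use_spot=True asks for spot instances:
# the optimizer searches the enabled clouds and regions and provisions
# wherever a V100 spot instance is currently available and cheapest.
task.set_resources(sky.Resources(accelerators='V100:1', use_spot=True))

sky.launch(task, cluster_name='spot-v100')
```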