I think the easiest and safest approach is to create a Docker image that can execute the code, display everything in an iframe, and pass data back and forth between the LLM client and the execution server. I haven't looked at Claude Artifacts, but I suspect that's how it works.
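Roughly, the client side could look something like this (a minimal sketch; EXECUTOR_ORIGIN, the /run path, and the message shapes are my own placeholders, not anything Claude documents):

    // Run untrusted code in a sandboxed iframe pointed at the Docker-backed
    // execution server, talking to it via postMessage.
    const EXECUTOR_ORIGIN = "https://executor.example.com"; // hypothetical host

    const frame = document.createElement("iframe");
    frame.sandbox.add("allow-scripts"); // no same-origin access to the chat app
    frame.src = `${EXECUTOR_ORIGIN}/run`;
    document.body.appendChild(frame);

    function runCode(code: string): void {
      // Ship the LLM-generated code to the executor for evaluation.
      frame.contentWindow?.postMessage({ type: "run", code }, EXECUTOR_ORIGIN);
    }

    window.addEventListener("message", (event: MessageEvent) => {
      if (event.origin !== EXECUTOR_ORIGIN) return; // drop anything else
      if (event.data?.type === "result") {
        console.log("execution output:", event.data.output);
      }
    });

The sandbox attribute plus the origin check on incoming messages is what keeps the executed code from reaching back into the chat app itself.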
To make a long story short, LLM responses can be manipulated in my chat app (I want this for testing/cost reasons), so it's not safe to trust the LLM-generated code. I guess I could refuse to execute any modified LLM responses.
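One way to do that would be to HMAC-sign each response as it comes back from the LLM and verify the tag before execution. A sketch under that assumption (Node; the names are mine):

    import { createHmac, timingSafeEqual } from "node:crypto";

    // Server-only key; a real deployment would load this from a secret store.
    const SECRET = process.env.RESPONSE_SIGNING_KEY ?? "dev-only-secret";

    function sign(response: string): string {
      return createHmac("sha256", SECRET).update(response).digest("hex");
    }

    // True only if the response text hasn't changed since it was signed.
    function isUnmodified(response: string, tag: string): boolean {
      const expected = Buffer.from(sign(response), "hex");
      const actual = Buffer.from(tag, "hex");
      return expected.length === actual.length && timingSafeEqual(expected, actual);
    }

Any response edited after the fact fails the check and simply never gets sent to the executor.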
However, if the chat app were designed for a single user, evaling wouldn't be an issue.