Hacker News: ycombiredd's comments

Hmm... My first thought is: great, now not only will HR/screening/hiring hand off the reading/discerning tasks to an ML model, they'll also outsource the things that require any sort of emotional understanding (compassion, stress, anxiety, social awkwardness, etc.) to a model.

One part of me tends to think "good, take some subjectivity away from a human with poor social skills", but another part of me is repulsed by the concept, because we see how otherwise capable humans will defer to an LLM out of a perception of "expertise" in the machine, or out of laziness (see the recent kerfuffles in the legal field over hallucinated citations, etc.).

Objective classification in CV is one thing, but subjective identification (psychology, pseudoscientific forensic sociology, etc.) via a multi-modal model triggers a sort of danger warning in me as an initial reaction.

Neat work, though, from a technical standpoint.


Appreciate the feedback, truly. It's an interesting concept to explore. Deferring human "expertise" to technology has been happening throughout the years (most definitely accelerated in recent times), and we have found ways to adapt to / abstract over the work being deferred, but the growing pains are probably most acute when such deferment happens rapidly, as in the case of AI.

Don't want this to turn into a Matt Damon in Elysium type of situation for sure with that scene with the parole officer hahah (which would stem from a poor integration of such subjective signals into existing workflows, more so than the availability of those signals)

For emotional intelligence, I personally see this as a prerequisite for any voice / language model that's interacting with humans: just as an autonomous car has to be able to identify a pothole, a voice / video agent has to be able to navigate a pothole in a conversation.


You've prompted an additional thought on the topic. I expressed a sense of dread at the inevitable use of this sort of tech in hiring pipelines (not by agents, necessarily; my initially envisioned use case was a sort of HUD overlay on a video call between humans). But I suppose that just as the AI interviewer bots that I thus far have refused to engage with will inevitably be unavoidable if one is on the job hunt, so will the use of this sort of multi-modal sentiment analysis be inevitable. (Same with the justice system use case you referenced in your metaphor, and therapists and such will probably follow as well.)

As such, I wish you the best of luck with this project - earnestly so - because if, as I suggest, it is inevitable... we want such a system to be as good as possible.

An aside: another inevitable use case just came to mind: cheap, shoddily implemented and poorly tested kids' toys with embedded AI (along with the insecure, surveillance-adjacent products that will proliferate), and the sardonically humorous privacy mishaps and unintended actions that come from such low-quality implementations (see: the LLM-enabled kids' toys currently popping up routinely at retailers). Ha! Sorry I keep taking your cool demo to dystopian extremes. :)

Oh, one more thing... Upon re-reading my previous comment, I recognize that the description of my visceral reaction as one of being "repulsed by the thought" could literally be read as me calling your system "repulsive", which was not my intent. I think your tech is cool, and was just trying to convey two conflicting feelings that occurred within me when thinking about the future commercial use cases. I hope your system works great so that if it does find market fit with such use cases, well... if it's inevitable - as the last few years of "LLMs everywhere!" have forced us all to adapt (accept or reject it, it still requires new effort) - we should hope for a good and working system, so I hope you succeed in making one.

Lastly, to your self-driving/potholes analogy... I do think that fits more in line with my "objective CV classification" category; I think a closer fit to what you're building would be "self-driving car having to handle the Trolley Car Problem", with the nuances of human value judgements etc.: does the car swerve into two adults vs. one child? And so on. Pothole classification is more objective, while driving into it, swerving to avoid it, classifying pedestrians and choosing one to possibly collide with, etc. are subjective and more complicated (as is your system and the functions it can perform).

Best of luck!


HR: 1187 at Hunterwasser.

Candidate: That's the hotel.

HR: What?

Candidate: Where I live.

HR: Nice place?

Candidate: Yeah, sure. I guess. Is that part of the test?

HR: No. Just warming you up, that's all.


"It's a test - designed to provoke an emotional response. "

I was going to follow this with something like "except the role of analyzing the emotional response is reversed", and then I wanted to expound with an "ooh but.. wait, there's another metaphor here since ..." but thought I've already potentially approached "spoiler alert" territory so I'll just stop there. Those who know the reference I am replying to will know; those who don't, well, don't google any of this or its parent cuz spoiler alert


I remember setting up CruiseControl when I was at a J2EE shop. That and Mantis, but I don't remember which was before which.

"Lawful Intercept".

Some may find this interesting https://www.fcc.gov/calea


It might be worth mentioning the concept of "stub resolver" and clarifying a bit that a nameserver is a resolver. That might be being pedantic, but thought it might be worth clarifying that the difference conceptually may just be what the particular dns server answering the query is authoritative for, if anything.

One other thing that might be worth a mention is the concept of the OS' resolver and "suffix search order", with an example of connecting (https, ping, ssh, whatever protocol) to a host using just the hostname, and the aforementioned mechanism that (probably) allows this to connect to the FQDN you want. (Also, now that I type that, do you mention "FQDN" at all? If not, maybe should.)

On that note, one final thought that occurs to me is the error/confound that may occur if a hostname is entered and is not resolved as-is, but does resolve with one of the domain suffixes attached on a retry (this can be particularly confusing when a typo is coupled with a wildcard A record in a domain, for example). I recognize that the lines that look like DNS records are not explicitly stated to be in the format of any particular DNS server software, and even if they were, they're snippets without larger context, so we don't know what the $ORIGIN for the zone might be. An adjacent concept you might want to explore, even if just for your own edification, is the effect of a terminating "." at the end of a hostname, either at resolution or configuration time.
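To make the suffix search and trailing-dot behavior concrete, here is a minimal sketch of how a stub resolver might expand a name into the candidates it tries, in order. It is loosely modelled on the `search` and `ndots` options of a glibc-style resolv.conf; the function name `candidate_names` and the example domains are made up for illustration, and other operating systems differ in the details.

```python
def candidate_names(name, search_domains, ndots=1):
    """Return the fully qualified names a stub resolver would try, in order."""
    # A trailing dot marks the name as already fully qualified:
    # no search suffixes are appended.
    if name.endswith("."):
        return [name]

    suffixed = [f"{name}.{domain.rstrip('.')}." for domain in search_domains]
    as_is = f"{name}."

    # Names with at least `ndots` dots are tried as-is first, then with
    # each suffix; shorter names get the suffixes first and the bare name last.
    if name.count(".") >= ndots:
        return [as_is] + suffixed
    return suffixed + [as_is]


# Hypothetical search list, as might come from a `search` line in resolv.conf.
search = ["corp.example.com", "example.com"]
print(candidate_names("web01", search))
print(candidate_names("web01.corp.example.com.", search))
```

With a list like this, a mistyped short name is tried against each suffix before the bare name, so a wildcard A record in one of the search domains can answer for the typo. Ending the name with "." skips the search list entirely.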

Just offering feedback that might help you add to the article.


I don't care if this is an advertisement for buildkite masquerading as a blog post or if this is just an honest rant. Either way, I gotta say it speaks a lot of truth.

Will absolutely confirm this was a (lovely) surprise for the team at BK to read, not an ad or commission or anything of the sort

Tangentially related, I once wanted to render a NetworkX DAG in ASCII, and created phart to do so.

There's an example of a fairly complicated graph of chess grandmaster PGN data, taken from a matplotlib example in the NetworkX documentation, among some more trivial output examples in the README at https://github.com/scottvr/phart/blob/main/README.md#example...

(You will need to expand the examples by tapping/clicking on the rightward-facing triangle under "Examples", so that it rotates to downward facing and the hidden content section is displayed)


Yes. This type of behavior was what I was referring to in an earlier comment mentioning flashbacks to seeing logs from named filled with "cannot have cname and other data", and slapping my forehead asking "who keeps doing this?", in the days when editing files by hand was the norm. And then, of course having repeats of this feeling as tools were built, automations became increasingly common, and large service providers "standardized" interfaces (ostensibly to ensure correctness) allowing or even encouraging creation of bad zone configurations.

The more things change, the more things stay the same. :-)


You just caused flashbacks of error messages from BIND of the sort "cannot have CNAME and other data", from this proximate cause, and of having to explain the problem many, many times. Confusion and ambiguous understanding have existed forever among people creating domain RRs, whether by editing files by hand or through the automated equivalents.

Related, the phrase "CNAME chains" causes vague memories of confusion surrounding the concepts of "CNAME" and casual usage of the term "alias". Without re-reading RFC1034 today, I recall that my understanding back in the day was that the "C" was for "canonical", and that the host record the CNAME itself resolved to must itself have an A record, and not be another CNAME, and I acknowledge the already discussed topic that my "must" is doing a lot of lifting there, since the RFC in question predates a normative language standard RFC itself.
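For anyone who has not hit that error, a rough sketch of the two checks being discussed: flagging owner names that hold a CNAME alongside other data, and following a CNAME chain. The record tuples and function names are hypothetical, this is not how BIND implements it, and it ignores the DNSSEC record types that are permitted next to a CNAME.

```python
from collections import defaultdict


def cname_conflicts(records):
    """Return owner names that hold a CNAME alongside any other record type."""
    types_by_owner = defaultdict(set)
    for owner, rtype, _value in records:
        types_by_owner[owner.lower()].add(rtype.upper())
    return sorted(
        owner
        for owner, types in types_by_owner.items()
        if "CNAME" in types and len(types) > 1
    )


def follow_cname(name, records, max_hops=8):
    """Follow CNAMEs starting at `name`; return the list of names visited."""
    cnames = {o.lower(): v.lower() for o, t, v in records if t.upper() == "CNAME"}
    chain = [name.lower()]
    while chain[-1] in cnames:
        if len(chain) > max_hops:
            raise ValueError("CNAME chain too long or looping")
        chain.append(cnames[chain[-1]])
    return chain


# Made-up zone data: the apex has both a CNAME and an MX, which is the conflict.
records = [
    ("www.example.com.", "CNAME", "web.example.com."),
    ("web.example.com.", "A", "192.0.2.10"),
    ("example.com.", "CNAME", "web.example.com."),
    ("example.com.", "MX", "10 mail.example.com."),
]
print(cname_conflicts(records))
print(follow_cname("www.example.com.", records))
```

A chain longer than two names means a CNAME pointing at another CNAME, which is the "chain" case described above.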

So, I don't remember exactly the initial point I was trying to get at with my second paragraph; maybe there have always been various failure modes due to varying interpretations, which have only compounded with age, new blood, non-standard language being used in self-serve DNS interfaces by providers, etc., which I suppose only strengthens the "ambiguity" claim. That doesn't excuse such a large critical service provider though, at all.


So, I posted this link. I actually did so assuming it had likely already been submitted, and I wanted to discuss this with people more qualified and educated in the subject than I. The authors of this paper are definitely more qualified to publish such a paper than I am; I'm not an ML scientist and I am not trying to pose as one. The paper made me feel a sort of way, and caused a bunch of questions to come to mind I didn't find answers to in the paper but, as I'm willing to suppose, maybe I'm not even qualified to read such a paper. I considered messaging the authors someplace like Twitter or in review/feedback on the arXiv submission (which I probably don't have access to do with my user anyway, but I digress). I decided that might make me seem like a hostile critic or, more likely, I'd just come off as an unqualified idiot.

So... HN came quickly to mind as a place where I can share a thought, considered opinion, ask questions, with potential to have them be answered by very smart and knowledgeable folks on a neutral ground. If you've made it this far into my comment, I already appreciate you. :)

Ok so... I've already disclaimed any authority, so I will get to my point and see what you guys can tell me. I read the paper (it is 80+ pages, so admittedly I skimmed some math, but also re-read some passages to feel more certain that I understood what they are saying).

I understand the phenomenon, and have no reason to doubt anything they put in the paper. But, as I mentioned, while reading it I had some intangible gut "feelings" that seeing that they have math to back what they're saying could not resolve for me. Maybe this is just because I don't understand the proofs. Still, I realized when I stopped reading it that it actually wasn't anything that they said; it was what, it seemed to my naive brain, was not said, and I felt like it should have been.

I'll try to get to the point. I completely buy that reframing prompts can reduce mode collapse. But, as I understand it, the chat interface in front of the backend API of any LLM tested does not have insight into logits, probs, etc. The parameters passed by the prompt request, and the probabilities returned with the generations (if asked for by the API request) do not leak, are not provided in the chat conversation context in any way, so that when you prompt an LLM to return a probability, it's responding with, essentially, the language about probabilities it learned during its training, and it seems rather unlikely that many training datasets contain actual factual information about their own contents' distributions for the model during training or RLHF to "learn" any useful probabilistic information about its own training data.

So, a part of the paper I re-read more than once says at one point (in 4.2): "Our method is training-free, model-agnostic, and requires no logit access." This statement is unequivocally obviously true and honest, but - and I'm not trying to be rude or mean, I just feel like there is something subtle I'm missing or misunderstanding - because, said another way, that statement could also be true and honest if it said "Our method has no logit access, because the chat interface isn't designed that way", and here's what immediately follows then in my mind, which is "the model learned how humans write about probabilities and will output a number that may be near to (or far away from) the actual prob of the token/word/sentence/whathaveyou, and we observed that if you prompt the model in a way that causes it to output a number that looks like a probability (some digits, a decimal somewhere), along with the requested five jokes, it has an effect on the 'creativity' of the list of five jokes it gives you."

So, naturally, one wonders what, if any, actual correlation there is between the numbers the LLM generates as "hallucinated" (I'm not trying to use the word in a loaded way; it's just the term that everyone understands for this meaning, with no sentiment behind my usage here) probabilities for the jokes it generated, and the actual probabilities thereof. I did see that they measured empirical frequencies of generated answers across runs and compared that empirical histogram to a proxy pretraining distribution, and that they acknowledge that they did no comparison or correlation of the "probabilities" output by the model, and they clearly state it. So without continuing to belabor that point, this is probably core to my confusion about the framing of what the paper says that the phenomenon indicates.

It is hard for me to stop asking all the slight variations on these questions that led me to write this, but I will stop, and try to get to a TL;DR I think dear HN readers may appreciate more than my exposition of befuddlement bordering on dubiousness:

I guess the TLDR of my comment is that I am curious if the authors examined any relationship between the LLM verbalized "probabilities" and actual model sampling likelihoods (logprobs or selection frequency). I am not convinced that the verbalized "probabilities" themselves are doing any work other than functioning as token noise or prompt reframing.

I didn't see a control for, or even a comparison to/against multi-slot prompts with arbitrary labels or non-semantic "decorative" annotation. In my experience poking and prodding LLMs as a user, desiring to influence generations in specific and sometimes unknown ways, even lightweight slotting without probability language substantially reduces repetition, which makes me wonder how much of the gain from VS is attributable to task reframing, as opposed to the probability verbalization itself.
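To be concrete about the comparison I mean, here is a minimal sketch, not the paper's method: take the "probabilities" the model verbalizes for a set of answers, take the empirical frequency of those answers over repeated sampling, and measure the gap with total variation distance. All names and numbers here are made up for illustration; running the same measurement on a prompt with arbitrary labels in place of probabilities would be the control.

```python
from collections import Counter


def empirical_distribution(samples):
    """Relative frequency of each distinct answer across repeated runs."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {item: n / total for item, n in counts.items()}


def normalize(weights):
    """Rescale verbalized 'probabilities' so they sum to 1."""
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 = identical)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical data: numbers the model wrote next to three jokes, and which
# joke actually came out in ten independent single-answer samples.
verbalized = normalize({"joke_a": 0.5, "joke_b": 0.3, "joke_c": 0.2})
sampled = ["joke_a"] * 8 + ["joke_b"] + ["joke_c"]

distance = total_variation(verbalized, empirical_distribution(sampled))
print(f"TV distance between verbalized and empirical: {distance:.2f}")
```

A distance near 0 would suggest the verbalized numbers track what the model actually samples; a large one would suggest they are mostly decoration.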

This may not even be a topic of interest for anyone, and maybe nobody will even see my comment/questions, so I'll stop for now... but if anyone has insights, clarifications, or can point out where I'm being dense, I actually have quite a bit more to say and ask about this paper.

I can't really explain why I just had to see if I could get another insightful opinion on this paper. I usually don't have such a strong reaction when reading academic papers I may not fully understand, but there's some gap in my knowledge (or, less likely, there's something off about the framing of the phenomenon described), and it's causing me to really hope for discussion, so I can ask my perhaps even less-qualified questions pertaining to what boils down to mostly just my intuition (or maybe incomprehension. Heh.)

Thanks so much if you've read this and even more if you can talk to me about what I've used too many words to try to convey here.


Hello! I'm one of the main authors of the paper. Thanks for engaging with our work so thoughtfully – that's a very clear and valid question.

We didn't get around to addressing this within the paper itself – 80 pages is a lot, and deadlines, etc. But I have unpublished experiments that show that in a reasonably broad setting I'm doing some work in, verbalized probabilities are restoring a distribution that looks almost identical to the base distribution. It is not possible to demonstrate this on frontier models, since their public models are already mode-collapsed, and they don't share the base model or logprobs anyway. But I've established this to my personal satisfaction on large local models which offer base / post-trained pairs.

To share some intuition on why one might believe this is occurring: there are a bunch of tasks implicit in the pre-training corpus that encourage the model to learn this capability. Consider sentences in news and research articles like: "Scientists discover that [doing something] increases [some outcome] on [some population] by X%". It seems quite natural that the model might learn a pathway by which it can translate its base probabilities into the equivalent numeric tokens in order to "beat" the task of reducing loss on the "X%" prediction. I can even almost visualize how this works mechanically in terms of what the upper layers of an MLP would do to learn this, i.e. translating from weights into specific token slots. And this is almost certainly more parameter-efficient than constructing an entire separate emulated reality for filling in X. Although I'm not ruling out that the latter might still be happening – perhaps some future interp research might be able to validate this!

I'm actually working on a paper that packs up some of the above findings in passing. But if helpful in the meantime, this is also building on related work by Tian et al. 2023, "Just Ask for Calibration" [1] and Meister et al. 2024, "Benchmarking Distributional Alignment of LLMs" [2], that give some extra confidence here. Their findings indicate that whether or not verbalized probabilities are rooted in the model's base probabilities, they seem to be useful for the purposes that people care about. (Oh, and you can probably set up an experiment to verify this independently with vLLM in a few Claude Code requests!)

Hope that was helpful – feel free to ping with follow-ups! (Although replies might be a little delayed, I happened to see this at a good time; having quite a crunchy week)

[1] https://arxiv.org/abs/2305.14975

[2] https://arxiv.org/abs/2411.05403


Maybe I am missing something or am just naive, but isn't it fairly common for social media accounts of well-known figures to be taken over (hacked/phished/whatever) for the purpose of shilling some crypto scam? Launching a memecoin and then very quickly (30 min later, apparently) rugpulling seems at least as likely to fit that type of scam as one where the public figure themselves is actually behind it.

Not making a claim as to what is actually true, just positing explanations. Heck, maybe plot twist: it is actually Eric Adams behind it, but the "account takeover" possibility was planned to serve as plausible deniability.

You know... like "an actor that's playing a dude, disguised as another dude" type thing.



Just pointing out, this clip could have been done with AI just as well.


Yeah but I doubt it. These people have PR teams and could have easily released a statement if this was fake.


Yeah, just following up to my grandparent comment to say "wow. Holy shit. It is how it looks." I'm not sure why I was surprised; maybe I'm an optimist, or as I suggested in my first comment, a bit naive.

In my defense, I don't think I'm stupid; I just don't want to believe so many people in power are cartoonishly evil, so I tend to look for explanations that don't require it. I think my internal sense of the world wants there to be a distinction between, say, average cryptoscammer evil buffoonery and the people in positions where at least ostensibly they try to present as a good guy while trying to keep their evildoings secret. This story gives me some sort of cognitive dissonance, and while reflecting on that fact, I get a bit sad. This world is bonkers.

