UglyToad · 2025-08-04T14:21:01

Yes, this is generally the fallback approach if finding the objects via the index (xref) fails. It is slightly slower but it's a one-time cost, though I imagine it was a lot slower on the machines of the time when PDFs were first used.
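To make the fallback concrete, here is a rough sketch of a brute-force scan that ignores the xref entirely and looks for `N G obj` headers in the raw bytes. The function name and regex are illustrative assumptions, not PdfPig's implementation, and a real parser would also need to skip false matches inside stream data and to unpack object streams.

```python
import re

# Matches "<object number> <generation> obj" at a token boundary.
OBJ_HEADER = re.compile(rb"(?<![0-9A-Za-z])(\d+)\s+(\d+)\s+obj\b")


def brute_force_object_offsets(data: bytes) -> dict[tuple[int, int], int]:
    """Map (object number, generation) to the byte offset of its header.

    Hypothetical helper: scans the whole file once instead of trusting xref.
    """
    offsets: dict[tuple[int, int], int] = {}
    for match in OBJ_HEADER.finditer(data):
        key = (int(match.group(1)), int(match.group(2)))
        # A later definition overwrites an earlier one, since incremental
        # updates append newer versions of an object towards the end of the file.
        offsets[key] = match.start()
    return offsets
```

The whole file is read exactly once, which is why the cost is paid a single time per document rather than per object lookup.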

UglyToad · 2025-08-04T00:08:40

If you don't have a known set of PDF producers this is really the only way to safely consume PDF content. Type 3 fonts alone make pulling text content out unreliable or impossible, before even getting to PDFs containing images of scans.

I expect the current LLMs significantly improve upon the previous ways of doing this, e.g. Tesseract, when given an image input? Is there any test you're aware of for model capabilities when it comes to ingesting PDFs?

simonw · 2025-08-04T00:31:21

I've been trying it informally and it's getting really good: Claude 4 and Gemini 2.5 seem to do a perfect job now, though I'm still paranoid that some rogue instruction in the scanned text (accidental or deliberate) might result in an inaccurate result.

UglyToad · 2025-08-04T00:03:17

You're right, this was a fairly common failure state seen in the sample set. The previous reference, or one in the reference chain, would point to an offset of 0 or outside the bounds of the file, or just be plain wrong.
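As a rough illustration of that failure state, a parser can sanity-check each offset before following it. This is a minimal sketch under my own assumptions (the function name and the 32-byte window are made up for the example), not the check PdfPig actually performs.

```python
import re

# An xref stream starts with an ordinary object header, e.g. "12 0 obj".
OBJECT_HEADER = re.compile(rb"\d+\s+\d+\s+obj")
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def is_plausible_xref_offset(data: bytes, offset: int) -> bool:
    """Return True if `offset` could be the start of an xref table or stream."""
    # Offset 0 is where the %PDF header lives, so it cannot hold an xref;
    # anything at or past the end of the file is out of bounds.
    if offset <= 0 or offset >= len(data):
        return False
    # Tolerate a little leading whitespace, then look at what is there.
    window = data[offset:offset + 32].lstrip(PDF_WHITESPACE)
    return window.startswith(b"xref") or OBJECT_HEADER.match(window) is not None
```

If any offset in the chain fails a check like this, that is the point where falling back to a full scan makes sense.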

What prompted this post was trying to rewrite the initial parse logic for my project PdfPig[0]. I had originally ported the Java PDFBox code but felt like it should be 'simple' to rewrite more performantly. The new logic falls back to a brute-force scan of the entire file if a single xref table or stream is missed, and the recovery path relies only on the offsets found by that scan.

However, it is considerably slower than the code before it and it's hard to have confidence in the changes. I'm currently running through a 10,000-file test set trying to identify edge cases.

[0]: https://github.com/UglyToad/PdfPig/pull/1102

farkin88 · 2025-08-04T00:32:19

That robustness-vs-throughput trade-off is such a staple of PDF parsing. My guess is that the new path is slower because the recovery scan now always walks the whole byte range and has to inflate any object streams it meets before it can trust the offsets, even when the first startxref would have been fine.

The 10k-file test set sounds great for confidence-building. Are the failures clustering around certain producer apps like Word, InDesign, scanners, etc.? Or is it just long-tail randomness?

Reading the PR, I like the recovery-first mindset. If the common real-world case is that offsets lie, treating salvage as the default is arguably the most spec-conformant thing you can do. Slow-and-correct beats fast-and-brittle for PDFs any day.

UglyToad · 2025-08-03T23:23:11

Yes, you're right, there are Linearized PDFs, which are organized to enable parsing and display of the first page(s) without having to download the full file. I skipped those from the summary for now because they have a whole chunk of an appendix to themselves.

UglyToad · on Nov 23, 2024

FWIW they are acting; these things just take a while. The current phase of gathering comments ends December 2nd: https://www.fdic.gov/news/press-releases/2024/fdic-proposes-...

UglyToad · on Oct 29, 2024

The point, which seems to be routinely massively downvoted on here, is that both things can be true at once:

- these drugs are good and a paradigm shift in the treatment of obesity (and have other benefits)

- we must not lose sight of the need to address a thoroughly sick food industry that leaves so many people needing to use these. Junk food advertising, lack of subsidies for fresh vegetables, HFCS, food deserts, etc.

Chile is experimenting with banning junk food ads to children and is seeing some early behaviour changes.

The point which people seem to be wilfully missing is that we can have both these drugs and advocate for cracking down on a food system that deliberately poisons everyone in society. Having everyone be on this drug because we shrug and say "free market innit" while big corps continue to feed us crap is not a solution, obviously.

Sakos · on Oct 29, 2024

"Fixing" the food industry isn't possible for as long as they have billions to sink into influencing politics. Trying to find a market or political solution has failed. Full stop. The fact that you're still trying to find some way to make it work is embarrassing and depressing. It's time to attack the problem from another direction, one that will also ensure these companies either go bankrupt, lose relevance and power and/or evolve into a form that's less parasitic and more beneficial to us as a species. GLP-1 can be one tool to help us do that.

SpicyLemonZest · on Oct 29, 2024

We can only crack down on a "food system that deliberately poisons everyone in society" if such a system actually exists.

* Food deserts are a problem, but the vast majority of Americans don't live in one. We just don't typically want to eat a pile of fresh veggies when there's other options available.

* Criticisms of HFCS are, as far as I can tell, entirely viral misinformation - not once have I seen someone point to concrete evidence that HFCS is worse than table sugar.

It seems to me that this entire idea of a poisonous food system is an epicycle to avoid the obvious conclusion, that our bodies are calibrated on average to eat ourselves into obesity when we have the means to do so. If you don't start from the premise that there must be an external reason we're getting heavier, it's very hard to explain why potato chips should be any more unhealthy than a traditional breakfast of potatoes and bacon.

astrange · on Oct 29, 2024

IIRC food deserts are a demand issue, not supply. The reason healthy food doesn't exist in those neighborhoods is that the stores selling it closed because people didn't go there.

SpicyLemonZest · on Oct 29, 2024

I've heard that too, but even if true it's still a problem for the minority of people in the area who would have liked to get fresh veggies and such.

UglyToad · on July 26, 2024

I've been experimenting with this. It makes testing trivial and removes the coupling that inevitably occurs with multi-method interfaces.

However, I think there's one missing enhancement, which the language will never get, that would turn it from esoteric and difficult to reason about into something actually usable.

This is being able to indicate a method implements a delegate so that compilation errors and finding references work much more easily.

E.g. suppose you have:

    delegate Task<string> GetEntityName(int id);

    public async Task<string> MyEntityNameImpl(int id)

I'd love to be able to mark the method:

    public async Task<string> MyEntityNameImpl(int id) : GetEntityName

This could just be removed on compile but it would make the tooling experience much better in my view when you control the delegate implementations and definitions.

jayd16 · on July 26, 2024

If you want to enforce things, use an interface. If you want to accept anything that fits use a delegate.

I'm not sure I understand your use case where you need to conflate the two. You want to enforce the contract but with arbitrary method names?

I suppose you could wire up something like this but it's a bit convoluted.

    interface IFoo {
        string F(string s);
    }

    class Bar {
        public string B(string s) {
            return "";
        }
    }

    // internal class, perhaps in your test framework
    class BarContract : Bar, IFoo {
        public string F(string s) => B(s);
    }

UglyToad · on July 26, 2024

My aim is to use dependency injection to inject the minimal dependency and nothing more, versus the grab bag that every interface in a medium-complexity C# project eventually devolves into.

I've had this on my blogpost-to-write backlog for a year at this point, but in every project I've worked on an interface eventually becomes a holding zone for related but disparate concepts. So when you inject the whole interface, it becomes unclear what the dependency actually is.

E.g. you have some service that does data access for users, then someone adds some Salesforce stuff, or a notification call or whatever. Now any class consuming that service could be doing a bunch of different things.

The idea is basically single-method interfaces without the overhead of writing the interface. Just being able to pass around free functions but with the superior DevX most C# tools offer.

I guess I want a more functional C# without having to learn F#, which I've tried a few times and bounced off.

neonsunset · on July 26, 2024

If anything, there is little reason to use a named delegate over a Func nowadays too. The contract in this case is implied by you explicitly calling a constructor or a factory method, so the kind of type confusion that Go has cannot happen.

UglyToad · on July 26, 2024

The idea with the named delegate is to give the dependency a descriptive name, for example:

    delegate Task<string> GetUserEmail(int userId);

This provides more guidance than taking in a:

    Func<int, Task<string>> getUserEmail

If you can annotate implementations of the delegate the tooling support becomes even nicer. Not all Funcs with the same shape have the same semantics, in my ideal C#-like language.

Edit: I completely forgot the main reason, which is that a DI container can inject the named delegate for you correctly in the constructor, versus only being able to register a single Func shape per container.

UglyToad · on July 23, 2024

Having built recurring stuff in the past (date based with no time component, luckily for me) I think you gain a lot of usability by generating a row for each occurrence of the event.

Inevitably the user will come back and say "oh, I want it monthly except this specific instance" or, if it's a time-based event, "this specific one should be half an hour later". You could just store the exceptions to the rule as their own data structure, but then you need to correlate each exception to the scheduler 'tick'. And if they can edit the schedule, well, you're S.O.O.L either way, but I think having concrete occurrences is potentially easier to recover from.
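A minimal sketch of what "a row for each occurrence" could look like, assuming a date-only monthly rule. The `Occurrence` fields and `generate_monthly` helper are hypothetical names for illustration, and clamping to the last day of short months is my own choice rather than anything from the system described above.

```python
import calendar
from dataclasses import dataclass
from datetime import date


@dataclass
class Occurrence:
    series_id: int
    scheduled_for: date  # the date the rule originally produced
    actual_date: date    # what the user sees; editable per occurrence
    cancelled: bool = False


def generate_monthly(series_id: int, start: date, count: int) -> list[Occurrence]:
    """Materialize `count` monthly occurrences as individual rows."""
    rows = []
    for i in range(count):
        months = start.month - 1 + i
        year = start.year + months // 12
        month = months % 12 + 1
        # Clamp to the last day of the month, e.g. Jan 31 -> Feb 29 in a leap year.
        day = min(start.day, calendar.monthrange(year, month)[1])
        scheduled = date(year, month, day)
        rows.append(Occurrence(series_id, scheduled, scheduled))
    return rows
```

With rows like these, "monthly except this specific instance" becomes an update to one row's `actual_date` or `cancelled` flag, and `scheduled_for` still records which tick of the rule the row came from.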

UglyToad · on April 10, 2024

But the problem is the accounting jargon is counter (contra?) to the layman's gut understanding.

If I get credited or I use a credit card, money came from nowhere, woohoo. If I have a debit, well, that sounds like debt and my money decreased, boo.

I get that there's actually a good reason for the names, but a field that doggedly sticks to non-intuitive jargon that runs counter to every usage outsiders have encountered could do with some different, non-overloaded terms.

UglyToad · on March 23, 2024

Sounds sort of like the Citroen Ami? Slightly below your ideal range and quite expensive too but the same concept.