Hacker News | MrTravisB's comments

Regarding the browser instances: While VM boot times have definitely improved, accessing a site through a full browser render isn't always the most efficient way to retrieve information. Our goal is to get the most up-to-date information as fast as possible.

For example, something we may consider for the future is balancing when to implement direct API access versus browser rendering. If a website offers the same information via an API, that would almost always be faster and lighter than spinning up a headless browser, regardless of how fast the VM boots. While we don't support that hybrid approach yet, it illustrates why we are optimizing for the best tool for the job rather than just defaulting to a full browser every time.
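As a rough illustration of that kind of routing decision (not something we ship today), here is a minimal sketch. The registry `KNOWN_API_ENDPOINTS`, the function `choose_strategy`, and the example hosts are all hypothetical names made up for this example.

```python
from urllib.parse import urlparse

# Hypothetical registry: hosts assumed to expose the same data through an API.
KNOWN_API_ENDPOINTS = {
    "example.com": "https://api.example.com/v1/pages",
}


def choose_strategy(url: str) -> tuple[str, str]:
    """Return (strategy, target): the API when one is registered, else a browser render."""
    host = urlparse(url).hostname or ""
    api_endpoint = KNOWN_API_ENDPOINTS.get(host)
    if api_endpoint is not None:
        # Lighter path: no headless browser needed.
        return ("api", api_endpoint)
    # Fallback: render the page in a full browser.
    return ("browser", url)
```

A real version would also have to confirm that the API actually returns the same information as the rendered page, which this sketch simply assumes.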

Regarding robots.txt: We agree. Not all potential customers are going to want a service that respects robots.txt or other content-owner-friendly policies. As I alluded to in another comment, we have a difficult task ahead of us to do our best by both the content owners and the developers trying to access that content.

As part of Mozilla, we have certain values that we work by and will remain true to. If that ultimately means some number of potential customers choose a competitor, that is a trade-off we are comfortable with.


thank you so much, great to hear the thinking behind these considerations :)

This is a valid perspective. Since this is an emerging space, we are still figuring out how to show up in a healthy way for the open web.

We recognize that the balance between content owners and the users or developers accessing that content is delicate. Because of that, our initial stance is to default to respecting websites as much as possible.

That said, to be clear on our implementation: we currently only respond to explicit blocks directed at the Tabstack user agent. You can read more about how this works here: https://docs.tabstack.ai/trust/controlling-access
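To make "explicit blocks only" concrete, here is a simplified sketch of the idea, not the actual implementation. It assumes the user agent token is `tabstack` (check the linked docs for the real one), treats `Disallow` values as plain path prefixes, and skips `Allow` rules and path wildcards. The function name `explicitly_blocked` is made up for illustration.

```python
def explicitly_blocked(robots_txt: str, path: str, agent: str = "tabstack") -> bool:
    """Return True only if a group that names `agent` disallows `path`.

    Wildcard (`User-agent: *`) groups are deliberately ignored.
    """
    agents: list[str] = []            # user agents named by the current group
    in_rules = False                  # True once the current group has rule lines
    blocked_prefixes: list[str] = []  # Disallow prefixes addressed to `agent`

    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:
                # A user-agent line after rule lines starts a new group.
                agents, in_rules = [], False
            agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            # An empty Disallow value means "nothing is blocked", so skip it.
            if field == "disallow" and value and agent in agents:
                blocked_prefixes.append(value)

    return any(path.startswith(prefix) for prefix in blocked_prefixes)
```

With this logic, a site that only has `User-agent: *` / `Disallow: /` is not treated as blocking, while a group that names the agent directly is honored.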


This tension is so close to a fundamental question we’re all dealing with, I think: “Who is the web for? Humans or machines?”

Too often, people fall completely on one side of this question or the other. It's really complicated and deserves a lot of nuance. I think it mostly comes down to having a right to exert control over how our data is used, and most of that is currently shaped by Section 230.

Generally speaking, platforms consider data to be owned by the platform. GDPR and CCPA/CPRA try to be the counter to that, but they are also too crude as tools.

Let’s take an example: Reddit. Let’s say a user is asking for help and I post a solution that I’m proud of. In that act, I’m generally expecting to help the original person who asked the question, and since I’m aware that the post is public, I’m expecting it to help whoever comes next with the same question.

Now (correct me if I’m wrong, but) GDPR considers my public post to be my data. I’m allowed to request that Reddit return it to me or remove it from the website. But then with Reddit’s recent API policies, that data is also Reddit’s product. They’re selling access to it for … whatever purposes they outline in the use policy there. That’s pretty far outside what a user is thinking when they post on Reddit. And there’s the other side of it as well: was my answer used to train a model that benefits from my writing and converts it into money for a model maker (to name just one example)?

I think ultimately, platforms have too much control, and users have too little specificity in declaring who should be allowed to use their content and for what purposes.


Thanks for the feedback. We are definitely not trying to hide it. We actually do have pricing listed in the API section regarding the different operations, but we could definitely work on making this clearer and easier to parse.

We are simply in an early stage and still finalizing our long-term subscription tiers. Currently, we use a simple credit model which is $1 per 10,000 credits. However, every account receives 50,000 credits for free every month ($5 value). We will have a dedicated public pricing page up as soon as our monthly plans are finalized.
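As a quick worked example of that arithmetic, here is a small sketch. It assumes the free credits are consumed first and that usage above them is billed linearly per credit, which is an assumption for illustration rather than a statement of the billing rules. The name `monthly_cost_usd` is made up for this example.

```python
from decimal import Decimal

CREDITS_PER_DOLLAR = 10_000      # $1 per 10,000 credits
FREE_CREDITS_PER_MONTH = 50_000  # monthly free allowance ($5 value)


def monthly_cost_usd(credits_used: int) -> Decimal:
    """Estimate the monthly bill, assuming free credits are used up first."""
    billable = max(0, credits_used - FREE_CREDITS_PER_MONTH)
    return Decimal(billable) / Decimal(CREDITS_PER_DOLLAR)
```

Under those assumptions, 120,000 credits in a month leaves 70,000 billable credits, or $7, and anything at or below 50,000 credits costs nothing.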

Regarding semantic data, our JSON extraction endpoint is designed to extract any data on the page. That said, we would love to know your specific use cases for those ontologies to see if we can further improve our support for them.

