Tools

Miasma: Developers Build 'Endless Poison Pit' to Trap and Waste AI Web Scrapers

An open-source tool named Miasma generates an endless stream of procedurally generated pages of plausible-looking but meaningless content, designed to trap AI web crawlers in a loop that drains the compute, time, and API credits of scraping operations targeting websites without consent.

D.O.T.S AI Newsroom

AI News Desk

3 min read
A new open-source project named Miasma, released this week on GitHub by developer Austin Weeks, takes an adversarial approach to the AI data scraping problem: rather than blocking crawlers, it traps them.

The concept is a honeypot at scale. When Miasma detects a request pattern consistent with AI training data collection — high-velocity crawling, unusual user-agent strings, the signature behaviour of bulk scrapers — it serves the crawler a procedurally generated page that appears coherent but contains no meaningful information. Each page links to further procedurally generated pages, creating a graph of plausible-looking content that a crawler can traverse indefinitely without extracting anything of value.

The Mechanism

Miasma's approach is technically elegant in its simplicity. Rather than maintaining a blocklist of known scrapers — a perpetual arms race — it focuses on behavioural detection. Legitimate browsers exhibit characteristic interaction patterns: they load resources selectively, render JavaScript, follow user navigation flows. Automated scrapers optimise for throughput: they request pages sequentially, skip resource-intensive rendering, and often ignore standard crawl-delay signals.
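Miasma's actual detection code is not reproduced in this article; as a rough sketch, the behavioural signals described above (request velocity, bot-like user agents, and pages fetched without their sub-resources) could be combined like this, with all thresholds, names, and user-agent markers purely illustrative:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10
MAX_REQUESTS_PER_WINDOW = 30  # illustrative high-velocity threshold
BOT_UA_MARKERS = ("python-requests", "scrapy", "curl", "gptbot")  # illustrative

_history = defaultdict(deque)  # client IP -> recent request timestamps


def looks_like_scraper(ip: str, user_agent: str, loaded_assets: bool) -> bool:
    """Flag sessions whose behaviour matches bulk scraping rather than
    human browsing: bot-like user agents, high request velocity, and
    HTML fetched without the CSS/JS/image loads a real browser makes."""
    now = time.monotonic()
    hits = _history[ip]
    hits.append(now)
    # Keep only requests inside the sliding window.
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()

    if any(marker in user_agent.lower() for marker in BOT_UA_MARKERS):
        return True
    if len(hits) > MAX_REQUESTS_PER_WINDOW:
        return True
    # Real browsers request sub-resources alongside pages;
    # throughput-optimised scrapers typically skip them.
    if len(hits) > 5 and not loaded_assets:
        return True
    return False
```

Behavioural checks like these avoid the blocklist arms race the article mentions, since they key on how a client acts rather than who it claims to be.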

When the system identifies a likely scraper, it redirects the session to the poison pit — a dynamically generated namespace of URLs that the scraper believes are real pages. The content generator produces text that is grammatically plausible and topically adjacent to the real site's content, but semantically empty. Trained on this content, a language model would learn nothing useful — and might actively be degraded by the noise.
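The project's real generator is likewise not shown here, but the core trick, deterministic seed-per-URL content whose links all point back into the trap namespace, can be sketched as follows; the `/pit/` path and word pool are invented for illustration, and a real deployment would mirror the host site's vocabulary:

```python
import hashlib
import random

# Hypothetical word pool; real pages would borrow the site's own terms
# so the content reads as topically adjacent.
WORDS = ("signal", "framework", "latency", "pipeline", "vector",
         "alignment", "throughput", "schema", "inference", "cache")


def poison_page(path: str, links_per_page: int = 8) -> str:
    """Deterministically generate a plausible-but-empty HTML page for
    any path under the trap namespace. Every link targets another
    generated path, so a crawler can walk the graph indefinitely."""
    # Seed from the path so repeated fetches return identical content,
    # making the trap look like a stable, real site.
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    rng = random.Random(seed)

    paragraphs = [
        " ".join(rng.choice(WORDS) for _ in range(40)).capitalize() + "."
        for _ in range(5)
    ]
    links = [
        f'<a href="/pit/{rng.getrandbits(32):08x}">{rng.choice(WORDS)}</a>'
        for _ in range(links_per_page)
    ]
    return "<html><body>{}{}</body></html>".format(
        "".join(f"<p>{p}</p>" for p in paragraphs), " ".join(links)
    )
```

Because generation is deterministic and on demand, the server stores nothing: the "infinite site" costs the defender almost no resources while the scraper pays full price for every page.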

The Broader Context

Miasma arrives as the question of AI training data consent has moved from academic discussion to active litigation. The New York Times, several major news organisations, and numerous individual creators have filed suits arguing that AI companies scraped their content without licence. The legal outcomes remain uncertain, but the effect has been to push developers and site owners toward technical countermeasures rather than waiting for legal resolution.

Previous countermeasures — robots.txt extensions like AI-specific disallow rules, services like Cloudflare's AI scraper blocking — operate on the same principle: identify and block. Miasma's contribution is the active deception layer. By making the scraper believe it is successfully collecting data, the tool wastes the scraper's resources rather than simply diverting them elsewhere.

The project has generated significant interest in developer communities, accumulating several hundred GitHub stars within 24 hours of publication. Whether it proves effective against the most sophisticated scraping operations — which increasingly use residential proxy networks and browser automation to mimic legitimate traffic — remains to be seen. But it represents a meaningful escalation in the technical toolkit available to site owners on the consent side of the AI training data debate.

Related Stories

Astropad's Workbench Turns a Mac Mini Into an AI Agent Server You Control From Your Phone
Tools

Astropad, the company behind the Luna Display hardware that lets iPads function as Mac monitors, has built a new product for a new era: Workbench lets users remotely monitor and control AI agents running on Mac Minis from an iPhone or iPad. It is remote desktop software reimagined not for IT support but for the AI agent operator — the person who needs to check on autonomous workflows without being at their desk.

D.O.T.S AI Newsroom
Microsoft's Bing Team Open-Sources Harrier, a Multilingual Embedding Model That Tops the MTEB v2 Benchmark
Tools

Microsoft's Bing search team has released Harrier as an open-source embedding model, and it tops the multilingual MTEB v2 benchmark while supporting over 100 languages. The release is significant not just for the benchmark numbers but for the source: a search team that has spent decades optimizing retrieval systems has built an embedding model for the exact use case — semantic search and retrieval — that underpins most production RAG applications.

D.O.T.S AI Newsroom
Stability AI Pivots to Enterprise With Brand Studio — a Platform for Brand-Consistent AI Image Generation
Tools

Stability AI, the company that made open-source image generation mainstream with Stable Diffusion, is repositioning for enterprise with Brand Studio. The platform lets creative teams train brand-specific image models, automate visual production workflows, and route tasks to the best-suited AI model — a commercial play from a company that built its name on open access.

D.O.T.S AI Newsroom