Sceliphron is a TLD-finding and technical-survey spider. It visits only the root page of a domain, and follows links only to external domain names it has not yet crawled. This produces a wide, shallow mapping of the World Wide Web that ignores most deep web content.
Besides shallow mapping, Sceliphron is designed to identify and enumerate web technologies used on websites it crawls. It uses this information in aggregate to identify trends in software adoption, ranging from choice of webserver host to use of JS libraries and trackers. In the future, some of these statistics will be released publicly; others will be made available privately for a minimal cost.
Sceliphron does not scrape content or use AI to analyze content. It is designed to be a well-behaved crawler, and will obey directives in the /robots.txt file of all domains it visits. See the For Webmasters section for more information on how to disallow or block this crawler.
Sceliphron is named after Sceliphron caementarium, also called the Yellow-Legged Mud Dauber. These solitary wasps hunt over large areas compared to common social wasps, which inspired the name. Despite its large size, S. caementarium has a docile temperament and a very mild sting. A single individual may capture more than fifty spiders per day.
This crawler collects the following data from domains that it visits:
This is the full extent of information collected for each domain crawled. At no point should the crawler attempt to access resources outside of the /robots.txt and root pages of a site. Additionally, the crawler should never attempt to retrieve binary data such as images or videos.
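The crawl scope described above (root pages only, following links only to external hosts not yet crawled) can be sketched as follows. This is an illustrative sketch, not the crawler's actual code: the names `LinkCollector` and `new_roots` are hypothetical, and treating each distinct hostname as its own "domain" is an assumption based on the statement that subdomains are considered unique.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def new_roots(page_url, html, seen_hosts):
    """Return root URLs of external hosts linked from the page.

    Hosts already in `seen_hosts` are skipped; newly found hosts are
    added to it. Only the root path "/" is ever queued (assumption:
    any port in the link is dropped).
    """
    collector = LinkCollector()
    collector.feed(html)
    own_host = urlsplit(page_url).hostname
    roots = []
    for href in collector.hrefs:
        parts = urlsplit(urljoin(page_url, href))
        host = parts.hostname
        if parts.scheme not in ("http", "https") or not host:
            continue  # skip mailto:, javascript:, and similar links
        if host == own_host or host in seen_hosts:
            continue  # internal link or already-known host
        seen_hosts.add(host)
        roots.append(f"{parts.scheme}://{host}/")
    return roots
```

Deep links such as `https://other.example/page?q=1` are reduced to `https://other.example/`, which is what keeps the resulting map wide and shallow.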
No content retrieved during the crawl is made publicly available, mirrored to any third-party services, or used for any reason other than analysis. In addition, this data is not passed to or processed by AI services of any kind. Any data that could be considered sensitive, such as specific versions of server software or JS libraries in use, will only be released in the form of anonymized aggregates. This is to avoid identifying individual domains that may be using vulnerable or outdated software.
Sceliphron is designed to be "well-behaved". It should only fetch the root and /robots.txt resources of any given domain before moving on to another domain. Recrawling is slow: expect a minimum of 30 days, and up to a year or more, between visits. In most circumstances, it should not be necessary to disallow the crawler, and crawl traffic will be negligible.
If an HTTP 429 (Too Many Requests) error is encountered, the crawler will automatically sleep for 30 seconds. Domains that return 429 errors are put in a queue and similar domains are avoided until the domain drops out of the queue or one hour passes. The crawler also discourages crawling consecutive similar domain names (i.e. subdomains) to help spread requests out over longer time periods.
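The 429 handling described above can be sketched as follows. This is a rough illustration under stated assumptions, not the crawler's implementation: `RateLimitQueue` and `similarity_key` are hypothetical names, "similar domains" is approximated as hosts sharing their last two labels, and only the one-hour expiry is modelled (not other ways a domain might drop out of the queue).

```python
import time

SLEEP_ON_429 = 30   # seconds to pause after a 429, per the text
QUEUE_TTL = 3600    # seconds a domain stays queued (one hour), per the text


def similarity_key(host: str) -> str:
    """Assumed notion of 'similar': hosts sharing their last two labels."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


class RateLimitQueue:
    """Tracks hosts that answered 429 and which similar hosts to avoid."""

    def __init__(self, ttl=QUEUE_TTL, clock=time.monotonic, sleep=time.sleep):
        self._ttl = ttl
        self._clock = clock
        self._sleep = sleep
        self._expiry = {}  # similarity key -> time at which it leaves the queue

    def on_response(self, host: str, status: int) -> None:
        """Queue the host and pause if it signalled Too Many Requests."""
        if status == 429:
            self._expiry[similarity_key(host)] = self._clock() + self._ttl
            self._sleep(SLEEP_ON_429)

    def should_avoid(self, host: str) -> bool:
        """True while a similar host is still in the queue."""
        key = similarity_key(host)
        expiry = self._expiry.get(key)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._expiry[key]  # the hour has passed; forget the entry
            return False
        return True
```

Injecting `clock` and `sleep` keeps the sketch testable without actually waiting; a real crawler would use the defaults.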
The crawler will respect any robots.txt directive using the name Sceliphron. Here is an example robots.txt file that will block crawling by the bot:
User-agent: sceliphron
Disallow: /
Note that the robots.txt file is domain-specific. This means that subdomains will need their own robots.txt files if you wish to disallow crawler access to them. The crawler considers all subdomains of a domain to be unique, so websites that use procedurally-generated subdomains may receive excess crawl traffic.
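To check that a rule such as the one above has the intended effect, the file can be evaluated with Python's standard `urllib.robotparser`. This shows how a standards-following parser reads the example; it is an assumption that Sceliphron's own matching behaves the same way. The helper name `is_allowed` and the host `example.com` are illustrative.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from the text above.
ROBOTS_TXT = """\
User-agent: sceliphron
Disallow: /
"""


def is_allowed(robots_txt: str, product_token: str, url: str) -> bool:
    """Evaluate robots.txt content for a crawler's product token.

    Pass the short token (e.g. "Sceliphron"), not the full User-Agent
    header: RobotFileParser only compares the part before the first "/".
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(product_token, url)


print(is_allowed(ROBOTS_TXT, "Sceliphron", "https://example.com/"))
print(is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/"))
```

Because the file is evaluated per host, the same check has to be repeated against the robots.txt served by each subdomain.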
Sceliphron nodes will always identify themselves with a User-Agent header. This header takes the format of:
Mozilla/5.0 (compatible; Sceliphron/VERSION; +https://NODE.sceliphron.net)
Here, VERSION is the major revision number of the crawler, and NODE is the name of the node responsible. Each node can be reached at the subdomain of sceliphron.net that matches its name.
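For log analysis, the header format above can be matched with a regular expression. This is a minimal sketch: the function name `parse_user_agent` is hypothetical, the pattern assumes the header appears exactly as documented, and the version value `1` in the usage line is a placeholder rather than a known release.

```python
import re

# Mirrors the documented format:
#   Mozilla/5.0 (compatible; Sceliphron/VERSION; +https://NODE.sceliphron.net)
SCELIPHRON_UA = re.compile(
    r"Mozilla/5\.0 \(compatible; Sceliphron/(?P<version>[^;\s]+); "
    r"\+https://(?P<node>[A-Za-z0-9-]+)\.sceliphron\.net\)"
)


def parse_user_agent(header: str):
    """Return (version, node) for a Sceliphron User-Agent, else None."""
    match = SCELIPHRON_UA.fullmatch(header.strip())
    if match is None:
        return None
    return match.group("version"), match.group("node").lower()


# "1" is a placeholder version used only for illustration.
print(parse_user_agent(
    "Mozilla/5.0 (compatible; Sceliphron/1; +https://wasp.sceliphron.net)"
))
```

A User-Agent header is trivially spoofed, so a match here identifies a request that claims to be Sceliphron, not one that necessarily is.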
Name | Country | IP Address | Subdomain |
---|---|---|---|
Wasp | US 🇺🇸 | 209.38.64.45 | wasp.sceliphron.net |
Wespe | DE 🇩🇪 | 209.38.201.210 | wespe.sceliphron.net |
Tebuan | SG 🇸🇬 | 178.128.25.137 | tebuan.sceliphron.net |
Wesp | ZA 🇿🇦 | 139.84.227.74 | wesp.sceliphron.net |
Avispa | CL 🇨🇱 | 64.176.3.11 | avispa.sceliphron.net |
Each node is completely independent. Your site may be crawled separately by each node, although this depends on each node eventually finding your site via links.
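Since a User-Agent header can be forged, a webmaster may want to compare the source address of a request against the node table above. The sketch below hard-codes the addresses exactly as listed; the names `NODES` and `node_for_ip` are hypothetical, and the list should be treated as a snapshot that may change as nodes are added or moved.

```python
import ipaddress

# Node name -> IP address, copied from the table above (a snapshot).
NODES = {
    "wasp": "209.38.64.45",
    "wespe": "209.38.201.210",
    "tebuan": "178.128.25.137",
    "wesp": "139.84.227.74",
    "avispa": "64.176.3.11",
}

# Reverse lookup keyed by parsed address objects, so that textual
# variations of the same address compare equal.
_NODE_BY_IP = {ipaddress.ip_address(ip): name for name, ip in NODES.items()}


def node_for_ip(remote_addr: str):
    """Return the node name for a listed address, or None if unlisted."""
    try:
        return _NODE_BY_IP.get(ipaddress.ip_address(remote_addr))
    except ValueError:
        return None  # not a valid IP address at all


print(node_for_ip("209.38.64.45"))
print(node_for_ip("203.0.113.5"))
```

A request whose User-Agent names one node but whose address maps to a different node, or to none, is worth treating with suspicion.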
If you believe you have encountered a misbehaving crawler, or otherwise have concerns about the operation of the crawler on your domains that cannot be resolved by disallowance, please send an email to the following address:
For business inquiries, general questions, or other concerns, please use the email address below.
Sceliphron © MSF, 2025