
The URL scraping/crawling functionality for Advanced Mode lets you train a DirectoryRobot bot by entering the domain root (such as example.com). After you submit it, our bot will crawl the site.
If you’re experiencing issues with this feature, there are a few potential reasons why:
- The crawler can currently only crawl around 30 pages. This is enough for the vast majority of small business websites. If you need more, one option is to use one of the many “URL to PDF” tools on the market, many of them free (search for “URL to PDF”), and upload the data as PDFs instead.
- If nothing is returned at all, the first thing to try is to refresh the page and submit the URL again.
- If still nothing is returned (or only one page, the root, is returned), either our crawler is being blocked or it is unable to navigate the page. Blocking can happen because our server is in the UK, and some international sites, usually ones with regulatory issues, block UK visitors. To check navigation, right-click the website you’re trying to scrape and choose “view page source” to get an idea of what our bot sees: are all the links easily viewable? Another scenario we’ve seen is a pop-up that disables the site until something is agreed to (like an age verification pop-up on sites for alcohol brands). This also prevents our bot from navigating the site.
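
If you'd rather check this with a script than by reading the page source, here is a minimal Python sketch. It is not DirectoryRobot's actual crawler: it simply assumes that a crawler reads the raw HTML without running JavaScript, and lists the same-site links found there. The function names and the User-Agent string are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags in raw HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_same_site_links(html, base_url):
    """Return sorted absolute links in `html` that stay on base_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(base_url).netloc
    links = set()
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        # Skip mailto:, tel:, javascript: and links to other sites.
        if parsed.scheme in ("http", "https") and parsed.netloc == host:
            links.add(absolute.split("#")[0])  # drop in-page anchors
    return sorted(links)


def fetch_html(url, timeout=10):
    """Download the raw HTML, roughly what "view page source" shows."""
    # Placeholder User-Agent (an assumption), not the one our crawler sends.
    request = Request(url, headers={"User-Agent": "link-check-sketch/0.1"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

To try it, call `extract_same_site_links(fetch_html("https://example.com/"), "https://example.com/")` with your own domain. If the list comes back empty or very short, the site's links are probably built by JavaScript or hidden behind a pop-up, which matches the "only the root is returned" symptom described above. An error such as HTTP 403 suggests the site is blocking automated visitors.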
In the rare instances where our crawler is blocked, you’ll need to train the bot using a tool like URL to PDF, or simply use classic AI mode and copy/paste the context you’d like the bot to know about into the “business context” field.
We’ll continue to develop the crawling functionality over the coming months to overcome some of these limitations.