Crawling

Crawling (sometimes called “spidering”) is a common technique computers use to discover the content of a website. Major search engines like Google rely on crawling, as does Silktide.

Crawling is a simple process:

  1. Download a webpage
  2. Remember all the pages that webpage links to
  3. If you have pages you haven’t downloaded yet, repeat from #1
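
In code, this loop can be sketched in a few lines. The example below is a deliberately minimal illustration in Python, using only the standard library; a real crawler like Silktide’s or Google’s adds error handling, politeness rules, and much more besides.

```python
# A minimal sketch of the three-step crawl loop above.
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page (step 2)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url):
    to_visit = [start_url]   # pages we know about but haven't downloaded
    visited = set()          # pages we've already downloaded
    site = urlparse(start_url).netloc

    while to_visit:          # step 3: repeat while undownloaded pages remain
        url = to_visit.pop()
        if url in visited:
            continue
        visited.add(url)

        # Step 1: download the page
        html = urlopen(url).read().decode("utf-8", errors="replace")

        # Step 2: remember every page it links to
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute, _ = urldefrag(urljoin(url, link))  # resolve, drop #fragments
            if urlparse(absolute).netloc == site:        # stay on the same site
                to_visit.append(absolute)

    return visited
```

Calling `crawl("https://example.com/")` would return every page reachable by following links from that starting address – which is exactly why unlinked pages are never found, as the next section explains.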

This is somewhat simplified, but it illustrates several important concepts:

You can only crawl pages that are linked to

If a page isn’t linked to, there is no way for crawling to discover it. This applies to Silktide and Google alike: a web address that is (say) printed on a poster but never linked to from your website is known as an ‘orphaned page’, and will never be crawled.

Crawling takes time

To crawl a website, a page must be downloaded to find more pages, which find more pages, and so on, until every page has been found. Most crawlers – including Google’s and Silktide’s – download multiple pages at once to speed this up, but it still takes time. Crawl a website too quickly and you will put too much demand on the server, which can slow it down or even crash it.
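
Crawlers typically manage this trade-off with a small pool of parallel downloads and a pause between requests. The sketch below illustrates the idea in Python; the worker count and delay are made-up values, not Silktide’s actual settings.

```python
# A sketch of "polite" concurrent downloading: a few parallel workers,
# each pausing between requests so the target site isn't overwhelmed.
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

MAX_WORKERS = 4     # download at most 4 pages at once (illustrative)
CRAWL_DELAY = 1.0   # seconds each worker waits between requests (illustrative)


def fetch(url):
    html = urlopen(url).read()
    time.sleep(CRAWL_DELAY)  # be polite: leave a gap before the next request
    return url, html


def fetch_all(urls):
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        return list(pool.map(fetch, urls))
```

Even with four workers and a one-second delay, a 10,000-page site takes around 40 minutes to download – much faster than one page at a time, but far from instant.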

Crawling can go on forever

Many websites include so-called ‘spider traps’: sets of links that go on forever. A common example is a calendar. Typically a calendar contains a link to view the next day, and the next day, and so on. These links can continue until the year 300,000 AD and beyond. A crawler doesn’t understand that following them makes no sense, and will keep trying to find the end of a website that has no end.

As a result, most crawlers have built-in constraints that make them give up if they find too many pages. But these constraints can prevent ‘real’ pages from being discovered. To get around this in Silktide, you need to teach the crawler not to download specific pages.
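
In practice, those constraints often boil down to two checks before each download: a hard page limit, and a list of URL patterns to skip. A minimal sketch in Python follows; the limit and the calendar pattern are made-up examples, not Silktide’s actual settings or exclusion syntax.

```python
# Two common crawl constraints: a hard page limit (the "give up"
# safeguard) and exclusion patterns that skip known spider traps.
import re

MAX_PAGES = 10_000  # hypothetical ceiling on pages per crawl
EXCLUDE_PATTERNS = [
    re.compile(r"/calendar/\d{4}/\d{2}/\d{2}"),  # e.g. endless day-by-day calendar pages
]


def should_crawl(url, pages_crawled):
    if pages_crawled >= MAX_PAGES:
        return False  # hit the hard limit: give up rather than crawl forever
    if any(pattern.search(url) for pattern in EXCLUDE_PATTERNS):
        return False  # matches a known trap: skip it
    return True
```

Excluding a trap by pattern, rather than relying on the page limit alone, keeps the crawler’s page budget free for the real pages you want discovered.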
