Create a Web Crawler 101

If you would like to learn how to create a web crawler (also called a spider or a bot), it is actually a lot simpler than you may think. In this short post I will show just how easy it is to obtain the mark-up from another website, and how you can then parse the data to use for your own evil pleasure. In PHP, getting the markup of another site takes a single call to file_get_contents, as shown below:

<?php
$webpage = file_get_contents('http://www.tonylea.com');
?>

Now, the variable $webpage contains all the mark-up (source) for http://www.tonylea.com.
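One caveat: file_get_contents returns false when the request fails, and fetching a URL only works when the allow_url_fopen setting is enabled. Here is a minimal sketch of a wrapper that checks for this. The fetch_page name, the 10 second timeout, and the user agent string are my own placeholders, not anything the code above requires.

```php
<?php
// Hypothetical helper: fetch a page and return null (instead of false) on failure.
function fetch_page($url, $timeoutSeconds = 10)
{
    // Options for PHP's http stream wrapper; both values are illustrative choices.
    $context = stream_context_create(array(
        'http' => array(
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'ExampleCrawler/1.0', // placeholder name
        ),
    ));

    // The @ suppresses the warning PHP emits on failure; we check the result instead.
    $markup = @file_get_contents($url, false, $context);

    return $markup === false ? null : $markup;
}
?>
```

Calling fetch_page('http://www.tonylea.com') then gives you either the markup or null, so you can bail out early before trying to parse anything.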

To parse that data, we could do something like the following:

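<?php
$url = 'http://www.tonylea.com';
$webpage = file_get_contents($url);

// Return the attribute string of every <img> tag (an empty array if there are none).
function get_images($page)
{
     if (empty($page)) {
          return array();
     }
     // Matches both <img ...> and self-closing <img ... />
     preg_match_all('/<img\s([^>]*?)\s*\/?>/i', $page, $images);
     return $images[1];
}

// Return the attribute string of every <a> tag (an empty array if there are none).
function get_links($page)
{
     if (empty($page)) {
          return array();
     }
     // \s after <a keeps tags such as <abbr> from matching; the s flag lets link text span lines
     preg_match_all('/<a\s([^>]+)>(.*?)<\/a>/is', $page, $links);
     return $links[1];
}

// Both functions always return an array, so the loop is safe even when nothing matched.
$images = get_images($webpage);
foreach ($images as $image)
{
     echo $image.'<br />';
}
?>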

The example above fetches the mark-up from the specified URL and defines two functions: get_images, which collects the attributes of each 'img' tag, and get_links, which does the same for each 'a' tag. The code then prints out the data that is in the 'img' tags. With a bit more parsing you can display images and links from the page you have scraped or crawled.
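As a rough idea of what that extra parsing could look like, here is a small sketch that pulls a single attribute (such as src or href) out of one of those attribute strings. The get_attribute function and the sample strings are hypothetical, made up for illustration, and the pattern only handles quoted attribute values.

```php
<?php
// Hypothetical helper: pull one attribute's value out of a tag's attribute string,
// e.g. one of the strings returned by get_images() or get_links().
function get_attribute($attributes, $name)
{
    // Matches name="value" or name='value'; unquoted values are not handled.
    $pattern = '/(?:^|\s)' . preg_quote($name, '/') . '\s*=\s*(["\'])(.*?)\1/is';
    if (preg_match($pattern, $attributes, $match)) {
        return $match[2];
    }
    return null; // attribute not present
}

// Sample attribute strings standing in for real get_images() / get_links() output.
$image_attributes = 'src="/images/logo.png" alt="Logo"';
$link_attributes  = "class='nav' href='/about'";

echo get_attribute($image_attributes, 'src') . "\n"; // /images/logo.png
echo get_attribute($link_attributes, 'href') . "\n"; // /about
?>
```

Regular expressions are fine for a quick experiment like this. For messy real-world markup, PHP's DOMDocument class is usually the sturdier option.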

Very cool stuff that you can build upon to do some awesome web crawling :)
