If you’ve ever worked with web scrapers, you’ll know that the most irritating thing to see is a CAPTCHA. It is put in place to prevent exactly what we’re making, which is pretty infuriating really! So I set out to build a system that could, quite simply, beat the CAPTCHA.
Now a little bit about CAPTCHAs. The name stands for Completely Automated Public Turing test to tell Computers and Humans Apart.
They are used to tell computers and humans apart by placing a challenge that only humans can solve. You might have seen the above symbol around the internet. The challenge might be something like what follows.
Now, this particular system, “reCAPTCHA”, is run by Google, and let’s just say it’s pretty good at what it does. It tracks mouse movements and a bunch of more complicated signals, and then shows you these random images as an even stricter check to make sure you’re human.
We can’t break those. Sorry if you wanted to but at least right now, not happening. With all of Google’s might behind them, they’re pretty rock solid.
But all hope is not lost! Sometimes websites implement their own CAPTCHA system. And those are usually nowhere near as secure as Google’s and we will try cracking those :)
Here’s an example from the FedEx website. This has simple jumbled up letters and is easier to solve than the pictures.
But some websites, in a bid to make CAPTCHAs even easier on humans, have replaced the words and images with math problems! And that’s the sweet spot. These CAPTCHAs are really easy to fool, and that’s what we will do today.
The website I’ll be targeting in this article is the Indian Motor Vehicle Registry, “Vahan”, and in particular its vehicle search page. You can view the page at this URL: https://vahan.nic.in/nrservices/faces/user/searchstatus.xhtml
UPDATE: The captcha format used on the Vahan website has changed since this article was written.
In order to make the scraper, I’ll be using Python and Selenium. They’re pretty easy to use and if you want a deeper look into using Selenium for making scrapers, you can check out this article.
Step 1: Preparing the Scraper
The first thing to do is import all the modules required. In your Python file, add the following lines. We’re using Selenium for scraping, Requests to deal with the OCR API, and PIL to handle working with images.
Next, we’ve got to link the web driver and provide the website URL.
Make sure to replace the web driver path and the URL with your own.
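A minimal sketch of this setup could look like the following. It assumes Selenium 4 with Chrome; `open_search_page` and `DRIVER_PATH` are hypothetical names, and the driver path is only a placeholder.

```python
# Page to scrape (the Vahan vehicle search page mentioned above).
SEARCH_URL = "https://vahan.nic.in/nrservices/faces/user/searchstatus.xhtml"

# Placeholder (assumption): replace with the location of your own chromedriver.
DRIVER_PATH = "/path/to/chromedriver"


def open_search_page(url=SEARCH_URL, driver_path=DRIVER_PATH):
    """Start Chrome through Selenium and load the target page."""
    # Imported here so the constants above can be reused without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    driver = webdriver.Chrome(service=Service(executable_path=driver_path))
    driver.get(url)
    return driver
```

Calling `driver = open_search_page()` should open a browser window on the search page and hand back the driver for the next steps.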
Step 2: Hunting for the CAPTCHA
Now that we can see the webpage, it’s time to hunt for the CAPTCHA. We won’t be able to directly extract the text from the code, so we take a different approach.
First we take a screenshot of the entire webpage and save it to “image.png”.
Then we search for the element by its XPath. An easy way to find an element’s XPath is using the Chrome developer tools menu.
Make sure to replace the XPath with the path of the CAPTCHA you are hunting for.
Then, we also find the location and size of the CAPTCHA.
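Put together, the screenshot, the XPath lookup and the location/size lookup might be sketched like this. `locate_captcha` is a hypothetical helper name, `driver` is assumed to be a Selenium WebDriver, and the XPath is whatever you copied from the developer tools.

```python
def locate_captcha(driver, xpath, screenshot_path="image.png"):
    """Save a screenshot of the page and return the CAPTCHA's position and size."""
    driver.save_screenshot(screenshot_path)

    # "xpath" is the locator strategy string that Selenium's By.XPATH stands for.
    element = driver.find_element("xpath", xpath)

    # location looks like {"x": ..., "y": ...}; size like {"width": ..., "height": ...}
    return element.location, element.size
```

This sketch assumes the CAPTCHA is visible without scrolling, so the element's coordinates line up with the screenshot.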
Using PIL, we can read the entire screenshot and crop it so that it contains only the CAPTCHA. This is pretty easy as we already know the coordinates of the CAPTCHA. Once done, the new cropped image of just the CAPTCHA is saved as “cropped.png”.
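Here is a rough sketch of the crop, assuming the `location` and `size` dictionaries that Selenium reports for an element. The helper names and the `scale` parameter are my own additions: on high-DPI screens the screenshot can have more pixels than the page's CSS coordinates suggest, so you may need a scale of 2 or so.

```python
def crop_box(location, size, scale=1.0):
    """Turn Selenium's location/size dicts into PIL's (left, top, right, bottom) box."""
    left = location["x"] * scale
    top = location["y"] * scale
    right = left + size["width"] * scale
    bottom = top + size["height"] * scale
    return tuple(round(v) for v in (left, top, right, bottom))


def crop_captcha(location, size, source="image.png", target="cropped.png", scale=1.0):
    """Cut the CAPTCHA out of the screenshot and save it as its own image."""
    from PIL import Image  # Pillow; imported here so crop_box works without it

    with Image.open(source) as screenshot:
        screenshot.crop(crop_box(location, size, scale)).save(target)
```

If the cropped image shows the wrong part of the page, the scale factor is the first thing worth checking.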
Step 3: Extracting text from the CAPTCHA
There are multiple ways to go about this as it is effectively now an OCR problem. You can use a Cloud ML provider or use a model on your local machine.
The route I’ve chosen to go with is Azure Cognitive Services’ Read API. It’s very easy to use, fast and free! (Up to 5,000 requests a month, and very cheap even after that.) You can learn more about it here.
We first define the subscription key and endpoint. You’ll get those when creating a Cognitive Services resource on Azure.
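One way to define these without pasting secrets into the script is to read them from environment variables. This is only a sketch: the variable names `AZURE_VISION_KEY` and `AZURE_VISION_ENDPOINT` and the helper `load_azure_credentials` are hypothetical, not something Azure prescribes.

```python
import os


def load_azure_credentials(env=None):
    """Read the Cognitive Services key and endpoint from environment variables."""
    env = os.environ if env is None else env
    # Hypothetical variable names; use whatever naming suits your setup.
    key = env.get("AZURE_VISION_KEY", "")
    endpoint = env.get("AZURE_VISION_ENDPOINT", "")
    if not key or not endpoint:
        raise RuntimeError("Set AZURE_VISION_KEY and AZURE_VISION_ENDPOINT first")
    # Drop a trailing slash so URL paths can be appended cleanly later.
    return key, endpoint.rstrip("/")
```

Hard-coding the two values as plain strings works just as well for a quick experiment.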
Next, we open our cropped image and send it to the Read API URL as an octet-stream which basically means the raw bytes. The language “English” parameter is included to give the model a hint of what to look for.
This API call returns a response containing a location to find our actual answer and not the answer itself. This is so that the program can continue executing and the flow isn’t held up.
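The request could be sketched roughly as follows. The `v3.2` path is an assumption about which version of the Read API you target, and both function names are hypothetical; the header names are the ones the Cognitive Services REST API uses.

```python
def build_read_request(endpoint, subscription_key):
    """Return the URL, headers and query parameters for a Read API call."""
    # Assumed API version (v3.2); adjust the path to the version you target.
    url = endpoint.rstrip("/") + "/vision/v3.2/read/analyze"
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": "application/octet-stream",  # body is raw image bytes
    }
    params = {"language": "en"}  # hint that the text is English
    return url, headers, params


def submit_captcha_image(endpoint, subscription_key, image_path="cropped.png"):
    """Upload the cropped CAPTCHA and return the URL to poll for the result."""
    import requests  # imported here so build_read_request has no dependencies

    url, headers, params = build_read_request(endpoint, subscription_key)
    with open(image_path, "rb") as image_file:
        response = requests.post(
            url, headers=headers, params=params, data=image_file.read()
        )
    response.raise_for_status()
    # The answer is not in the body; this header says where to fetch it later.
    return response.headers["Operation-Location"]
```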
Next, we keep hitting the URL that the response specified until we get an answer. This is known as polling.
When the API finally returns our processed output, we store it in the variable result and print it.
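A polling loop along these lines might look like the sketch below. `poll_for_result` and `fetch_status` are hypothetical names, and the status strings are what I'd expect from the Read API, so treat them as an assumption and check them against a real response.

```python
import time


def poll_for_result(fetch_status, interval=1.0, max_attempts=30, sleep=time.sleep):
    """Call fetch_status() repeatedly until the operation finishes."""
    for _ in range(max_attempts):
        result = fetch_status()
        # Assumed status values: "notStarted", "running", "succeeded", "failed".
        if result.get("status") in ("succeeded", "failed"):
            return result
        sleep(interval)
    raise TimeoutError("Read operation did not finish in time")
```

With Requests, `fetch_status` could be something like `lambda: requests.get(operation_url, headers={"Ocp-Apim-Subscription-Key": subscription_key}).json()`, where `operation_url` is the location returned by the first call. The attempt limit is there so a stuck request can't keep the scraper waiting forever.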
Observe the printed output carefully. It’s a JSON with a bunch of values but somewhere in the middle, an attribute called “text” should have our processed output!
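Rather than picking the value out by eye, you could pull it out with a small helper. The sketch below assumes the v3.x response layout (`analyzeResult`, then `readResults`, then `lines`, each with a `text` field); `extract_text` is a hypothetical name, and the keys are worth checking against your own printed output.

```python
def extract_text(result):
    """Collect every recognised line of text from a Read API result."""
    lines = []
    # Assumed layout: analyzeResult -> readResults (pages) -> lines -> text.
    for page in result.get("analyzeResult", {}).get("readResults", []):
        for line in page.get("lines", []):
            lines.append(line["text"])
    return " ".join(lines)
```

A CAPTCHA is usually a single line, so joining with spaces mostly matters when the OCR splits it into pieces.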
Our image has successfully been converted to text, and now comes the easiest part: actually solving it!
Step 4: Solving the CAPTCHA
Now, depending on how your CAPTCHA looks, the method of solving it may be different. You’ll have to do a bit of recon and a few trials to figure it out. In my case, there are 6 possible CAPTCHAs (Greater, Lesser, +, -, *, /).
The way I go about this is to split the string and search for the corresponding symbols to figure out what operation to perform.
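As an illustration of that idea, here is a sketch that pulls the first two numbers out of the OCR text and picks an operation from the keywords or symbols it finds. The exact wording of the prompts ("greater", "lesser") and the use of integer division are assumptions, and `solve_captcha` is a hypothetical name rather than the code from the repository.

```python
import re


def solve_captcha(text):
    """Work out the answer to a two-number math CAPTCHA from its OCR text."""
    lowered = text.lower()
    numbers = [int(n) for n in re.findall(r"\d+", lowered)]
    if len(numbers) < 2:
        raise ValueError(f"Expected two numbers in {text!r}")
    a, b = numbers[0], numbers[1]

    # Keyword prompts are checked first, then the arithmetic symbols.
    if "greater" in lowered:
        return max(a, b)
    if "lesser" in lowered:
        return min(a, b)
    if "+" in lowered:
        return a + b
    if "-" in lowered:
        return a - b
    if "*" in lowered:
        return a * b
    if "/" in lowered:
        return a // b  # assumption: the CAPTCHA only asks for whole-number results
    raise ValueError(f"No known operation in {text!r}")
```

OCR output can be noisy (an "x" instead of "*", for instance), so expect to add a few extra cases after some trial runs.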
The method you use for your CAPTCHA may differ widely.
Step 5: Proceed with Scraping!
Now that the CAPTCHA is cracked, submitting the form with any value we would like is not a problem. No obstacles to stop this bot!
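Submitting the form could be sketched like this, assuming `driver` is a Selenium WebDriver. The function name and the `xpaths` dictionary keys are hypothetical; you would fill in the XPaths of the real search box, CAPTCHA field and submit button on your target page.

```python
def submit_search(driver, query, captcha_answer, xpaths):
    """Fill in the search box and CAPTCHA answer, then submit the form."""
    # xpaths is a dict with hypothetical keys: "query", "captcha" and "submit".
    driver.find_element("xpath", xpaths["query"]).send_keys(query)
    driver.find_element("xpath", xpaths["captcha"]).send_keys(str(captcha_answer))
    driver.find_element("xpath", xpaths["submit"]).click()
```

After the click, the results page can be scraped with the same `find_element` calls used everywhere else.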
In my case, I’m searching the vehicle database, but you can modify this to any website you please!
And that’s how we can circumvent a CAPTCHA to easily scrape data. You can find all the code used in this article in the following GitHub repository.
Code that is optimised to break captchas from a certain Indian Vehicle database. Learn how it works here…
Have fun and scrape safe! Remember to always follow the rules of the website you’re scraping.