
How to write a web crawler for Large Language Models at scale (Part 1)

Introduction

In today's AI world, large language models (LLMs) like ChatGPT from OpenAI have become famous overnight. But do you know how these models get their training data? If not, you are in the right place.

It is web crawlers that feed data to these models. Most of us have heard the word "crawler" before, but in case you are not familiar with it, let me define it. Web crawling is the process of discovering new pages by following links from pages already visited; all it needs is a starting page URL.

As this blog is fairly long, I will split it into three parts. In this part, I will discuss how to extract specific data from a web page's HTML, and the exact process that should be followed.

Below is the order in which I will lay out the details. For illustration purposes I will use an amazon.com product page as an example, and I will use the Ruby language (note that the process is the same in any other language).

  • What is robots.txt?
  • How to download a web page?
  • How to parse the HTML page to dom?
  • How to extract data from dom using XPATH?
  • Dumping the data into a JSON file
  • Conclusion

What is robots.txt?

robots.txt is essentially a set of guidelines for crawlers: it defines which parts of a site crawlers are (and are not) allowed to visit. If you open https://www.amazon.com/robots.txt you will see many entries that say Allow or Disallow.

What this means is that if a path is listed under Allow:, a crawler may scrape the pages underneath it; otherwise it may not. As a disclaimer: it is unethical to scrape data from Disallow paths, so always check robots.txt before scraping any site.

Some sites, such as Twitter, only allow you to collect their data via their public APIs and not otherwise. Below is an excerpt from a robots.txt file.

Disallow: /creatorhub
Disallow: /creatorhub/*
Disallow: /slp/s$
Disallow: /-/
Allow: /-/es/
Allow: /-/en$
Allow: /-/zh_TW/
Allow: /-/zh_TW$
Allow: /-/he/
Allow: /-/he$
Allow: /gp/offer-listing/B000
Allow: /gp/offer-listing/9000
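
The Allow/Disallow rules above can be honored programmatically. Below is a minimal, illustrative sketch (my own helper, not a full parser): it collects the Disallow prefixes from a robots.txt body and tests whether a path falls under any of them. A real crawler should also handle per-agent sections, `*`/`$` wildcards, and the rule that a more specific Allow overrides a broader Disallow.

```ruby
# Sample rules, taken from the excerpt above.
ROBOTS_TXT = <<~ROBOTS
  User-agent: *
  Disallow: /creatorhub
  Disallow: /-/
  Allow: /-/es/
ROBOTS

# Collect every path prefix listed under Disallow:.
def disallowed_prefixes(robots_txt)
  robots_txt.lines
            .map(&:strip)
            .select { |line| line.start_with?("Disallow:") }
            .map { |line| line.split(":", 2).last.strip }
            .reject(&:empty?)
end

# A path is allowed if it does not start with any Disallow prefix.
# (Simplification: ignores Allow overrides and wildcards.)
def path_allowed?(path, robots_txt)
  disallowed_prefixes(robots_txt).none? { |prefix| path.start_with?(prefix) }
end

puts path_allowed?("/creatorhub/page", ROBOTS_TXT)     # false
puts path_allowed?("/gp/offer-listing/B000", ROBOTS_TXT) # true
```

For production use, a maintained robots.txt parsing library is a better choice than rolling your own.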

How to download a web page?

There are many tools that can download a web page, curl for instance. But since we are building a web crawler program, let's use Ruby's open-uri module to download the page. This is essentially a GET request to fetch the page. Sometimes we need POST requests to get the desired data, but to keep this post simple we will stick to GET; I will cover POST requests in another post.

require 'open-uri'
# Fetch the product page with a browser-like User-Agent header,
# since many sites reject requests with a missing or default one.
page = URI.open("https://www.amazon.in/Samsung-Fully-Automatic-WA62M4100HY-TL-Imperial/dp/B0747XV38N", "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36")
html = page.read
puts html

How to parse the HTML page to DOM?

The DOM (Document Object Model) is an application programming interface (API) for HTML and XML documents. It defines the logical structure of a document and the way the document is accessed and manipulated.

To extract data from an HTML page while crawling, we need a way to turn the page into a document structure that can be easily manipulated. In this blog post we will use Nokogiri to parse the HTML page into a DOM.

require 'nokogiri'
# Parse the raw HTML string into a navigable DOM tree.
document = Nokogiri::HTML(html)

How to extract data from DOM using XPATH?

XPath stands for XML Path Language; it uses a path-like syntax to identify and navigate nodes in an XML document. Let's use it to extract some data from the downloaded page. I will demonstrate extracting the product name, rating, total ratings, brand, etc. from the page, using the XPath expressions below. Don't worry about how to come up with these XPaths for now; I will put some nice links in the references, and there are also good tools such as XPath Helper.

product name: //span[contains(@id,'productTitle')]
rating: //div[contains(@id,'averageCustomerReviews_feature')]//span[contains(@class,'a-color-base')]
total rating: //span[contains(@id,'CustomerReviewText')]
brand: //th[contains(text(),'Brand')]//following-sibling::td

puts "Title: " + document.xpath("//span[contains(@id,'productTitle')]").text.strip
puts "Rating: " + document.xpath("//div[contains(@id,'averageCustomerReviews_feature')]//span[contains(@class,'a-color-base')]").text.strip
puts "Total Customer Ratings: " + document.xpath("//span[contains(@id,'CustomerReviewText')]").first.text.strip
puts "Brand: " + document.xpath("//th[contains(text(),'Brand')]//following-sibling::td").text.strip

Output:

Title: Samsung 6.2 kg Fully-Automatic Top load Washing Machine (WA62M4100HY/TL, Imperial Silver, Center Jet Technology)
Rating: 4.3
Total Customer Ratings: 15,484 ratings
Brand: ‎Samsung

Dumping the data in the Json file

There is no special reason I chose JSON as the output file format for the product data; it is just for demonstration, and you could equally dump the scraped data to XML or YAML files. In Ruby it is fairly easy to work with JSON files, as they are very flexible in terms of schema.

require 'open-uri'
require 'nokogiri'
require 'json'

# Download the product page and parse it into a DOM.
page = URI.open("https://www.amazon.in/Samsung-Fully-Automatic-WA62M4100HY-TL-Imperial/dp/B0747XV38N", "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36")
html = page.read
document = Nokogiri::HTML(html)

# Extract the fields of interest via XPath.
title = document.xpath("//span[contains(@id,'productTitle')]").text.strip
rating = document.xpath("//div[contains(@id,'averageCustomerReviews_feature')]//span[contains(@class,'a-color-base')]").text.strip
total_reviews = document.xpath("//span[contains(@id,'CustomerReviewText')]").first.text.strip
brand = document.xpath("//th[contains(text(),'Brand')]//following-sibling::td").text.strip

# Collect the fields into a hash and write it out as JSON.
product = {
  "title" => title,
  "brand" => brand,
  "rating" => rating,
  "total_reviews" => total_reviews
}
File.write('./productDetail.json', JSON.dump(product))

productDetail.json output:

{"title":"Samsung 6.2 kg Fully-Automatic Top load Washing Machine (WA62M4100HY/TL, Imperial Silver, Center Jet Technology)","brand":"‎Samsung","rating":"4.3","total_reviews":"15,484 ratings"}
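
The dumped file is just as easy to consume in a later processing step. A quick round-trip sketch (field values here are illustrative, not taken from the scraped page):

```ruby
require 'json'

# Dump a hash to a JSON file, then read and parse it back.
product = { "title" => "Example Product", "rating" => "4.3" }
File.write('demo.json', JSON.dump(product))
parsed = JSON.parse(File.read('demo.json'))
puts parsed["title"] # "Example Product"
```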

Conclusion

To summarize: I used Ruby purely for demonstration purposes, to show how to extract data from any HTML page. Any other language can be used and will work just as well. In the next part I will discuss how to write a web crawler that scrapes data from an entire website. So stay tuned!

References

https://developer.mozilla.org/en-US/docs/Web/XPath
https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model/Introduction
https://www.oracle.com/in/database/what-is-json/