Introduction
This is a continuation of our series on web crawlers for large language models at scale. In part 1 we discussed how to extract the desired data from an HTML page for an LLM; if you have not gone through it, see here. In this part we shall talk about how to write a web crawler for a website. As an example, I will explain step by step how to scrape http://www.tajhotels.com. I will discuss the steps in the following order:
- Scope out the problem: what do we want to do?
- URL Discovery module to get all the final URLs.
- Data Extraction module to get desired data from the final URL pages.
- Dumping all the data in JSON.
- Challenges of web scraping and remedies.
- Conclusion.
Scope out the problem: what do we want to do?
Just a disclaimer: before scraping any website, make sure to check its robots.txt, as mentioned in the previous part. Now let's dive in and define the scope of the problem.
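As a quick refresher from part 1, a minimal sketch of such a check might look like this. The Disallow matching below is deliberately naive and only for illustration; a real crawler should use a proper robots.txt parser that honours user-agent groups, Allow rules and wildcards.
require 'open-uri'

# Fetch robots.txt and collect all Disallow rules (naive: ignores user-agent groups).
robots = URI.open("https://www.tajhotels.com/robots.txt").read
disallowed = robots.lines.
  select { |line| line.strip.start_with?("Disallow:") }.
  map { |line| line.split(":", 2).last.strip }

path = "/en-in/our-hotels/"
if disallowed.any? { |rule| !rule.empty? && path.start_with?(rule) }
  puts "Skipping #{path}: disallowed by robots.txt"
else
  puts "OK to crawl #{path}"
end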
In this blog our goal is to find all the final hotel pages for tajhotels, each of which contains information about the types of suites available at a hotel in a given location.
Further on, the final output of the extraction module will be a set of JSON files, where each file contains an entry for the hotel name and location, plus a nested hash for each suite with its details, including the description.
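For example, the file for one hotel would look roughly like this (a trimmed sketch of the shape we are aiming for; the full output appears at the end of the post):
{
  "HotelName": "Taj Hotel & Convention Centre, Agra",
  "Address": "Taj East Gate Road, Agra, ...",
  "Check_In": "Check-in time: 2:00 PM",
  "Check_out": "Check-out time: 12:00 Noon",
  "Room_1": { "RoomType": "Deluxe Suite", "Desc": "..." },
  "Room_2": { "RoomType": "Luxury Suite", "Desc": "..." }
}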
URL Discovery module to get all the final URLs
In this module I will discuss a possible approach to collect all the links starting from a start URL. One way of thinking about this problem is as a tree traversal: you have a root URL, underneath that the URLs per location, and underneath those the final hotel URLs. Below is a rough sketch of this tree for our example site:
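Root URL (list of all hotels)
├── Location URL (e.g. Agra)
│   ├── Final hotel URL
│   └── Final hotel URL
└── Location URL (e.g. Ahmedabad)
    └── Final hotel URL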
So to get all these product URLs (i.e. hotel URLs) we can do a DFS or BFS traversal. We take the BFS approach here for the following reasons:
- We want to make sure we do not move to the next level before collecting all the URLs from level 1.
- This approach scales naturally to distributed crawlers, where each app server works on a small batch of URLs from a given level.
Now let's discuss the algorithm for this approach:
- Create an XPath for each level, so that we can apply the right XPath at each level.
- While the current level is not the final level, finish traversing it before proceeding to the next.
- If the current level is the final level, push the final URLs to the product list.
Let's dive into the implementation of this module. The code itself is fairly straightforward: we keep a single queue and use level markers ("level_1", "level_2", "final") to separate one level's URLs from the next's, and since we always unshift onto the front and pop from the back, the queue behaves FIFO, giving us a BFS. Getting this to work in production, however, is a big challenge, which I discuss further down.
require 'open-uri'
require 'nokogiri'
require 'json'

queue = []
# XPath to use at each level of the site tree.
levelHash = Hash.new
levelHash["level_1"] = "//a[contains(@class,'hotel-details')]/@href"
levelHash["level_2"] = "//div[contains(@class,'mr-list-hotel-price')]/a/@href"

# Seed the queue with the root URL; level markers separate one level's URLs from the next's.
queue.unshift "level_1"
queue.unshift "https://www.tajhotels.com/en-in/our-hotels/"

xpath = ""
productUrl = []
while not queue.empty?
  url = queue.pop
  if url.include? "level_1"
    xpath = levelHash["level_1"]
    queue.unshift "level_2"
  elsif url.include? "level_2"
    xpath = levelHash["level_2"]
    queue.unshift "final"
  elsif url.include? "final"
    # Everything left in the queue is a final hotel page URL.
    while not queue.empty?
      productUrl.push(queue.pop)
    end
  else
    page = URI.open(url, "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36")
    document = Nokogiri::HTML(page.read)
    document.xpath(xpath).each do |link|
      if link.text.include? "destination"
        # Relative destination link: prepend the domain.
        queue.unshift "https://www.tajhotels.com" + link.text
      elsif link.text.start_with? "http"
        # Absolute URL: drop any query string.
        queue.unshift link.text.split("?").first
      else
        # Any other relative link: prepend the domain.
        queue.unshift "https://www.tajhotels.com" + link.text
      end
    end
  end
end
A few of the final URLs from the queue:
https://www.tajhotels.com/en-in/taj/taj-hotel-and-convention-centre-agra/
https://www.seleqtionshotels.com/en-in/taj-view-agra/
https://www.tajhotels.com/en-in/taj/taj-skyline-ahmedabad/
https://www.vivantahotels.com/en-in/vivanta-ahmedabad/
Data Extraction module to get desired data from the final URL pages
Now that we have found the final product URLs, we will process these final pages one by one. For illustration purposes we will extract the items below from each final product page, but the exact set is up to us, depending on the requirements. Just a hint: in a production environment we would have to take a distributed crawler approach for scale. Now let's dive into the code.
- Hotel Name
- Address
- Check-in/check-out time
- Room Type
- Room Desc
productUrl.each do |url|
  puts url
  page = URI.open(url, "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36")
  html = page.read
  document = Nokogiri::HTML(html)
  # Top-level hotel details.
  puts "HotelName: " + document.xpath("//div[contains(@class,'first-section')]/h1/span").text.strip
  puts "Address: " + document.xpath("//span[contains(@class,'location-address')]/span").text.strip
  puts "Check-In time: " + document.xpath("//li[contains(text(),'Check-in')]").first.text.strip
  puts "Check-out time: " + document.xpath("//li[contains(text(),'Check-out')]").first.text.strip
  # One content card per room/suite.
  document.xpath("//div[contains(@class,'rooms-and-suites-card')]//div[contains(@class,'rooms-and-suites-content')]").each do |room|
    puts "Room Type: " + room.xpath("./div[contains(@class,'room-title')]").first.text.strip
    puts "Room description: " + room.xpath("./div[contains(@class,'room-description')]").first.text.strip
  end
end
Dumping all the data in JSON
We only need minor changes to the code we discussed in part 1 of this web scraping series: we have to add a hash inside the hash, one per RoomType.
fileCounter = 1
productUrl.each do |url|
  page = URI.open(url, "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36")
  html = page.read
  document = Nokogiri::HTML(html)
  # Top-level hotel details.
  hotel = document.xpath("//div[contains(@class,'first-section')]/h1/span").text.strip
  address = document.xpath("//span[contains(@class,'location-address')]/span").text.strip
  checkIn = document.xpath("//li[contains(text(),'Check-in')]").first.text.strip
  checkOut = document.xpath("//li[contains(text(),'Check-out')]").first.text.strip
  jsonHash = Hash.new
  jsonHash["HotelName"] = hotel
  jsonHash["Address"] = address
  jsonHash["Check_In"] = checkIn
  jsonHash["Check_out"] = checkOut
  # One nested hash per room/suite, keyed Room_1, Room_2, ...
  roomCnt = 1
  document.xpath("//div[contains(@class,'rooms-and-suites-card')]//div[contains(@class,'rooms-and-suites-content')]").each do |room|
    roomType = room.xpath("./div[contains(@class,'room-title')]").first.text.strip
    roomDesc = room.xpath("./div[contains(@class,'room-description')]").first.text.strip
    innerJsonHash = Hash.new
    innerJsonHash["RoomType"] = roomType
    innerJsonHash["Desc"] = roomDesc
    jsonHash["Room_" + roomCnt.to_s] = innerJsonHash
    roomCnt = roomCnt + 1
  end
  # One JSON file per hotel.
  filePath = "productDetail_" + fileCounter.to_s + ".json"
  File.write(filePath, JSON.dump(jsonHash))
  fileCounter = fileCounter + 1
end
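As a quick sanity check, you can read one of the dumped files back with the same JSON library and spot-check a few fields:
require 'json'

# Parse a dumped file and print the hotel name plus each room type.
data = JSON.parse(File.read("productDetail_1.json"))
puts data["HotelName"]
data.each do |key, value|
  puts "#{key}: #{value["RoomType"]}" if key.start_with?("Room_")
end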
You can copy the output below into any online JSON viewer and inspect the results for yourself.
Output in productDetail_1.json:
{"HotelName":"Taj Hotel & Convention Centre, Agra","Address":"Taj East Gate Road, Agra, Uttar Pradesh, India, Uttar Pradesh, 282001, India","Check_In":"Check-in time: 2:00 PM","Check_out":"Check-out time: 12:00 Noon","Room_1":{"RoomType":"Deluxe Suite","Desc":"Welcome to the suite life. Our Deluxe Suite features a generous living room perfect for a family staycation. The upscale amenities, spacious bedroom and modern bathroom are complemented with handsome furnishings and lighting fixtures."},"Room_2":{"RoomType":"Luxury Suite","Desc":"The Luxury Suite is for those who don’t like to be hemmed in. Encompassing a generous 80 square metres, this suite comes with separate living and dining areas, a 4 fixture washroom with standalone bathtub, a separate powder room, a walk-in wardrobe and a writing table. Intricate inlay work, called Pietra Dura, adorns the walls. "},"Room_3":{"RoomType":"Presidential Suite","Desc":"The Presidential Suite is every bit as impressive as its name suggests. Encompassing an area of\n140 Sq Mt (the size of a medium-sized apartment), this suite features a luxurious bedroom, a\nvast living area with an 8-seater table, a bathroom with a sunken bathtub and a walk-in rain\nshower, a guest powder room and a separate walk-in wardrobe. Every room in the suite comes\nwith a view of the pool. The suite also opens out onto private lawns that are perfect to\nwelcome soft mornings and balmy evenings."},"Room_4":{"RoomType":"Superior Room","Desc":"Our superior rooms exude an air of eminence. Enveloped in soothing colours with wooden flooring, these rooms feature large, airy, sun-kissed windows, plush armchairs and ottomans and elegant drop-down lights."},"Room_5":{"RoomType":"Superior Room Pool View","Desc":"Wake up to tranquil mornings with refreshing views of the pool. The airy, well-lit accommodations in Agraare furnished with armchairs, ottomans and drop-down lights above the bedside tables."},"Room_6":{"RoomType":"Deluxe Room","Desc":"These generously appointed rooms in Taj Hotel & Convention Centre, Agra make for an opulent affair. With luxurious bathrooms that feature a standing bath tub and walk-in rain shower. Exquisite inlay work that showcases the local artisanship of Agra in all its glory. And swish furnishings and fittings that add to the experience."},"Room_7":{"RoomType":"Taj Club Room Taj Mahal View","Desc":"These Club Rooms are with an added allure: shimmering views of the Taj Mahal from our luxury hotel rooms. You can feel the aura of the world’s most beloved monument when you stay here. These swish rooms also feature modern furnishings and amenities. And luxurious bathrooms that with a standing bath tub and walk-in rain shower."},"Room_8":{"RoomType":"Taj Club Room","Desc":"Taj Club Rooms offer city view and are located on the top floor of the hotel, Exclusive benefits include one way airport transfer from Agra Airport, daily breakfast, evening cocktails hours at the club lounge and host of other privileges"}}
Challenges of web scraping and remedies
Web scraping is not that easy; sometimes it becomes very challenging, because the websites being scraped have many different mechanisms in place, like bot detection, rate limiting and other security measures. If a website detects that the client is a bot, it keeps responding with a 403 (Forbidden) HTTP/HTTPS status code.
But there are remedies that can be applied to make it work (a small sketch combining the first few of these follows the list):
- Add a delay before requesting the next page.
- Add retries if incorrect page content is received.
- Use a different user-agent for each web page request.
- Use proxy addresses to fetch the page contents.
- Use a distributed web crawler approach at the URL discovery/page fetcher stage.
- Sometimes the data you want to scrape comes via a POST request rather than a GET request. In such cases we have to write some more code to request that data via POST. We will discuss how to write POST requests in another blog (hint: try a tool like Postman).
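To make the first three remedies concrete, here is a minimal sketch of a polite fetch helper that adds a delay, retries on failure and rotates user-agents. The user-agent strings, delay and retry counts here are illustrative assumptions, not tuned values.
require 'open-uri'

# Illustrative pool of user-agent strings to rotate through (assumed values).
USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15"
]

def polite_fetch(url, max_retries: 3, delay: 2)
  attempts = 0
  begin
    sleep delay                     # be polite: wait before every request
    ua = USER_AGENTS.sample         # rotate user-agents across requests
    URI.open(url, "User-Agent" => ua).read
  rescue OpenURI::HTTPError, SocketError => e
    attempts += 1
    retry if attempts < max_retries # retry on 403s, network errors, etc.
    raise e
  end
end

html = polite_fetch("https://www.tajhotels.com/en-in/our-hotels/")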
Conclusion
In this blog we discussed how to write a web crawler for a website and get the desired data for an LLM to be trained on. We also introduced the concept of a distributed crawler for better scaling. In the next part I shall share more details on the set of stages a web crawler needs, and we will discuss a high-level architectural design to crawl lots of websites in parallel. So stay tuned.