Introduction
Let's discuss the architecture of a system that runs multiple webcrawlers for different websites to collect the desired data for large language models. How should one go about designing such a system? In this discussion I will lay out the details of the design in the following order.
- Define the scope of the problem.
- Back-of-the-envelope estimation.
- Architecture design detailed discussion.
- Conclusion
Define the scope of the problem
Let me define the scope. This is an in-house system with a limited number of external users, but we will design it for a larger scale.
Functional Requirements
- We should be able to add a webcrawler for a website and run it at a configurable frequency, for instance daily/weekly/monthly.
- We should be able to remove/update a website's webcrawler.
- The webcrawler process should be divided into distinct stages.
- We should be able to rerun a webcrawler stage if needed.
- We should be able to log high-level stats for each stage.
- The system should support analytics for a webcrawler run.
Non-Functional Requirements
- The system should be scalable.
- The system should be fault tolerant.
- The system should be secure.
Back-of-the-envelope estimation for the webcrawler architecture
Let's do a back-of-the-envelope estimation for the webcrawler design and estimate the required resources. To do that, let's make some assumptions.
Space needed to keep all the data records
- We will run 2000 webcrawlers monthly.
- Each website has an average of 5000 final URLs and data records.
- The average size of each desired data record is 100 KB.
- We want to retain the data from each webcrawler run for 5 years.
Total storage requirement for our infrastructure for 5 years = 60 * 2000 * 5000 * 100 KB = ~60 TB
Space needed to keep the HTML pages for a week, so that a specific stage can be rerun after the final URL pages have been fetched:
- Average page size: 5 MB.
Total size requirement for all the pages fetched in a week = 7 * 5000 * 2000 * 5 MB = ~350 TB
So the total space requirement is ~410 TB.
SQL db total size requirement:
- We want to store stats for each crawler run. Each stage of the crawler adds these stats.
- We also want to store, for each stage, the server and path where the webcrawler ran.
Assume the above two points take about 50 KB of storage per run.
Total DB size for 5 years = 60 * 2000 * 50 KB = ~6 GB
NoSQL DB total size requirement:
- Assume half of the webcrawlers opt in for analytics.
- We only want to support analytics on 1 year of data.
- Assume the average size of each data record is 100 KB.
Total size requirement = 1000 * 12 * 5000 * 100 KB = ~6 TB
Now we know roughly how many resources are needed, but it is equally important to discuss the type of system being designed. This is clearly a write-heavy system, as the different stages write data far more frequently than they read it.
Assume a write : read ratio of 10 : 1.
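As a sanity check, here is a minimal Python sketch that reproduces the arithmetic above. The constants are the same assumptions made in this section, not measured values.

```python
# Back-of-the-envelope storage estimation (same assumptions as above).
KB, MB, GB, TB = 1e3, 1e6, 1e9, 1e12   # decimal units, in bytes

crawlers_per_month = 2000
records_per_site = 5000
record_size = 100 * KB
months_in_5_years = 60

# Final data records retained for 5 years.
record_storage = months_in_5_years * crawlers_per_month * records_per_site * record_size
print(f"data records (5 years): {record_storage / TB:.0f} TB")    # ~60 TB

# Raw HTML pages kept for one week (5 MB per page, fetched daily).
page_storage = 7 * crawlers_per_month * records_per_site * 5 * MB
print(f"HTML pages (1 week):    {page_storage / TB:.0f} TB")      # ~350 TB

# SQL stats (~50 KB per run, 5 years) and NoSQL analytics (half the crawlers, 1 year).
sql_storage = months_in_5_years * crawlers_per_month * 50 * KB
nosql_storage = 1000 * 12 * records_per_site * 100 * KB
print(f"SQL metadata:           {sql_storage / GB:.0f} GB")       # ~6 GB
print(f"NoSQL analytics:        {nosql_storage / TB:.0f} TB")     # ~6 TB
```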
System Design Discussion
WebCrawler stages
Before we go any further, let's look at the stages of the webcrawler.
- URL Discovery Stage – This stage finds the URL links under the start URL, then pushes the discovered URLs to an async processing queue.
- Page Fetcher Stage – This stage pops URLs from the processing queue and fetches the corresponding pages.
- Data Extraction Stage – This stage extracts the desired data from the fetched pages.
- Duplication Removal Stage – This stage removes duplicate data records across the website.
- Analytics Stage – This stage pushes all the configured data to the NoSQL DB.
Evidently, of the stages above, the Analytics stage is optional; not every website's data needs to be analyzed. For instance, a Twitter webcrawler might subscribe to the Analytics stage, while a static-site webcrawler would not.
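To make the stage split concrete, here is a minimal Python sketch of the pipeline. The stage names follow the list above; the `CrawlerConfig` fields and the helper function are illustrative assumptions, not a fixed interface.

```python
from dataclasses import dataclass
from enum import Enum

class Stage(Enum):
    URL_DISCOVERY = "url_discovery"
    PAGE_FETCHER = "page_fetcher"
    DATA_EXTRACTION = "data_extraction"
    DUPLICATE_REMOVAL = "duplicate_removal"
    ANALYTICS = "analytics"          # optional, per-crawler opt-in

@dataclass
class CrawlerConfig:
    website_id: int
    start_url: str
    frequency: str                   # e.g. "daily", "weekly", "monthly"
    analytics_needed: bool = False

def stages_for(config: CrawlerConfig) -> list[Stage]:
    """Return the ordered stages a crawler run has to execute."""
    stages = [
        Stage.URL_DISCOVERY,
        Stage.PAGE_FETCHER,
        Stage.DATA_EXTRACTION,
        Stage.DUPLICATE_REMOVAL,
    ]
    if config.analytics_needed:
        stages.append(Stage.ANALYTICS)
    return stages
```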
SQL database Schema
The following data needs to be stored in the database:
[1] Details of each website's webcrawler.
[2] Run status of each webcrawler stage.
[3] Each webcrawler stage has to add its own details:
- The URL discovery stage needs to record how many URLs were discovered.
- The page fetcher stage needs to record how many HTML pages were fetched.
- The data extraction stage needs to record how many records were extracted.
- The duplicate removal stage needs to record how many unique records remained.
You might be wondering why all this data is needed.
- It addresses functional requirements, such as rerunning a webcrawler stage.
- Furthermore, this data gives us debugging inputs, for instance when there is a huge disparity between the stats of two consecutive stages.
Database tables schema
Webcrawler details
Website ID | Website Name | Date added | Frequency to crawl | Start URL | Analytics Needed |
Run Status
Website ID | Server Name | Status | TimeNow |
Discovery Stage details
Website ID | Total URLs discovered | TimeNow | ServerName | PathToURLsDiscovered |
Page Fetcher details
Website ID | Total pages fetched | TimeNow | ServerName | PathToWherePagesFetched |
Data Extraction stage
Website ID | Total records extracted | TimeNow | ServerName | PathToWhereDataExtracted |
Duplication Removal Stage
Website ID | Total records after duplicate removal | TimeNow | PathToWhereDataExtracted |
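Here is a minimal sketch of how a few of these tables could be created, using Python's built-in sqlite3 module purely for illustration. In production the DDL would target a replicated MySQL/PostgreSQL setup, and the exact column names and types are assumptions.

```python
import sqlite3

# Illustrative DDL for the tables above (sqlite3 is used only for the sketch;
# a production deployment would target a replicated SQL server).
conn = sqlite3.connect("crawler_meta.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS webcrawler_details (
    website_id        INTEGER PRIMARY KEY,
    website_name      TEXT,
    date_added        TEXT,
    crawl_frequency   TEXT,          -- daily / weekly / monthly
    start_url         TEXT,
    analytics_needed  INTEGER        -- 0 or 1
);

CREATE TABLE IF NOT EXISTS run_status (
    website_id   INTEGER REFERENCES webcrawler_details(website_id),
    server_name  TEXT,
    status       TEXT,               -- e.g. discovery_done, fetch_running
    updated_at   TEXT
);

CREATE TABLE IF NOT EXISTS discovery_stage (
    website_id            INTEGER REFERENCES webcrawler_details(website_id),
    total_urls_discovered INTEGER,
    updated_at            TEXT,
    server_name           TEXT,
    path_to_url_list      TEXT
);
-- The page fetcher, data extraction and duplicate removal tables follow the same shape.
""")
conn.commit()
```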
Let's discuss why we want to store the data for the initial 4 stages in a SQL database.
- Our system is write heavy: multiple servers run different stages at any given time, and each of them stores its data in the database.
- We want the data to be consistent, and the ACID properties of SQL come in handy in a distributed system.
- We have a limited amount of data to store and the tables are related, so joining multiple tables for certain queries is more convenient.
- The webcrawler stage schedulers need atomic, up-to-date data to schedule the next stages.
We also want to make sure there is no single point of failure, so a master-slave architecture with a sufficient replication factor should be employed.
Let's discuss why the Analytics stage should use a NoSQL DB rather than SQL.
- The data is huge, and it is independent of the other tables.
- Each website's webcrawler might add a different kind of data, so there is no common schema. It is pivotal that the DB supports a flexible schema.
- Data consistency is not a hard requirement; eventual consistency works, so the BASE model of a NoSQL database can do wonders.
- A NoSQL DB scales horizontally better as the data grows.
Now the question is which NoSQL type should be used here. Thanks to its flexible schema, a document-oriented NoSQL store that keeps the data as JSON is the right fit, for example MongoDB.
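Here is a minimal sketch of what the Analytics stage could push to MongoDB, assuming the pymongo driver. The connection string, database/collection names, and document fields are illustrative assumptions.

```python
from datetime import datetime, timezone
from pymongo import MongoClient

# Connection string and names are assumptions for this sketch.
client = MongoClient("mongodb://localhost:27017")
analytics = client["crawler_analytics"]["records"]

# Flexible schema: each website's crawler can attach whatever fields it needs.
analytics.insert_one({
    "website_id": 42,
    "run_date": datetime.now(timezone.utc),
    "record_count": 4873,
    "top_keywords": ["llm", "dataset", "tokenizer"],   # site-specific field
})

# Simple analytics query: the most recent runs for one site.
recent = analytics.find({"website_id": 42}).sort("run_date", -1).limit(12)
```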
What type of processing queue should the different webcrawler stages use?
The URL discovery and page fetcher stages, as well as the schedulers, need a distributed async queue:
- It should be robust.
- It should support creating a unique queue per webcrawler run, and pushing/popping data to and from that queue.
We can use Kafka or RabbitMQ. I would go with RabbitMQ because it is more generic: you can replace RabbitMQ with any other message broker that supports the AMQP protocol, whereas you cannot do the same with Kafka. Kafka scales better than RabbitMQ, but for our use case RabbitMQ should suffice.
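Here is a minimal sketch of the per-run URL queue using the pika client for RabbitMQ. The queue-naming convention and the local connection parameters are assumptions.

```python
import pika

# One queue per webcrawler run, e.g. "urls.<website_id>.<run_date>" (naming is an assumption).
queue_name = "urls.42.2024-01-01"

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue=queue_name, durable=True)   # durable: survive a broker restart

# URL Discovery stage: push a discovered URL.
channel.basic_publish(exchange="", routing_key=queue_name,
                      body="https://example.com/page/1")

# Page Fetcher stage: pop a URL and acknowledge it once the page has been stored.
method, properties, body = channel.basic_get(queue=queue_name)
if method is not None:
    url = body.decode()
    # ... fetch the page at `url` and store the raw HTML on disk ...
    channel.basic_ack(method.delivery_tag)

connection.close()
```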
How do users interact with the webcrawler architecture?
‘WebCrawler Management Tool’
The user interacts with the system via the “Web Crawler Management Tool”, which supports:
- CRUD operations for webcrawlers.
- Re-running a specific stage.
- Basic analytics for each webcrawler stage.
- Viewing the live status of the async processing queue.
All of this functionality can be accessed via REST APIs.
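Here is a minimal sketch of that REST surface using Flask. The routes, payload fields, and responses are assumptions meant to illustrate the shape of the API, not a finalized contract.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/crawlers", methods=["POST"])
def add_crawler():
    """CRUD: register a new website webcrawler."""
    cfg = request.get_json()   # e.g. {"website_name": ..., "start_url": ..., "frequency": "weekly"}
    # ... persist cfg into the webcrawler details table ...
    return jsonify({"status": "created", "crawler": cfg}), 201

@app.route("/crawlers/<int:website_id>/stages/<stage>/rerun", methods=["POST"])
def rerun_stage(website_id, stage):
    """Re-run a specific stage for a given webcrawler."""
    # ... enqueue the stage so the stage scheduler picks it up ...
    return jsonify({"website_id": website_id, "stage": stage, "status": "scheduled"})

@app.route("/crawlers/<int:website_id>/stats", methods=["GET"])
def stage_stats(website_id):
    """High-level stats per stage, read from the SQL tables."""
    # ... query the discovery / fetcher / extraction / dedup tables ...
    return jsonify({"website_id": website_id, "stats": {}})

if __name__ == "__main__":
    app.run(port=8080)
```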
‘Analytics Engine tool’
Users can also view analytics based on the records available for a given website. For illustration, it can support both basic and advanced queries.
Load balancing
We will use a software load balancer to forward each request to the least-loaded app server, which then runs a specific webcrawler stage. We can use NGINX, HAProxy, etc. The load balancer brings the following advantages:
- It increases security, as requests never hit the app servers directly.
- It acts as a reverse proxy.
- A rate limiter can be configured here.
- It distributes the load across multiple servers.
Communication Protocols
Let's also take a quick look at the communication protocols at play.
- HTTPS – between the user and the load balancer.
- AMQP – between the RabbitMQ clients (the webcrawler stages) and the RabbitMQ server.
- JDBC/ODBC – between the webcrawler stage clients and the SQL DB.
- MongoDB Wire Protocol – between the webcrawler Analytics stage and the MongoDB NoSQL DB.
Monolithic or Microservices Architecture?
A microservices architecture is tempting here, but it is much more complex to maintain, even though it comes with added benefits and lets agile development be leveraged to its full capacity. A monolith, on the other hand, is easier to maintain but slows down development.
So which one should we choose? It all depends on the kind of system being designed. In our case new features will be very limited, so a monolith is the better option. If that were not the case, it would make sense to consider microservices.
Architecture diagram for running multiple webcrawlers
The architecture diagram is fairly self-explanatory, as the numbers marked on it help explain the flow. Still, I want to put some weight on one point and discuss why two schedulers are needed. The reason is that there are two scheduling concerns to handle, and they are orthogonal in nature (a small sketch of both schedulers follows the list):
- Each webcrawler has a configured frequency (when it needs to run), and its URL Discovery stage needs to be scheduled accordingly.
- We also want to schedule the next stage for any crawler once its current stage has completed.
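To make these two concerns concrete, here is a minimal Python sketch of both scheduler loops. The polling approach and the helper functions (`load_crawler_configs`, `is_due`, `completed_stages`, `enqueue_stage`) are hypothetical placeholders for reads/writes against the SQL tables and the processing queue.

```python
import time

def frequency_scheduler(poll_seconds=3600):
    """Scheduler 1: starts the URL Discovery stage whenever a crawler's
    configured frequency (daily/weekly/monthly) says a new run is due."""
    while True:
        for crawler in load_crawler_configs():        # hypothetical: read webcrawler details table
            if is_due(crawler):                       # hypothetical: compare last run vs frequency
                enqueue_stage(crawler.website_id, "url_discovery")
        time.sleep(poll_seconds)

def stage_scheduler(poll_seconds=60):
    """Scheduler 2: watches the Run Status table and schedules the next
    stage once the current one reports completion."""
    next_stage = {
        "url_discovery": "page_fetcher",
        "page_fetcher": "data_extraction",
        "data_extraction": "duplicate_removal",
        "duplicate_removal": "analytics",             # only if the crawler opted in for analytics
    }
    while True:
        for website_id, finished in completed_stages():   # hypothetical: read Run Status table
            if finished in next_stage:
                enqueue_stage(website_id, next_stage[finished])
        time.sleep(poll_seconds)
```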
Conclusion
In this blog we discussed the system design of a webcrawler framework for large language models. This architecture is not limited to LLM data scraping; it can be applied and extended to other requirements as well. I hope this series helped give you an idea of how webcrawlers work, from the basics to an advanced level. See you in the next blog!