How to learn web scraping without knowledge of page structure

yq19820430_c2c34eb693d991fc · August 2, 2021, 2:23pm

I already know how to implement the web spider part. What I want to learn is how to determine a web page’s relevancy without knowing anything about the page’s structure. I have researched web scraping techniques, but they all seem to assume knowledge of the page’s HTML tag structure. Is there an algorithm that would let me pull data from a page and determine its relevancy?

Any pointers would be greatly appreciated. I am using Python with urllib and BeautifulSoup.

Basically, I’m trying to write a Python script that, given a few keywords, will crawl web pages until it finds the data I need. For example, say I want to find a list of venomous snakes that live in the US. I might run my script with the keywords list,venomous,snakes,US, and I want to be able to trust with at least 80% certainty that it will return a list of snakes in the US.
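
To make the idea concrete, here is a rough sketch of one structure-agnostic approach: strip the markup, keep only the visible text, and score the page by how many of the keywords appear in it. The names visible_text and relevance_score are hypothetical, made up for this illustration. The sketch uses only the standard library's html.parser so it runs without extra installs; BeautifulSoup's get_text() could do the text-extraction step instead. Note that the score is plain keyword coverage, not a calibrated probability, so it does not by itself give the 80% certainty mentioned above.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html):
    """Return the page's text with all tags removed (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def relevance_score(html, keywords):
    """Fraction of keywords found at least once in the visible text.

    Assumption: exact, case-insensitive word matches are good enough.
    A real crawler would likely want stemming, synonyms, or TF-IDF.
    """
    wanted = [k.lower() for k in keywords]
    if not wanted:
        return 0.0
    words = Counter(re.findall(r"[a-z0-9]+", visible_text(html).lower()))
    hits = sum(1 for k in wanted if words[k] > 0)
    return hits / len(wanted)
```

A crawler could call relevance_score on each fetched page and keep only pages above some threshold, for example 0.75. The threshold is a guess that would need tuning against pages you have judged by hand.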