Scrapy Tips from the Pros: Part 1 | The Scrapinghub Blog

Use Extruct to Extract Microdata from Websites

I am sure every developer of web crawlers has had reason to curse web developers who use messy layouts for their websites. Websites with no semantic markup, especially those built on HTML tables, are the absolute worst. They make scraping much harder because there are few or no clues about what each element means. Sometimes you even have to trust that the order of the elements on each page will stay the same to grab the data you need.

That is why we are so grateful for Schema.org, a collaborative effort to bring semantic markup to the web. The project provides web developers with schemas to represent a range of different objects on their websites, including Person, Product, and Review, using metadata formats such as Microdata, RDFa, and JSON-LD. It makes the job of search engines easier because they can extract useful information from websites without having to dig into the HTML structure of every site they crawl.

For example, AggregateRating is a schema used by online retailers to represent user ratings for their products. Here’s the markup that describes user ratings for a product in an online store using the Microdata format:
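
The snippet below is an illustrative sketch, not markup taken from any real store: the product name, the rating values, and the `ItempropCollector` and `extract_itemprops` helpers are all assumptions made for this example. It embeds a small AggregateRating fragment in Microdata and reads the `itemprop` values with Python's standard-library `html.parser`, to show how semantic attributes replace guessing by element position.

```python
from html.parser import HTMLParser

# Hypothetical product fragment; the name and numbers are invented for illustration.
HTML = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Example Widget</span>
  <div itemprop="aggregateRating" itemscope
       itemtype="http://schema.org/AggregateRating">
    Rated <span itemprop="ratingValue">4.4</span> out of 5
    based on <span itemprop="reviewCount">89</span> reviews
  </div>
</div>
"""


class ItempropCollector(HTMLParser):
    """Collect the text of elements that carry an itemprop attribute.

    Simplification: nesting (itemscope) is ignored, so every property
    lands in one flat dict. A real Microdata parser builds nested items.
    """

    def __init__(self):
        super().__init__()
        self.props = {}
        self._current = None  # itemprop whose text is currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Skip elements that open a nested item; keep only leaf properties.
        if "itemprop" in attrs and "itemscope" not in attrs:
            self._current = attrs["itemprop"]
            self.props[self._current] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.props[self._current] += data

    def handle_endtag(self, tag):
        # Any closing tag ends the property being read.
        self._current = None


def extract_itemprops(html):
    """Return a flat {itemprop: text} dict for the given HTML string."""
    parser = ItempropCollector()
    parser.feed(html)
    return {name: text.strip() for name, text in parser.props.items()}


ratings = extract_itemprops(HTML)
print(ratings)
```

Because this sketch flattens nested items and only handles text content, it is meant to show the idea rather than to be used in a crawler; a dedicated Microdata parser such as Extruct is the better choice in practice.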

 


 

Raony Guimaraes