Web Scraping With Python Tutorial (2023)


One of the biggest revolutions of the 21st century is the realization of how valuable data can be. The great news is that the internet is full of great, public data for you to take advantage of, and that's exactly the purpose of web scraping: collecting this public data to bootstrap a newly founded business or a project.

In this practical introduction to web scraping in Python, we'll take a deep look at what exactly web scraping is, the technologies that power it, and some common challenges modern web scraping projects face.

For this we'll explore the entire web scraping with python process:
We'll start off by learning about HTTP and how to use HTTP clients in python to collect web page data. Then we'll take a look at parsing HTML page data using CSS and XPATH selectors. Finally, we'll build an example web scraper with Python for producthunt.com product data to solidify what we've learned.

What is Web Scraping?

Web scraping is essentially public data collection via an automated process. There are thousands of reasons why one might want to collect this public data, like finding potential employees or gathering competitive intelligence. We at ScrapFly did extensive research into web scraping applications, and you can find our findings on our Web Scraping Use Cases page.

To scrape a website with python we're generally dealing with two types of problems: collecting the public data available online and then parsing this data for structured product information. In this article, we'll take a look at both of these steps and solidify the knowledge with an example project.

Connection: HTTP Fundamentals

To collect data from a public resource, we need to establish a connection with it first. Most of the web is served over HTTP, which is rather simple: we (the client) send a request to the website (the server) for a specific document, and once the server processes our request it replies with the requested document - a very straightforward exchange!

[Illustration: the HTTP request/response exchange between client and server]

As you can see in this illustration: we send a request object which consists of a method (aka type), a location and headers. In turn, we receive a response object which consists of a status code, headers and the document content itself.
Let's take a quick look at each of these components, what they mean and how they are relevant in web scraping.

Understanding Requests and Responses

When it comes to web scraping, we don't exactly need to know every little detail about HTTP requests and responses; however, it's good to have a general overview and to know which parts of this protocol are especially useful in web scraping. Let's take a look at exactly that!

Request Method

HTTP requests are conveniently divided into a few types that perform distinct functions:

  • GET requests are intended to request a document.
  • POST requests are intended to request a document by sending a document.
  • HEAD requests are intended to request a document's meta information.
  • PATCH requests are intended to update a document.
  • PUT requests are intended to either create a new document or update it.
  • DELETE requests are intended to delete a document.

When it comes to web scraping, we are mostly interested in collecting documents, so we'll mostly be working with GET and POST type requests. Additionally, HEAD requests can be useful in web scraping to optimize bandwidth - sometimes, before downloading a document, we might want to check its metadata to see whether it's worth the effort.
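For example, here's a minimal sketch of such a metadata check using the httpx client (which we'll properly introduce later in this article); the httpbin.org URL is just a stand-in for a real document:

import httpx

# ask only for the document's metadata, not the document itself
response = httpx.head("https://httpbin.org/html")
print(response.status_code)                    # e.g. 200
print(response.headers.get("content-type"))    # e.g. text/html; charset=utf-8
print(response.headers.get("content-length"))  # document size in bytes, if the server provides it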

Request Location

To understand resource location, first we should take a quick look at URL's structure itself:

[Illustration: URL structure - protocol, host, resource path and parameters]

Here, we can visualize each part of a URL: we have the protocol, which when it comes to HTTP is either http or https. Then, we have the host, which is essentially the address of the server - either a domain name or an IP address. Finally, we have the location of the resource and some custom parameters.
If you're ever unsure of a URL's structure, you can always fire up Python and let it figure it out for you:

from urllib.parse import urlparse

urlparse("http://www.domain.com/path/to/resource?arg1=true&arg2=false")
> ParseResult(scheme='http', netloc='www.domain.com', path='/path/to/resource', params='', query='arg1=true&arg2=false', fragment='')

Request Headers

While it might appear like request headers are just minor metadata details, in web scraping they are extremely important! Headers contain essential details about the request, like: who's requesting the data? What type of data are they expecting? Getting these wrong might result in the web scraper being denied access.

Let's take a look at some of the most important headers and what they mean:

User-Agent is an identity header that tells the server who's requesting the document.

# example user agent for Chrome browser on Windows operating system:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36

Whenever you visit a web page in your web browser, it identifies itself with a User-Agent string that looks something like "Browser Name, Operating System, Some version numbers". This helps the server determine whether to serve or deny the client. In web scraping, of course, we don't want to be denied access, so we have to blend in by faking our user agent to look like that of a browser.

There are many online databases that contain the latest user-agent strings of various platforms, like the one provided by whatismybrowser.com.

Cookie is used to store persistent data. This is a vital feature for websites to keep track of user state: user logins, configuration preferences etc. Cookies are a bit out of scope of this article, but we'll be covering them in the future.

Accept headers (also Accept-Encoding, Accept-Language etc.) contain information about what sort of content we're expecting. Generally when web scraping we want to mimic the values of one of the popular web browsers; for example, the Chrome browser uses:

text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8

For all standard values see the content negotiation header list by MDN


X- prefixed headers are special custom headers. These are important to keep an eye on when web scraping, as they might configure important functionality of a website/webapp.

These are a few of the most important observations; for more, see the extensive full documentation over at https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers

Response Status Code

Conveniently, all HTTP responses come with a status code that indicates whether the request was a success, a failure, or whether some alternative action is requested (like a request to authenticate or to follow a redirect).
Let's take a quick look at the status codes that are most relevant to web scraping:

  • 200 range codes generally mean success!
  • 300 range codes tend to mean redirection - in other words, if we request content at /product1.html it might be moved to a new location like /products/1.html, and the server would inform us about that.
  • 400 range codes mean request is malformed or denied. Our web scraper could be missing some headers, cookies or authentication details.
  • 500 range codes typically mean server issues. The website might be unavailable right now or is purposefully disabling access to our web scraper.
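As a quick illustration, here's a hedged sketch of how a scraper might branch on these status codes with httpx (the client we use in the next chapter); the httpbin.org URL simply simulates a 403 response:

import httpx

response = httpx.get("https://httpbin.org/status/403")

if 200 <= response.status_code < 300:
    print("success - safe to parse response.text")
elif 300 <= response.status_code < 400:
    # the new location is advertised in the Location header
    print(f"redirected to {response.headers.get('location')}")
elif response.status_code in (401, 403, 429):
    print("denied or rate limited - we may need better headers, cookies or proxies")
else:
    try:
        response.raise_for_status()  # raises httpx.HTTPStatusError for remaining 4xx/5xx codes
    except httpx.HTTPStatusError as error:
        print(f"request failed: {error}")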

Response Headers

When it comes to web scraping, response headers provide some important information for connection functionality and efficiency.
For example, the Set-Cookie header requests our client to save some cookies for future requests, which might be vital for website functionality. Other headers, such as Etag and Last-Modified, are intended to help the client with caching to optimize resource usage.
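To illustrate the caching idea, here's a hedged sketch of a conditional request against httpbin.org's /etag test endpoint (assuming that endpoint behaves as documented): if the document hasn't changed, the server answers with 304 and no body, saving us a re-download.

import httpx

url = "https://httpbin.org/etag/my-test-etag"

with httpx.Client() as client:
    first = client.get(url)
    etag = first.headers.get("etag")

    # re-request the same document, but only if it changed since our last visit
    second = client.get(url, headers={"If-None-Match": etag})
    print(second.status_code)  # 304 means "not modified" - nothing new to download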

For all options see the standard HTTP response header list by MDN

Finally, just like with request headers, headers prefixed with X- are custom web functionality headers and depend on each individual website.

We've taken a brief look at the core HTTP components, and now it's time to see how HTTP works in practical Python!

HTTP Clients in Python

Before we start exploring HTTP connections in Python, we need to choose an HTTP client. Python comes with a built-in HTTP client called urllib, however for web scraping we need something more feature-rich and easier to handle, so let's take a look at popular community libraries.

The first thing to note about HTTP is that it has 3 distinct versions:

  • HTTP/1.1 - the simplest, text-based protocol, used widely by simpler programs. Implemented by urllib, requests, httpx, aiohttp
  • HTTP/2 - a more complex/efficient binary-based protocol, mostly used by web browsers. Implemented by httpx
  • HTTP/3 (QUIC) - the newest and most efficient version of the protocol, mostly used by web browsers. Implemented by aioquic, httpx (planned)

As you can see, Python has a very healthy HTTP client ecosystem. When it comes to web scraping, HTTP/1.1 is good enough for most cases, however HTTP/2 and HTTP/3 are very helpful for avoiding web scraper blocking.
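As a quick, hedged example of that, the httpx library (which we'll set up in a moment) can speak HTTP/2 once its optional extra is installed (pip install "httpx[http2]"); the target URL here is just illustrative:

import httpx

# http2=True enables HTTP/2 support (requires the httpx[http2] extra)
with httpx.Client(http2=True) as client:
    response = client.get("https://www.producthunt.com/")
    print(response.http_version)  # "HTTP/2" if the server supports it, otherwise "HTTP/1.1"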

We'll be sticking with httpx as it offers all the features required for web scraping. That being said, other HTTP clients like the requests library can be used almost interchangeably.

Let's see how we can utilize HTTP connections for scraping in Python. First, let's set up our working environment. We'll need Python version 3.7+ and the httpx library:

$ python --version
Python 3.7.4
$ pip install httpx

With httpx installed, we have everything we need to start connecting and receiving our documents. Let's give it a go!

Exploring HTTP with httpx

Now that we have a basic understanding of HTTP and our working environment is ready, let's see it in action!
In this section, we'll experiment with basic web-scraping scenarios to further understand HTTP in practice.

For our example case study, we'll be using the http://httpbin.org request testing service, which echoes back the requests we send so we can see exactly what the server receives.

GET Requests

Let's start off with GET type requests, which are the most common type of requests in web scraping.
To put it shortly, GET often simply means "give me the document located at this address": a GET https://www.httpbin.org/html request would be asking for the /html document from the httpbin.org server. Let's try it out on a real page:

import httpxresponse = httpx.get("https://www.producthunt.com/posts/evernote")html = response.textmetadata = response.headersprint(html)print(metadata)

Here, we perform the most basic GET request possible. However, just requesting the document often might not be enough. As we've explored before, requests are made of: request type, location, headers and optional content. So what are headers?

Request Metadata - Headers

We've already done a theoretical overview of request headers and since they're so important in web scraping let's take a look at how we can use them with our HTTP client:

import httpx

response = httpx.get('http://httpbin.org/headers')
print(response.text)

In this example we're using httpbin.org's testing endpoint for headers; it returns the HTTP inputs (headers, body) we sent as the response body. If we run this code without providing any specific headers, we can see that the client is generating some basic ones automatically:


{ "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate, br", "Host": "httpbin.org", "User-Agent": "python-httpx/0.19.0", }}

Even though we didn't explicitly provide any headers in our request, httpx generated the required basics for us. By using the headers argument, we can specify custom headers ourselves:

import httpx

response = httpx.get(
    'http://httpbin.org/headers',
    headers={"User-Agent": "ScrapFly's Web Scraping Tutorial"},
)
print(response.text)
# will print:
# {
#   "headers": {
#     "Accept": "*/*",
#     "Accept-Encoding": "gzip, deflate, br",
#     "Host": "httpbin.org",
#     "User-Agent": "ScrapFly's Web Scraping Tutorial",  # <-- we changed this!
#   }
# }

As you can see above, we used a custom User-Agent header for this request, while the other headers remain automatically generated by our client. We'll talk more about headers in the "Challenges" section below, but for now most minor web scraping can work well with the headers httpx generates for us.

POST Requests

As we've discovered, GET type requests just mean "get me that document", however sometimes that might not be enough information for the server to serve correct content.

POST requests, on the other hand, are the opposite - "take this document". Why would we want to give someone a document when web scraping? Some website operations require a complex set of parameters to process the request. For example, to render a search results page the website needs query parameters describing what to search for. So, as a web scraper we would send a document containing search parameters and in return we'd get a document containing search results.

Let's take a quick look at how we can use POST requests in httpx:

import httpxresponse = httpx.post("http://httpbin.org/post", json={"question": "Why is 6 afraid of 7?"})print(response.text)# will print:# {# ...# "data": "{\"question\": \"Why is 6 afraid of 7?\"}", # "headers": {# "Content-Type": "application/json", # ...# }, # }

As you can see, if we submit this request, the server will receive some JSON data and a Content-Type header indicating the type of this document (application/json). With this information, the server will do some thinking and return us a document in exchange. In this imaginary scenario, we submit a document with question data, and the server would return us the answer.
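Not every website expects JSON, though. Many search and login forms expect a classic URL-encoded form body instead, which httpx sends when we use the data argument rather than json. A small sketch against the same httpbin.org echo endpoint (the form fields are made up for illustration):

import httpx

# data= sends an application/x-www-form-urlencoded body - the format HTML forms use
response = httpx.post(
    "http://httpbin.org/post",
    data={"query": "note taking apps", "page": "1"},
)
print(response.json()["form"])
# {'page': '1', 'query': 'note taking apps'}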

Configuring Proxies

Making thousands of connections from a single address is an easy way to be identified as a web scraper, which might result in being blocked. Additionally, some websites are only available in certain regions of the world. This means we are at a great advantage if we can mask the origin of our connections by using a proxy.

Httpx supports extensive proxy options for both HTTP and SOCKS5 type proxies:

import httpxresponse = httpx.get( "http://httpbin.org/ip", # we can set proxy for all requests proxies = {"all://": "http://111.22.33.44:8500"}, # or we can set proxy for specific domains proxies = {"all://only-in-us.com": "http://us-proxy.com:8500"},)
Introduction To Proxies in Web Scraping
For more on proxies in web scraping see our full introduction tutorial which explains different proxy types and how to correctly manage them in web scraping projects.

Managing Cookies

Cookies are used to help the server track its clients. They enable persistent connection details such as login sessions or website preferences.
In web scraping, we can encounter websites that cannot function without cookies, so we must replicate them in our HTTP client connection. In httpx we can use the cookies argument:

import httpx

# we can either use dict objects
cookies = {"login-session": "12345"}
# or the more advanced httpx.Cookies manager:
cookies = httpx.Cookies()
cookies.set("login-session", "12345", domain="httpbin.org")

response = httpx.get('https://httpbin.org/cookies', cookies=cookies)

Putting It All Together

Now that we have briefly introduced ourselves to HTTP clients in Python, let's apply it in practice and scrape some stuff!
In this section, we have a short challenge: we have multiple URLs that we want to retrieve the HTML contents of. Let's see what sort of practical challenges we might encounter and how a real web scraping program functions.

import httpx

# here is a list of urls, in this example we'll just use some placeholders
urls = [
    "http://httpbin.org/html",
    "http://httpbin.org/html",
    "http://httpbin.org/html",
    "http://httpbin.org/html",
    "http://httpbin.org/html",
]
# as discussed in the headers chapter, we should always stick to browser-like headers
# for our requests to prevent being blocked
headers = {
    # lets use Chrome browser on Windows:
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
}
# since we have multiple urls to scrape, we should establish a persistent session
with httpx.Client(headers=headers) as session:
    for url in urls:
        response = session.get(url)
        html = response.text
        meta = response.headers
        print(html)
        print(meta)

As you can see, there's quite a bit going on here. Let's unpack the most important bits in greater detail:

Why are we using custom headers?
As we've discussed in the headers chapter, we must mask our web scraper to appear as a web browser to prevent being blocked. While httpbin.org is not blocking any requests, it's generally a good practice to set at least the User-Agent and Accept headers when web scraping public targets.

What is httpx.Client?
We could skip it and call httpx.get() for each url instead:

for url in urls:
    response = httpx.get(url, headers=headers)

# vs

with httpx.Client(headers=headers) as session:
    for url in urls:
        response = session.get(url)

However, as we've covered earlier, standalone requests are not persistent - meaning every time we call httpx.get() we establish a new connection with the server and only then exchange our request/response objects. To optimize this exchange we can establish a session, which is usually referred to as Connection Pooling or HTTP persistent connections.

In other words, the session will establish a connection only once and keep reusing it to exchange our requests until we close it. Using sessions not only optimizes our code, but also provides convenient shortcuts, like setting global headers and handling cookies and redirects automatically.

We've got a good grip on HTTP so now, let's take a look at the second part of web scraping: parsing!

Parsing HTML Content

HTML (HyperText Markup Language) is a text data structure that powers the web. The great thing about HTML structure is that it's intended to be machine-readable text content, which is great news for web-scraping as we can easily parse the data with code!

HTML is a tree-type structure that lends itself easily to parsing. For example, let's take this simple HTML content:

<head>
  <title>My Website</title>
</head>
<body>
  <h1>Welcome to my website!</h1>
  <div class="content">
    <p>This is my website</p>
    <p>Isn't it great?</p>
  </div>
</body>

Here we see a basic HTML document that a simple website might serve. You can already see the tree like structure just by indentation of the text, but we can even go further and illustrate it:

[Illustration: the same HTML document visualized as a tree of nodes]

This tree structure is brilliant for web scraping as we can easily navigate the whole document.
For example, to find the title of the website, we can see that it's under the <title> node, which is under the <head> node. In other words - if we wanted to extract the titles of 1000 different pages, we would write a rule to find head->title->text for every one of them.

When it comes to HTML parsing using path instructions, there are two standard ways to approach this: CSS selectors and XPATH selectors - let's take a look at them.

Using CSS and XPATH Selectors

There are two HTML parsing standards:

  • CSS selectors - simpler, more brief, less powerful
  • XPATH selectors - more complex, longer, very powerful

Generally, modern websites can be parsed with CSS selectors alone, however sometimes HTML structure can be so complex that having that extra XPATH power makes things much easier. We'll be mixing both - we'll stick with CSS where we can and fall back to XPATH otherwise.

Parsing HTML with CSS Selectors
For more on CSS selectors see our complete introduction tutorial which covers basic usage, tips and tricks and common web scraping idioms
Parsing HTML with Xpath
For more on XPATH selectors see our complete introduction tutorial which covers basic usage, tips and tricks and common web scraping idioms

Since Python's standard library has no convenient HTML parser for this kind of work, we must choose a library which provides such capability. In Python, there are several options, but the two biggest libraries are beautifulsoup (beautifulsoup4) and parsel.


We'll be using the parsel HTML parsing package in this chapter, but since CSS and XPATH selectors are the de facto standard ways of parsing HTML, we can easily apply the same knowledge to the beautifulsoup library as well as HTML parsing libraries in other programming languages.

Let's install the python library parsel and do a quick introduction:

$ pip install parsel
$ pip show parsel
Name: parsel
Version: 1.6.0
...

For more on parsel see official documentation

Now with our package installed, let's give it a spin with this imaginary HTML content:

# for this example we're using a simple website page
HTML = """
<head>
    <title>My Website</title>
</head>
<body>
    <div class="content">
        <h1>First blog post</h1>
        <p>Just started this blog!</p>
        <a href="http://github.com/scrapfly">Checkout My Github</a>
        <a href="http://twitter.com/scrapfly_dev">Checkout My Twitter</a>
    </div>
</body>
"""
from parsel import Selector

# first we must build a parsable tree object from the HTML text string
tree = Selector(HTML)
# once we have the tree object we can start executing our selectors
# we can use CSS selectors:
github_link = tree.css('.content a::attr(href)').get()
# we can also use XPATH selectors:
twitter_link = tree.xpath('//a[contains(@href,"twitter.com")]/@href').get()
title = tree.css('title').get()
article_text = ''.join(tree.css('.content ::text').getall()).strip()

print(title)
print(github_link)
print(twitter_link)
print(article_text)
# will print:
# <title>My Website</title>
# http://github.com/scrapfly
# http://twitter.com/scrapfly_dev
# First blog post
# Just started this blog!
# Checkout My Github
# Checkout My Twitter

In this example we used parsel package to create a parse tree from existing HTML text. Then, we used CSS and XPATH selector functions of this parse tree to extract title, Github link, Twitter link and the article's text.

Earlier we covered how to download HTML documents using the httpx client, and in this section we figured out how to use CSS and XPATH selectors to parse HTML data using the parsel package. Now let's put all of this together and write a small scraper!

In this section we'll be scraping https://www.producthunt.com/ which is essentially a technical product directory where people submit and discuss new digital products.

Let's start with the scraper's source code:

import httpx
import json
from parsel import Selector

DEFAULT_HEADERS = {
    # lets use Chrome browser on Windows:
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
}

def parse_product(response):
    tree = Selector(response.text)
    return {
        "url": str(response.url),
        "name": tree.css('h1 ::text').get(),
        "subtitle": tree.css('h2 ::text').get(),
        # votes are located under a <span> which contains bigButtonCount in its class names
        "votes": tree.css("span[class*='bigButtonCount']::text").get(),
        # tags are our most complex location:
        # tag links are under a div which contains the topicPriceWrap class
        # and tag links are only valid if they have /topics/ in them
        "tags": tree.xpath(
            "//div[contains(@class,'topicPriceWrap')]"
            "//a[contains(@href, '/topics/')]/text()"
        ).getall(),
    }

def scrape_products(urls):
    results = []
    with httpx.Client(headers=DEFAULT_HEADERS) as session:
        for url in urls:
            response = session.get(url)
            results.append(parse_product(response))
    return results

if __name__ == '__main__':
    results = scrape_products([
        "https://www.producthunt.com/posts/notion-8",
        "https://www.producthunt.com/posts/obsidian-4",
        "https://www.producthunt.com/posts/evernote",
    ])
    print(json.dumps(results, indent=2))

In this little scraper we provide a list of producthunt.com product urls and have our scraper collect and parse basic product data from each one of them.

Scraper results example
[ { "url": "https://www.producthunt.com/posts/notion-8", "name": "Notion", "subtitle": "Artificial intelligence-powered email.", "tags": [ "Android", "iPhone", "Email" ], "votes": "0,650" }, { "url": "https://www.producthunt.com/posts/obsidian-4", "name": "Obsidian", "subtitle": "A powerful knowledge base that works on local Markdown files", "tags": [ "Productivity", "Note" ], "votes": "0,706" }, { "url": "https://www.producthunt.com/posts/evernote", "name": "Evernote", "subtitle": "Note taking made easy", "tags": [ "Android", "iPhone", "iPad" ], "votes": "299" }]

Thanks to Python's rich ecosystem, we've accomplished this single page scraper in just a few dozen lines of code - awesome!

Further, let's modify our script, so it finds product urls by itself by scraping producthunt.com topics. For example, /topics/productivity contains a list of products that are intended to boost digital productivity:

from urllib.parse import urljoin

def parse_topic(response):
    tree = Selector(text=response.text)
    # get relative product urls:
    urls = tree.xpath("//li[contains(@class,'styles_item')]/a/@href").getall()
    # turn relative urls into absolute urls and return them
    return [urljoin(str(response.url), url) for url in urls]

def scrape_topic(topic):
    with httpx.Client(headers=DEFAULT_HEADERS) as session:
        response = session.get(f"https://www.producthunt.com/topics/{topic}")
        return parse_topic(response)

if __name__ == '__main__':
    urls = scrape_topic("productivity")
    results = scrape_products(urls)
    print(json.dumps(results, indent=2))

Now we have a full scraping loop: we retrieve product urls from a directory page and then scrape each of them individually!

We could further improve this scraper with paging support (currently we're only scraping the first page of each topic), error and failure handling, as well as some tests. That being said, this is a good entry point into the web scraping world, as we've tried out many things covered in this article, like header faking, using client sessions and parsing HTML with CSS/XPATH selectors.
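As a taste of what such error handling might look like, here's a hedged sketch of a hypothetical get_with_retries() helper that could wrap the session.get() calls in our scraper; the retry count and the caught exceptions are illustrative choices, not part of the original scraper:

import httpx

def get_with_retries(session: httpx.Client, url: str, retries: int = 3) -> httpx.Response:
    """Retry a request a few times before giving up - a common web scraping safeguard."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            response = session.get(url)
            response.raise_for_status()  # treat 4xx/5xx responses as failures too
            return response
        except (httpx.TransportError, httpx.HTTPStatusError) as error:
            last_error = error
            print(f"attempt {attempt} for {url} failed: {error}")
    raise last_error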

Challenges

When it comes to web scraping challenges, we can put them into a few distinct categories:

Dynamic Content

In this article we used HTTP clients to retrieve data, however our Python environment is not a web browser and it can't execute the complex JavaScript-powered behavior some websites use. The most common example of this is dynamic data loading, where the page URL doesn't change but clicking a button changes some data on the page. To scrape this we either need to reverse engineer the website's JavaScript behavior or use web browser automation with headless browsers.
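To give a flavour of the browser automation route, here's a minimal, hedged sketch using Playwright (assuming it's installed via pip install playwright and playwright install); the topic URL is just an example target:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.producthunt.com/topics/productivity")
    # the browser executes the page's javascript for us,
    # so page.content() returns the fully rendered HTML
    html = page.content()
    browser.close()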

Scraping Dynamic Websites Using Browser
For browser usage in web scraping see our full introduction article which covers the most popular tools Selenium, Puppeteer and Playwright

Connection Blocking

Unfortunately, not every website tolerates web scraping, and many block scrapers outright. To avoid this, we need to ensure that our web scraper looks and behaves like a web browser user. We've taken a look at using web browser headers to accomplish this, but there's much more to it.

Parsing Content

Even though HTML content is machine parsable, many website developers don't create it with this intention. So, we might encounter HTML files that are really difficult to digest. XPATH and CSS selectors are really powerful, and combined with regular expressions or natural language parsing we can confidently extract any data an HTML page could present. If you're stuck with parsing, we highly recommend the #xpath and #css-selectors tags on stackoverflow.
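For example, here's a small, self-contained sketch of combining a CSS selector with a regular expression to pull a clean price value out of a messy text node (the HTML snippet is made up for illustration):

import re
from parsel import Selector

HTML = '<div class="price">Price: $1,299.00 (incl. VAT)</div>'
tree = Selector(HTML)

# the css selector narrows the search down to the right text node,
# then the regex extracts just the numeric value
price_text = tree.css(".price::text").get()
price = re.search(r"\$([\d,.]+)", price_text).group(1)
print(price)  # 1,299.00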

Web Scraper Scaling

There's a lot of data online, and while scraping a few documents is easy, scaling that to thousands or millions of HTTP requests and documents can quickly introduce a lot of challenges, ranging from web scraper blocking to handling multiple concurrent connections.

For bigger scrapers we highly recommend taking advantage of Python's asynchronous ecosystem. Since HTTP connections involve a lot of waiting, async programming allows us to schedule and handle multiple connections concurrently. For example, in httpx we can manage both synchronous and asynchronous connections:

import httpx
import asyncio
from time import time

urls_20 = [f"http://httpbin.org/links/20/{i}" for i in range(20)]

def scrape_sync():
    _start = time()
    with httpx.Client() as session:
        for url in urls_20:
            session.get(url)
    return time() - _start

async def scrape_async():
    _start = time()
    async with httpx.AsyncClient() as session:
        await asyncio.gather(*[session.get(url) for url in urls_20])
    return time() - _start

if __name__ == "__main__":
    print(f"sync code finished in: {scrape_sync():.2f} seconds")
    print(f"async code finished in: {asyncio.run(scrape_async()):.2f} seconds")

Here we have two functions that scrape 20 urls: one synchronous and one taking advantage of asyncio's concurrency. If we run them, we can see a drastic speed difference:

sync code finished in: 7.58 seconds
async code finished in: 0.89 seconds

Fortunately, the web scraping community is pretty big and can often help solve these issues.

We at ScrapFly have years of experience with these issues and have worked hard to provide a one-size-fits-all solution via our ScrapFly API, where many of these challenges are solved automatically!


ScrapFly

Here at ScrapFly we recognize the difficulties of web scraping and came up with an API solution that solves these issues for our users.
ScrapFly is essentially an intelligent middleware that sits between your scraper and your target. Your scraper, instead of connecting to the target itself, requests the ScrapFly API to do it and return the end result.

[Illustration: ScrapFly middleware sitting between the scraper and its target]

This abstraction layer can greatly increase performance and reduce the complexity of many web-scrapers by offloading common web scraping issues away from the scraper code!

Let's take a look at how our example scraper would look in the ScrapFly SDK. We can install the ScrapFly SDK using pip: pip install scrapfly-sdk, and the usage is similar to that of a regular HTTP client library:

from scrapfly import ScrapflyClient, ScrapeConfig

urls = [
    "http://httpbin.org/html",
    "http://httpbin.org/html",
    "http://httpbin.org/html",
    "http://httpbin.org/html",
    "http://httpbin.org/html",
]
with ScrapflyClient(key='<YOUR KEY>') as client:
    for url in urls:
        response = client.scrape(ScrapeConfig(url=url))
        html = response.scrape_result['content']

As you can see, our code with ScrapFly looks almost the same, except we get rid of a lot of complexity such as faking our headers as we did in our httpx-based scraper - ScrapFly does all of this automatically!
We can even go further and enable a lot of optional features:

javascript rendering - use ScrapFly's automated browsers to render websites powered by javascript

This can be enabled by the render_js=True option:

from scrapfly import ScrapflyClient, ScrapeConfigurl = "https://quotes.toscrape.com/js/page/2/" with ScrapflyClient(key='<YOUR KEY>') as client: response = client.scrape( ScrapeConfig( url=url render_js=True # ^^^^^^^ enabled ) ) html = response.scrape_result['content']

smart proxies - use ScrapFly's 190M proxy pool to scrape hard to access websites

All ScrapFly requests go through a proxy, but we can further extend that by selecting different proxy types and proxy locations:

from scrapfly import ScrapflyClient, ScrapeConfigurl = "https://quotes.toscrape.com/js/page/2/" with ScrapflyClient(key='<YOUR KEY>') as client: response = client.scrape( ScrapeConfig( url=url # see https://scrapfly.io/dashboard/proxy for available proxy pools proxy_pool='public_mobile_pool', # use mobile proxies country='US', # use proxies located in the United States ) ) html = response.scrape_result['content']

anti scraping protection bypass - scrape anti-scraping service protected websites

This can be enabled by the asp=True option:

from scrapfly import ScrapflyClient, ScrapeConfigurl = "https://quotes.toscrape.com/js/page/2/" with ScrapflyClient(key='<YOUR KEY>') as client: response = client.scrape( ScrapeConfig( url=url # enable anti-scraping protection bypass asp=True ) ) html = response.scrape_result['content']

Scraping Frameworks: Scrapy

In this article we've covered hands-on web scraping, however when scaling to hundreds of thousands of requests, reinventing the wheel can be a suboptimal and painful experience. For this, it might be worth taking a look at web scraping frameworks like Scrapy, which is a convenient abstraction layer around everything we've learned today and more!

Web Scraping With Scrapy
For more on scrapy see our full introduction article which covers introduction, best practices, tips and tricks and an example project!

Scrapy implements a lot of shortcuts and optimizations that otherwise would be difficult to implement by hand, such as request concurrency, retry logic and countless community extensions for handling various niche cases.

ScrapFly's python-sdk package implements all of ScrapFly's powerful features into Scrapy's API:

# /spiders/scrapfly.py
from scrapfly import ScrapeConfig
from scrapfly.scrapy import ScrapflyMiddleware, ScrapflyScrapyRequest, ScrapflySpider, ScrapflyScrapyResponse

class ScrapFlySpider(ScrapflySpider):
    name = 'scrapfly'
    start_urls = [ScrapeConfig(url='https://www.example.com')]

    def parse(self, response: ScrapflyScrapyResponse):
        yield ScrapflyScrapyRequest(
            scrape_config=ScrapeConfig(
                url=response.urljoin(url),
                # we can enable javascript rendering via browser automation
                render_js=True,
                # we can get around anti bot protection
                asp=True,
                # specific proxy country
                country='us',
                # change proxy type to mobile proxies
                proxy_pool="public_mobile_pool",
            ),
            callback=self.parse_report,
        )

# settings.py
SCRAPFLY_API_KEY = 'YOUR API KEY'
CONCURRENT_REQUESTS = 2

FAQ

We've covered a lot in this article but web scraping is such a vast subject that we just can't fit everything into a single article. However, we can answer some frequently asked questions people have about web scraping in Python:

Is Python Good for Web Scraping?

Building a web scraper in Python is quite easy! Unsurprisingly, it's by far the most popular language used in web scraping.
Python is an easy yet powerful language with rich ecosystems in the data parsing and HTTP connection areas. Since web scraping at scale is mostly IO-bound (waiting for connections to complete takes up most of the program's runtime), Python performs exceptionally well as it supports the asynchronous code paradigm natively! So, Python for web scraping is fast, accessible and has a huge community.

What is the best HTTP client library for Python?

Currently, the best option for web scraping in our opinion is the httpx library as it supports synchronous and asynchronous python as well as being easy to configure for avoiding web scraper blocking. Alternatively, the requests library is a good choice for beginners as it has the easiest API.

How to speed up python web scraping?

The easiest way to speed up web scraping in Python is to use an asynchronous HTTP client such as httpx and use asynchronous functions (coroutines) for all HTTP-connection-related code.

How to prevent python web scraping blocking?

One of the most common challenges when using Python to scrape a website is blocking. This happens because scrapers inherently behave differently from a web browser, so they can be detected and blocked.
The goal is to ensure that HTTP connections from a Python web scraper look similar to those of a web browser like Chrome or Firefox. This involves all connection aspects: using HTTP2 instead of HTTP1.1, using the same headers as the web browser, treating cookies the same way a browser does, etc. For more see our How to Scrape Without Getting Blocked Tutorial.

Why can't my scraper see the data my browser does?

When we're using HTTP clients like requests, httpx etc., we scrape only the raw page source, which often looks different from the page source in the browser. This is because the browser runs all the javascript present in the page, which can change it. Our Python scraper has no javascript capabilities, so we either need to reverse engineer the javascript code or control a web browser instance. See our article on scraping dynamic websites for more.

What are the best tools used in web scraper development?

There are a lot of great tools out there, though when it comes to the best web scraping tools in Python, the most important one must be the web browser's developer tools. This suite of tools can be accessed in the majority of web browsers (Chrome, Firefox, Safari via the F12 key or right click "inspect element").
This toolset is vital for understanding how the website works. It allows us to inspect the HTML tree, test our xpath/css selectors as well as track network activity - all of which are brilliant tools for developing web scrapers.

We recommend getting familiar with these tools by reading official documentation page.

Summary

In this python web scraping tutorial we've covered the basics of everything you need to know to start web scraping in Python.

We've introduced ourselves to the HTTP protocol, which is the backbone of all internet connections. We explored GET and POST requests, and the importance of request headers.
Then, we took a look at HTML parsing: using CSS and XPATH selectors to parse data from raw HTML content.
Finally, we solidified this knowledge with an example project where we scraped product details from producthunt.com.

This web scraping tutorial should start you on the right path, but it's just the tip of the web scraping iceberg! Check out ScrapFly API for dealing with advanced web scraping challenges like scaling and blocking.


