14

I am coding a web scraper for a website with the following Python code:

import requests

def scrape(url):
    req = requests.get(url)
    with open('out.html', 'w') as f:
        f.write(req.text)

It works a few times, but then the website starts returning an error HTML page (and when I open the site in my browser, I have to complete a CAPTCHA).

Is there a way to avoid this “ban”, for example by changing the IP address?

  • IF the ban is based on your IP address, then yes, changing the IP address might resolve it, but that's not something Python has control over. – Danielle M. Jan 25 '18 at 16:32
  • Well, to change the IP address you can run the exact same code on a different system with a different IP address... – Peteris Jan 25 '18 at 16:32
  • Have you considered that if you are getting `banned` and there's a captcha, maybe the owners of the site don't want you to scrape it? – MooingRawr Jan 25 '18 at 16:35
  • Why don't you try with a proxy? – t.m.adam Jan 25 '18 at 16:40

4 Answers

24

As already mentioned in the comments and by yourself, changing the IP could help. An easy way to do this is to have a look at vpngate.py:

https://gist.github.com/Lazza/bbc15561b65c16db8ca8

A how-to is provided at the link.

Rend
  • This is a great option; I will try to modify the code to make it easier to use as a function. –  Jan 25 '18 at 16:49
14

You can use a proxy with the requests library. You can find free proxies on sites such as https://www.sslproxies.org/ and http://free-proxy.cz/en/proxylist/country/US/https/uptime/level3, but not all of them work, and they should not be trusted with sensitive information.

example:

proxy = {
    "https": 'https://158.177.252.170:3128',
    "http": 'http://158.177.252.170:3128'
}
response = requests.get('https://httpbin.org/ip', proxies=proxy)
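Free proxies die frequently, so in practice you keep a list and rotate through it, retrying on failure. A minimal sketch of that idea (the addresses below are placeholders, not live proxies):

```python
import random
import requests

# Hypothetical proxy list: these addresses are placeholders, not verified proxies.
PROXIES = [
    "158.177.252.170:3128",
    "103.155.54.26:83",
]

def random_proxy(proxies=PROXIES):
    """Pick one address at random and build the dict requests expects."""
    addr = random.choice(proxies)
    return {"http": f"http://{addr}", "https": f"http://{addr}"}

def get_with_rotation(url, proxies=PROXIES, attempts=3):
    """Retry the request through different random proxies; free proxies fail often."""
    last_error = None
    for _ in range(attempts):
        try:
            return requests.get(url, proxies=random_proxy(proxies), timeout=10)
        except requests.RequestException as exc:
            last_error = exc
    raise last_error
```

Selecting a fresh proxy per request (as the comment below describes) spreads traffic across many IPs, so no single one trips the rate limit.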
AJ Bensman
  • This was so useful. What I did was use the export ip:port function of the free-proxy website to generate a list of over 100. I randomly selected from that list every time I made an API query and completely avoided the rate limit! – William Gerecke Jul 28 '20 at 20:50
9

I recently answered this on another question here, but using the requests-ip-rotator library to rotate IPs through API gateway is usually the most effective way.
It's free for the first million requests per region, and it means you won't have to give your data to unreliable proxy sites.
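For completeness, a sketch of how the library is typically used (assuming `pip install requests-ip-rotator` and AWS credentials configured locally; `site_root` and the URL are placeholders):

```python
import requests

def fetch_with_rotation(url, site_root):
    """Route requests through AWS API Gateway so each request exits from a new IP.

    Sketch only: requires the requests-ip-rotator package and working AWS
    credentials; gateways beyond the free tier incur AWS charges.
    """
    from requests_ip_rotator import ApiGateway  # lazy import: optional dependency

    gateway = ApiGateway(site_root)
    gateway.start()
    try:
        session = requests.Session()
        session.mount(site_root, gateway)  # only URLs under site_root are rerouted
        return session.get(url)
    finally:
        gateway.shutdown()  # delete the gateways so they stop accruing cost
```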

George
0

Late answer; I found this while looking for IP-spoofing, but to the OP's question: as some comments point out, you may or may not actually be getting banned. Here are two things to consider:

  1. A soft ban: they don't like bots. A simple solution that has worked for me in the past is to add headers so the site thinks you're a browser, e.g.,

    req = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})

  2. On-page active elements, scripts, or popups that act as content gates, not a ban per se - e.g., country/language selectors, cookie configuration, surveys, etc. requiring user input. The not-as-simple solution: use a webdriver like Selenium + chromedriver to render the page (including JS) and then add "user" clicks to deal with those elements.
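A minimal sketch of the webdriver approach (assumes `pip install selenium` with a matching chromedriver on PATH; the CSS selector in the comment is hypothetical and depends on the site):

```python
def scrape_rendered(url, out_path="out.html"):
    """Render the page with a real browser engine so JS-driven gates actually run.

    Sketch only: requires the selenium package and a chromedriver install.
    """
    from selenium import webdriver  # lazy import: optional dependency
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # run without a visible browser window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Example of a "user" click to dismiss a content gate; the selector
        # is hypothetical - inspect the target page to find the real one:
        # driver.find_element(By.CSS_SELECTOR, "button.accept-cookies").click()
        with open(out_path, "w", encoding="utf-8") as f:
            f.write(driver.page_source)  # the DOM after JS has run
    finally:
        driver.quit()
```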

DLuber
  • Oh, and if you're hitting the same site repeatedly, it's best to use sessions: `sesh = requests.Session()` then `req = sesh.get(url, headers={'User-Agent': 'Mozilla/5.0'})` – DLuber Jan 20 '22 at 21:49