I am trying to scrape Goodreads (the end goal is to collect data on every book in a given genre and sort by rating), but I am having a hard time logging in. Logging in is necessary because the listing of books in each genre is restricted to logged-in users. For example, this is a link to page two of the site's fantasy book records, but if you open that link without being logged in, you are sent to page one of the fantasy book records (and there is no menu at the bottom to proceed to the next page).
I've been exploring answers on Stack Overflow about logging into websites and even tried fetching the authenticity token from the login form, but I still cannot log in. Here is my code:
import pprint, os, re, time
from bs4 import BeautifulSoup
import requests
# Create the output directory if it does not already exist
os.makedirs("Goodreads", exist_ok=True)
user = "myemail@gmail.com"
password = "mypassword"
# utf8 and n are other inputs I noticed while inspecting the login form, so I added them and their values in case they affected anything. They do not seem to.
payload = {
    'user[email]': user,
    'user[password]': password,
    'utf8': '✓',
    'n': '843936'
}
with requests.Session() as sess:
    # The code for this is found here https://stackoverflow.com/a/57231791/5395546
    # Load the sign-in page first so the session gets its cookies and the form token
    res = sess.get('https://www.goodreads.com/user/sign_in?source=home')
    signin = BeautifulSoup(res.content, 'html.parser')
    payload['authenticity_token'] = signin.find('input', attrs={'name': "authenticity_token", 'type': 'hidden'})['value']
    # Submit the login form with the credentials and the token
    res = sess.post('https://www.goodreads.com/user/sign_in?source=home', data=payload)
    print(res.text)
    # This section is to print out the titles of the books on the page. It is printing the titles of the books on page 1, not 2, so I know it's not signing in properly.
    r = sess.get('https://www.goodreads.com/shelf/show/fantasy?page=2')
    soup = BeautifulSoup(r.text, 'html.parser')
    results = soup.find_all("a", class_="bookTitle")
    print(results)
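Since I hard-coded `utf8` and `n` above, one thing I could try is to copy every hidden input from the sign-in page into the payload instead of guessing which ones matter. Below is a minimal sketch of that idea, not a confirmed fix. The names `HiddenInputCollector` and `hidden_fields` are hypothetical helpers I made up, the sample form is invented for illustration and is not the real Goodreads markup, and the sketch assumes the login form is plain server-rendered HTML. It uses only the standard library and collects hidden inputs from the whole page, not from one specific form.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Only <input> tags are of interest
        if tag != "input":
            return
        attr = dict(attrs)
        # Keep the input only if it is hidden and has a name
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return a dict of all hidden form fields found in the HTML string."""
    parser = HiddenInputCollector()
    parser.feed(html)
    parser.close()
    return parser.fields


# Invented sample form (assumption: NOT the real Goodreads markup)
sample = '''
<form action="/user/sign_in" method="post">
  <input type="hidden" name="authenticity_token" value="abc123">
  <input type="hidden" name="n" value="843936" />
  <input type="email" name="user[email]">
</form>
'''

print(hidden_fields(sample))
```

In my script this would be used as `payload.update(hidden_fields(res.text))` right after the first `sess.get(...)`, so the token and any other hidden values come from the page that was actually served in that session.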
I'm aware there's a Goodreads API, but it has been discontinued and they no longer give out developer keys, so unfortunately it's not an option.