I am trying to scrape Goodreads (the end goal is to collect data on every book in a given genre and sort by rating), but I am having a hard time logging in. Logging in is necessary because the full listing of books in each genre is restricted to logged-in users. For example, this is a link to page two of the site's fantasy book records, but if you open that link while logged out, you are sent to page one of the fantasy book records instead (and there is no menu at the bottom to proceed to the next page).

I've been going through answers on Stack Overflow about logging into websites and have even tried extracting the authenticity token from the sign-in page, but I still cannot log in. Here is my code:

import pprint, os, re, time
from bs4 import BeautifulSoup
import requests

# Create the output directory; exist_ok avoids an error if it already exists
os.makedirs("Goodreads", exist_ok=True)

user = "myemail@gmail.com"
password = "mypassword"

# utf8 and n are other inputs I noticed while inspecting the login form, so I added them with the values I saw in case they matter... they do not seem to
payload = {
    'user[email]': user,
    'user[password]': password,
    'utf8':'✓', 
    'n': '843936'
    }

with requests.Session() as sess:
    # The code for this is found here https://stackoverflow.com/a/57231791/5395546
    res = sess.get('https://www.goodreads.com/user/sign_in?source=home')
    signin = BeautifulSoup(res.content, 'html.parser')
    payload['authenticity_token'] = signin.find('input', attrs={'name':"authenticity_token", 'type':'hidden'})['value']
    res = sess.post('https://www.goodreads.com/user/sign_in?source=home', data=payload)
    print(res.text)

    # This section is to print out the titles of the books on the page. It is printing the titles of the books on page 1, not 2, so I know it's not signing in properly.
    r = sess.get('https://www.goodreads.com/shelf/show/fantasy?page=2')
    soup = BeautifulSoup(r.text, 'html.parser')
    results = soup.find_all("a", class_="bookTitle")
    print(results)

I'm aware there's a Goodreads API, but it has been discontinued and they don't give out developer keys anymore, so unfortunately it's not an option.

eyllanesc
Sarah Diri
  • You're using a constant value for 'n' in your payload. This value changes each time the sign-in page is loaded. You'll need to read it from the page and update your payload dictionary accordingly. – DarkKnight Aug 08 '21 at 06:58
  • oh my gosh I feel so silly. Figured out the authenticity token and didn't do the same for n. Thank you @DarkKnight – Sarah Diri Aug 08 '21 at 19:54
  • Where in the page do I find n? I managed to find the authenticity token at soup.find('meta', {'name': 'csrf-token'}), but I have no idea what n is referring to. – fluent Apr 07 '23 at 10:21
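
One way to act on the first comment is to stop hard-coding any hidden field and instead copy every hidden input from the freshly loaded sign-in page into the payload. The sketch below is only an illustration of that idea: the `HiddenInputCollector` class, the `hidden_fields` helper, and the `SAMPLE_HTML` markup are made up for this example, and the field names are taken from the question's payload. Whether the real sign-in page still exposes `n` as a hidden input is not confirmed here. It uses the standard-library parser so it runs without extra installs; the same filtering can be done with BeautifulSoup.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs from every <input type="hidden"> tag."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        # Keep only named inputs whose type is "hidden"
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return a dict of the hidden form fields found in an HTML string."""
    collector = HiddenInputCollector()
    collector.feed(html)
    collector.close()
    return collector.fields


# Hypothetical stand-in for the sign-in page; the real markup may differ.
SAMPLE_HTML = """
<form action="/user/sign_in" method="post">
  <input type="hidden" name="authenticity_token" value="abc123">
  <input type="hidden" name="n" value="555555">
  <input type="email" name="user[email]">
  <input type="password" name="user[password]">
</form>
"""

# Start from whatever the page supplied, then add the credentials.
payload = hidden_fields(SAMPLE_HTML)
payload["user[email]"] = "myemail@gmail.com"
payload["user[password]"] = "mypassword"
print(payload)
```

In the question's code, this would mean calling `hidden_fields(res.text)` on the response from `sess.get(...)` and posting the resulting dictionary in the same session, so that `n` and `authenticity_token` both come from the same page load.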

0 Answers