I am trying to scrape a forum, but I can't resolve the login part
Context about the forum to scrape
The part I want to scrape is only available to logged-in users. The forum seems to be phpBB. I can't give you the link because it is local.
Attempt to log in using requests
I have tried to authenticate with the following code:
import requests
from requests.auth import HTTPBasicAuth
url = 'http://forum.com'  # placeholder, not the real URL
pet = requests.get(url, auth=HTTPBasicAuth('user', 'passw'), verify=False)
and also:
from requests.auth import HTTPDigestAuth
pet = requests.get(url, auth=HTTPDigestAuth('user', 'pass'), verify=False)
Parsing HTML to extract content
For information extraction I use BeautifulSoup:
from bs4 import BeautifulSoup
soup = BeautifulSoup(pet.content, 'html.parser')  # name the parser explicitly
print(soup.prettify())
Problem
When I run this code, it returns the forum page as seen by a user who is not logged in, so the authentication apparently has no effect.
I pass verify=False because otherwise an SSL error is raised.
How can I log in and then scrape the protected pages? I would prefer a solution that uses the requests module, but others are welcome too.
Authentication and login page
I don't know what kind of authentication the forum has (Basic, Digest, ...).
This is the part of the login page's HTML that asks for the username and password:
<dl>
<dt>
<label for="username">
Nombre de Usuario:
</label>
</dt>
<dd>
<input class="inputbox autowidth" id="username" name="username" size="25" tabindex="1" type="text" value=""/>
</dd>
</dl>
<dl>
<dt>
<label for="password">
Contraseña:
</label>
</dt>
<dd>
<input autocomplete="off" class="inputbox autowidth" id="password" name="password" size="25" tabindex="2" type="password"/>
</dd>
<dd>
<a href="./ucp.php?mode=sendpassword">
Olvidé mi contraseña
</a>
</dd>
</dl>
<dl>
<dd>
<label for="autologin">
<input id="autologin" name="autologin" tabindex="4" type="checkbox"/>
Recordar
</label>
</dd>
<dd>
<label for="viewonline">
<input id="viewonline" name="viewonline" tabindex="5" type="checkbox"/>
Ocultar mi estado de conexión en esta sesión
</label>
</dd>
</dl>
<input name="redirect" type="hidden" value="./search.php?search_id=newposts"/>
<dl>
<dt>
</dt>
<dd>
<input name="sid" type="hidden" value="b48ad769e2eab979294621d07e3ef19d"/>
<input class="button1" name="login" tabindex="6" type="submit" value="Identificarse"/>
</dd>
</dl>
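Since the page shows an HTML form rather than an HTTP Basic or Digest prompt, I suspect the login would have to be done by posting these fields back to the server. The sketch below only builds such a payload from the login page's HTML. `InputCollector` and `build_login_payload` are hypothetical names, and it is my assumption (not confirmed for this forum) that the server wants the hidden `sid` and `redirect` values and the `login` submit value sent along with the credentials. It uses the standard library's `HTMLParser` to stay dependency-free; BeautifulSoup could do the same job.

```python
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Collect name/value pairs of <input> elements from a login page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input .../>.
        if tag != "input":
            return
        attrs = dict(attrs)
        name = attrs.get("name")
        # Skip checkboxes: a browser does not send unchecked boxes.
        if name and attrs.get("type") != "checkbox":
            self.fields[name] = attrs.get("value") or ""


def build_login_payload(login_html, username, password):
    """Return the form data to submit (hypothetical helper, assumptions above)."""
    collector = InputCollector()
    collector.feed(login_html)
    payload = dict(collector.fields)
    payload["username"] = username
    payload["password"] = password
    return payload
```

The payload would presumably then be sent with `session.post(login_url, data=payload, verify=False)` on a `requests.Session()`, so that the login cookies are kept for later requests. Here `login_url` stands for whatever the form's `action` attribute points to, which is not part of the snippet above.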
Remark:
When I request the page I want to scrape (the one that is only visible when logged in), the response has HTTP status code 200.
However, the HTML it returns is that of the login page.
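Because the status code is 200 either way, it cannot tell me whether a login attempt worked. A small check like the following could make that visible. It is only a sketch: `looks_like_login_page` is a hypothetical helper, and it assumes that the login page is the only page containing an input named `password`.

```python
import re

# Matches an <input> tag that has name="password", whatever the attribute order.
_PASSWORD_INPUT = re.compile(
    r'<input\b[^>]*\bname=["\']password["\']', re.IGNORECASE
)


def looks_like_login_page(html):
    """Return True if the HTML contains the login form's password field."""
    return _PASSWORD_INPUT.search(html) is not None
```

Calling `looks_like_login_page(pet.text)` after each request would show whether the server is still sending the login form instead of the protected content.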