
When I access a page on an IIS server through the browser to retrieve XML via a query parameter (using the URL in the example below), I get a pop-up login dialog for a username and password (it appears to be a system-standard dialog/form), and once submitted, the data arrives as an XML page.

How do I handle this with urllib? When I run the following, I never get prompted for a username/password; I just get a traceback indicating the server (correctly) identifies me as not authorized. I'm using Python 2.7 in an IPython notebook.

f = urllib.urlopen("http://www.nalmls.com/SERetsHuntsville/Search.aspx?SearchType=Property&Class=RES&StandardNames=0&Format=COMPACT&Query=(DATE_MODIFIED=2012-09-28T00:00:00%2B)&Limit=10")
s = f.read()
f.close()

Pointers to docs would also be appreciated! I did not find this exact use case.

I plan to parse the XML to CSV, if that makes a difference.

dartdog
    [This answer](http://stackoverflow.com/a/4188709/416467) to a similar question looks pretty straightforward. – kindall Oct 05 '12 at 17:39
  • That does not work... Dunno why, get a 401 not authorized... – dartdog Oct 05 '12 at 18:17
  • That answer is using Http Basic Authentication, the url you are using needs Digest Authentication. – Nathan Villaescusa Oct 05 '12 at 18:25
  • Yes, see below! Digest auth was the trick! Should be noted that that is the probable default for MS servers.... – dartdog Oct 05 '12 at 18:47
  • As I don't have access to the server I don't have a chance to test, but I think using http://username:password@website.com/user/... as the URL should also work. – root Oct 05 '12 at 19:30

3 Answers


You are dealing with HTTP authentication. I've always found it tricky to get working quickly with the urllib library; the requests Python package makes it super simple.

import requests

url = "http://www.nalmls.com/SERetsHuntsville/Search.aspx?SearchType=Property&Class=RES&StandardNames=0&Format=COMPACT&Query=(DATE_MODIFIED=2012-09-28T00:00:00%2B)&Limit=10"
r = requests.get(url, auth=('user', 'pass'))
page = r.text

If you look at the headers for that url you can see that it is using digest authentication:

{'content-length': '1893', 'x-powered-by': 'ASP.NET', 'x-aspnet-version': '4.0.30319', 'server': 'Microsoft-IIS/7.5', 'cache-control': 'private', 'date': 'Fri, 05 Oct 2012 18:20:54 GMT', 'content-type': 'text/html; charset=utf-8', 'www-authenticate': 'Digest realm="Solid Earth", nonce="MTAvNS8yMDEyIDE6MjE6MjUgUE0", opaque="0000000000000000", stale=false, algorithm=MD5, qop="auth"'}

So you will need:

import requests
from requests.auth import HTTPDigestAuth

r = requests.get(url, auth=HTTPDigestAuth('user', 'pass'))
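For reference, this is roughly what HTTPDigestAuth computes under the hood per RFC 2617, using the realm, nonce, qop, and MD5 algorithm from the www-authenticate header shown above. A minimal sketch; the credentials, client nonce, and URI path below are placeholder values, not real ones:

```python
import hashlib

def md5_hex(s):
    # Hex MD5 of a string, as Digest auth requires
    return hashlib.md5(s.encode('utf-8')).hexdigest()

realm = "Solid Earth"                       # from the www-authenticate header
nonce = "MTAvNS8yMDEyIDE6MjE6MjUgUE0"       # server nonce from the same header
method, uri = "GET", "/SERetsHuntsville/Search.aspx"
cnonce, nc, qop = "0a4f113b", "00000001", "auth"  # client-chosen values

ha1 = md5_hex("user:%s:pass" % realm)       # md5(username:realm:password)
ha2 = md5_hex("%s:%s" % (method, uri))      # md5(method:request-uri)
# With qop="auth", the response digest chains in nc, cnonce, and qop
response = md5_hex(":".join([ha1, nonce, nc, cnonce, qop, ha2]))
print(response)  # 32-char hex digest sent back in the Authorization header
```

The client retries the request with an `Authorization: Digest ...` header containing this response value; requests handles the challenge/retry round trip for you.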
jfs
Nathan Villaescusa

There are many ways to do it, but I suggest you start with urllib2 and its included batteries.

import urllib2, base64

username, password = 'user', 'pass'  # your credentials
req = urllib2.Request("http://webpage.com//user")
b64str = base64.encodestring('%s:%s' % (username, password)).replace('\n', '')
req.add_header("Authorization", "Basic %s" % b64str)
result = urllib2.urlopen(req)

You can use requests, BeautifulSoup, mechanize, or selenium if your task gets harder. Googling will give you enough examples for each one of these.

root
  • you could use `base64.b64encode()` to avoid unnecessary `.replace('\n','')`. – jfs Oct 05 '12 at 18:07
  • This is the answer referred to in the 1st comment from kindall, don't know why but I still got a 401 not authorized... – dartdog Oct 05 '12 at 18:19
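As jfs's comment notes, `base64.b64encode()` emits no trailing newline, so the `.replace('\n', '')` becomes unnecessary. A minimal sketch with placeholder credentials, written to work on both Python 2 and 3:

```python
import base64

username, password = 'user', 'pass'  # placeholder credentials
# b64encode adds no trailing newline, unlike the legacy encodestring
b64str = base64.b64encode(
    ('%s:%s' % (username, password)).encode('ascii')
).decode('ascii')
print("Authorization: Basic %s" % b64str)  # prints "Authorization: Basic dXNlcjpwYXNz"
```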

This can be done in a couple of ways:

  1. Use urllib/urllib2 and requests as others have suggested
  2. Use Mechanize to simulate manual form-filling and get back the response
inspectorG4dget