
When I access a page on an IIS server through the browser to retrieve XML via a query parameter (using the URL in the example below), I get a pop-up login dialog for a username and password (it appears to be a system-standard dialog/form), and once submitted, the data arrives as an XML page.

How do I handle this with urllib? When I run the following, I never get prompted for a username/password; I just get a traceback indicating the server (correctly) identifies me as not authorized. I'm using Python 2.7 in an IPython notebook.

f = urllib.urlopen("http://www.nalmls.com/SERetsHuntsville/Search.aspx?SearchType=Property&Class=RES&StandardNames=0&Format=COMPACT&Query=(DATE_MODIFIED=2012-09-28T00:00:00%2B)&Limit=10")
s = f.read()
f.close()

Pointers to docs would also be appreciated! I did not find this exact use case.

I plan to parse the XML to CSV, if that makes a difference.

dartdog
    [This answer](http://stackoverflow.com/a/4188709/416467) to a similar question looks pretty straightforward. – kindall Oct 05 '12 at 17:39
  • That does not work... Dunno why, get a 401 not authorized... – dartdog Oct 05 '12 at 18:17
  • That answer is using Http Basic Authentication, the url you are using needs Digest Authentication. – Nathan Villaescusa Oct 05 '12 at 18:25
  • Yes, see below! Digest auth was the trick! Should be noted that that is the probable default for MS servers.... – dartdog Oct 05 '12 at 18:47
  • As I don't have access to the server I don't have a chance to test, but I think using http://username:password@website.com/user/... as the URL should also work. – root Oct 05 '12 at 19:30

3 Answers


You are dealing with HTTP authentication. I've always found it tricky to get working quickly with the urllib library; the requests Python package makes it super simple.

import requests

url = "http://www.nalmls.com/SERetsHuntsville/Search.aspx?SearchType=Property&Class=RES&StandardNames=0&Format=COMPACT&Query=(DATE_MODIFIED=2012-09-28T00:00:00%2B)&Limit=10"
r = requests.get(url, auth=('user', 'pass'))
page = r.text

If you look at the headers for that url you can see that it is using digest authentication:

{'content-length': '1893', 'x-powered-by': 'ASP.NET', 'x-aspnet-version': '4.0.30319', 'server': 'Microsoft-IIS/7.5', 'cache-control': 'private', 'date': 'Fri, 05 Oct 2012 18:20:54 GMT', 'content-type': 'text/html; charset=utf-8', 'www-authenticate': 'Digest realm="Solid Earth", nonce="MTAvNS8yMDEyIDE6MjE6MjUgUE0", opaque="0000000000000000", stale=false, algorithm=MD5, qop="auth"'}

So you will need:

import requests
from requests.auth import HTTPDigestAuth

r = requests.get(url, auth=HTTPDigestAuth('user', 'pass'))
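For reference, this is roughly what HTTPDigestAuth computes under the hood per RFC 2617, using the realm, nonce, qop, and MD5 algorithm from the www-authenticate header shown above. A minimal sketch; the credentials, client nonce, and URI path below are placeholder values, not real ones:

```python
import hashlib

def md5_hex(s):
    # Hex MD5 of a string, as Digest auth requires
    return hashlib.md5(s.encode('utf-8')).hexdigest()

realm = "Solid Earth"                       # from the www-authenticate header
nonce = "MTAvNS8yMDEyIDE6MjE6MjUgUE0"       # server nonce from the same header
method, uri = "GET", "/SERetsHuntsville/Search.aspx"
cnonce, nc, qop = "0a4f113b", "00000001", "auth"  # client-chosen values

ha1 = md5_hex("user:%s:pass" % realm)       # md5(username:realm:password)
ha2 = md5_hex("%s:%s" % (method, uri))      # md5(method:request-uri)
# With qop="auth", the response digest chains in nc, cnonce, and qop
response = md5_hex(":".join([ha1, nonce, nc, cnonce, qop, ha2]))
print(response)  # 32-char hex digest sent back in the Authorization header
```

The client retries the request with an `Authorization: Digest ...` header containing this response value; requests handles the challenge/retry round trip for you.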
jfs
Nathan Villaescusa

There are many ways to do it, but I suggest you start with urllib2 and its included batteries.

import urllib2, base64

username, password = 'user', 'pass'  # your credentials
req = urllib2.Request("http://webpage.com//user")
b64str = base64.encodestring('%s:%s' % (username, password)).replace('\n', '')
req.add_header("Authorization", "Basic %s" % b64str)
result = urllib2.urlopen(req)

You can use requests, BeautifulSoup, mechanize, or selenium if your task gets harder. Googling will give you enough examples for each one of these.

root
  • you could use `base64.b64encode()` to avoid unnecessary `.replace('\n','')`. – jfs Oct 05 '12 at 18:07
  • This is the answer referred to in the 1st comment from kindall, don't know why but I still got a 401 not authorized... – dartdog Oct 05 '12 at 18:19
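As jfs's comment notes, `base64.b64encode()` emits no trailing newline, so the `.replace('\n', '')` becomes unnecessary. A minimal sketch with placeholder credentials, written to work on both Python 2 and 3:

```python
import base64

username, password = 'user', 'pass'  # placeholder credentials
# b64encode adds no trailing newline, unlike the legacy encodestring
b64str = base64.b64encode(
    ('%s:%s' % (username, password)).encode('ascii')
).decode('ascii')
print("Authorization: Basic %s" % b64str)  # prints "Authorization: Basic dXNlcjpwYXNz"
```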

This can be done in a couple of ways:

  1. Use urllib/urllib2 and requests as others have suggested
  2. Use Mechanize to simulate manual form-filling and get back the response
inspectorG4dget