
I would like to get data from https://creis.fang.com/.

However, I need to log in to the page first.

There are 4 values I need to fill in.

I tried using `requests`, but the login failed.

Here is my code:

import requests
url = 'https://creis.fang.com/'
s = requests.Session()
data ={'cnname': 'myname', 'cnotp':'abc', 'cntempcode':'123', 'cnproducttitle':'企业版'}
r = s.post(url=url, data=data)

Can you help me?

Thanks.


Chan
  • This is one of the core problems in web scraping...logging into a site. There's no one answer. It will vary on a site by site basis. Nobody is, IMO, going to tell you directly how to do this. It can take quite a bit of work to figure out how to do this for a particular site. This question is way too broad for this forum, which seeks to discuss and help programmers with basic programming problems and concepts. A whole article, or even a book, could be written on the topic you are asking about. With that said... – CryptoFool Jun 25 '19 at 02:59
  • Without scripting a full web browser, which is what our company does, what you need to do is run a Debugging HTTP Proxy and see what the request that goes over the wire when you fill out this dialog and successfully log in looks like. Seeing that, you'd then seek to reproduce that same request via Requests or some other HTTP client library. There are a number of good Debugging HTTP Proxy tools out there that will show you this traffic. We use Charles (https://www.charlesproxy.com), which is a killer tool. – CryptoFool Jun 25 '19 at 03:02
  • I have added my code. – Chan Jun 25 '19 at 03:09
  • ...if you're successful writing a script that logs into the site, then you'll get back some sort of token that you'll use on subsequent requests to the site that represent your logged in session. This could come back just about anywhere in the response...in the headers, in cookies, or in the body of the response. – CryptoFool Jun 25 '19 at 03:09
  • Can you show me how to do it, Steve? – Chan Jun 25 '19 at 03:11
  • Browsers also have developer tools that will show you what is going back and forth on the wire. That's where you need to start. You need to see a successful login happen as raw HTTP requests and responses. That's the only way to tackle this. I've told you all I can. To go any further would mean me solving this significantly time consuming problem for you. – CryptoFool Jun 25 '19 at 03:15
  • What leads you to believe that the site will accept the request your code is sending and log you in? Is that just a wild guess? If so, the chances that that, or anything like it, will work, is almost 0%. In general, the problem is much more complicated than that. In fact, almost arbitrarily so, as most sites actively seek to make it hard for scripts (robots) to log into the site. – CryptoFool Jun 25 '19 at 03:19
  • How can I use my ID and password to log in to the page with `requests`? – Chan Jun 25 '19 at 03:38
  • The webpage is just waiting for some variables that are being sent from a form by a POST request with user-defined variable names. These variables are what you think of as password and id but `requests` does not know them. These variables can have any name and you need to dig into the website code or listen to the traffic using the proxy mentioned above or your browser's web developer tools. – Joe Jun 25 '19 at 06:52
  • I have posted the data passed to the server. The `sPassword` field should be my password, but it seems to be encrypted. How can I get such information? – Chan Jun 25 '19 at 07:41
  • The `Request URL` ends with an `r=0.352` parameter, but it is generated randomly. Should I post to `https://creis.fang.com/` or to the Request URL? – Chan Jun 25 '19 at 07:43
  • It seems like you'd have a much easier time trying this using `Selenium` or `pyautogui`. Is there a reason that you have to use `requests`? – Anshu Chen Jun 28 '19 at 14:45
  • It is because `requests` is faster and able to run in background. – Chan Jun 28 '19 at 17:10

2 Answers


First of all, the login request that the browser sends is different from the one in your code.

The password that is sent is encrypted using RSA. I am not sure whether the ciphertext changes on every request or the encryption logic is fixed. Either way, you need to make sure the password you send is encrypted.

import requests

url = 'https://creis.fang.com/'
s = requests.Session()
s.get(url)

data = {
    "sloginID": "test",
    "sPassword": "8dfddc7151749b91cedc965dec30295f34f5ab3f46679d3fa8c8d917474d162c63adcc6c9c0ea42f4cfc3fdf06bf98e4a9225f6fb8690874a028f17f2f01726259b4a32fc49e3e287a7832b6799916531603d06763b8309ca144f8c6828a9273bd4612d6b759193ee69c3608e3072f914e9af05ab2ea4bf1ab12816995f51711",
    "sHID":"",
    "sTempCode":"",
    "product": "sfangdata", 
    "rsakey":"",
    "l": "cn"
}

res = s.post("https://creis.fang.com/login/login/?r=0.6500200817926693", data=data)

print(res.json())

After that, I get this error:

{'code': 'error', 'msg': '登录器权限已过期!', 'product': 'sfangdata', 'other': 'login'}

This is because I don't have valid credentials. (The message translates roughly to "Login device permission has expired!")
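
The response check can be made explicit. The following is a minimal sketch, assuming the JSON keeps the `code` and `msg` keys shown above and that, as a comment below reports, a successful login returns `"code": "ok"` with a URL in `msg`. The helper name `finish_login` is hypothetical, and the follow-up GET is a guess that has not been confirmed against the site.

```python
def finish_login(session, payload):
    """Follow the URL returned by a successful login response.

    `session` is anything with a requests-style .get(url) method, such as
    requests.Session(). `payload` is the parsed JSON from the login POST.
    """
    if payload.get("code") != "ok":
        # The server reports failures in `msg`, e.g. expired permissions.
        raise RuntimeError(f"login failed: {payload.get('msg')}")
    # Assumption: on success, `msg` holds a URL that completes the login
    # (for example by setting session cookies) when fetched.
    return session.get(payload["msg"])
```

With the session above, this would be called as `finish_login(s, res.json())`.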

Tarun Lalwani
  • Thank you, Tarun. However, how to get the number `0.6500200817926693` in the url and how to get the encrypted password which changes every time I log in? – Chan Jun 28 '19 at 21:56
  • The number is random and you can send anything. The easiest way to get the encrypted password is to log in manually once in the browser, watch the request it makes, and note the encrypted value used. Replicating the encryption in Python will not be straightforward. – Tarun Lalwani Jun 29 '19 at 00:39
  • If it helps, I think the password encryption is done via an `encryptedString()` function, which can be found in http://js.soufunimg.com/industry/land/js/RSA.js. The public key modulus and exponent are hard-coded in the login page and have values `9BB03766637D452E5D17A3FDBB5F0B0B3C8AAA4167C03A284B8BA97EEAC02C49D2A3108C024E6AD1CD816A17815CE11AD68F135134CCFCA5749770A12AD756398408EF12C4317D88498F837C734C2C52351AFD293179B274F3F07E9EF003BB2277965EDEAADF839A1094A6F0E808985D967493A0EBA0C14475F203EE55ECC65D` and `010001`. So, it may be possible to create the key with `Pycryptodome` or `Cryptography`. – t.m.adam Jun 29 '19 at 16:32
  • However, I agree that it would be easier to use the password created by the Js script in the browser. It seems they use PKCS#1 v1.5 padding which produces different ciphertext every time, but the decrypted plaintext will be the same. – t.m.adam Jun 29 '19 at 16:38
  • Yeah, I saw that already, but I'm not sure it's even worth the effort. When scraping, you want to take shortcuts that work. – Tarun Lalwani Jun 29 '19 at 17:07
  • How to generate public key modulus and exponent with `requests`? – Chan Jul 02 '19 at 05:48
  • Doing that changes the whole question itself. I think the main issue concerned with the original question is solved – Tarun Lalwani Jul 02 '19 at 06:12
  • Not yet. I followed the steps and got the JSON response : `'{"code":"ok","msg":"http://fdc.fang.com/login/SetProductV2.aspx?c=ADF9JSLKDFJ3IDLS456CD083F847690ACC8A632349058SJDL3C42BFCFB7E1F08&r=System.Random","product":"etp","other":"login"}'` . I still cannot login. – Chan Jul 02 '19 at 06:37
  • You probably need to do a GET on this URL after the API call. Without correct credentials, I can't be sure what the actual flow is like. – Tarun Lalwani Jul 02 '19 at 06:39
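
The RSA step discussed in the comments above can be sketched in pure Python. This is a minimal sketch, assuming the modulus and exponent quoted in the comments are still current and that the padding really is PKCS#1 v1.5, which the comment only says it "seems" to be. The site's `encryptedString()` may pad or order bytes differently, so the output is not confirmed to be accepted by the server. The function name `rsa_encrypt_pkcs1_v15` and the sample password are illustrative.

```python
import os

# Public key values quoted in the comments above (hex strings).
MODULUS_HEX = (
    "9BB03766637D452E5D17A3FDBB5F0B0B3C8AAA4167C03A284B8BA97EEAC02C49"
    "D2A3108C024E6AD1CD816A17815CE11AD68F135134CCFCA5749770A12AD75639"
    "8408EF12C4317D88498F837C734C2C52351AFD293179B274F3F07E9EF003BB22"
    "77965EDEAADF839A1094A6F0E808985D967493A0EBA0C14475F203EE55ECC65D"
)
EXPONENT_HEX = "010001"


def rsa_encrypt_pkcs1_v15(message: bytes, n: int, e: int) -> str:
    """Encrypt `message` with RSA and PKCS#1 v1.5 padding; return lowercase hex."""
    k = (n.bit_length() + 7) // 8  # modulus length in bytes
    if len(message) > k - 11:
        raise ValueError("message too long for this key size")
    # Padding string: random non-zero bytes filling the rest of the block.
    pad_len = k - len(message) - 3
    padding = bytearray()
    while len(padding) < pad_len:
        chunk = os.urandom(pad_len - len(padding))
        padding += bytes(b for b in chunk if b != 0)
    block = b"\x00\x02" + bytes(padding) + b"\x00" + message
    cipher = pow(int.from_bytes(block, "big"), e, n)
    # Zero-pad so the hex string always has two digits per modulus byte.
    return f"{cipher:0{2 * k}x}"


n = int(MODULUS_HEX, 16)
e = int(EXPONENT_HEX, 16)
encrypted = rsa_encrypt_pkcs1_v15(b"my-password", n, e)
```

Because the padding is random, `encrypted` differs on every run, which matches the observation that the ciphertext changes with each login. The result would go in the `sPassword` field of the POST data.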

In this case, the password is encrypted in JavaScript before the browser sends the request.

What I'd recommend at this point is to use something like https://selenium-python.readthedocs.io/ instead of requests.

That way you'll avoid having to reverse-engineer the JavaScript portion of things and be able to log in easily.

If you then need to proceed with requests, you could attempt to retrieve the cookies generated by Selenium and continue with requests, but I'd recommend using Selenium for everything instead.
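
Handing the cookies over could look roughly like this. It is a minimal sketch, assuming the login state lives entirely in cookies (not confirmed for this site) and that each entry returned by Selenium's `driver.get_cookies()` has `name` and `value` keys. The helper name `cookies_for_requests` is hypothetical.

```python
def cookies_for_requests(selenium_cookies):
    """Convert Selenium's cookie list into a {name: value} dict.

    Selenium's driver.get_cookies() returns a list of dicts; only the
    "name" and "value" entries are kept here.
    """
    return {c["name"]: c["value"] for c in selenium_cookies}


# Usage sketch (needs selenium and requests installed, and a completed
# login in the browser before the cookies are read):
#
#   driver.get("https://creis.fang.com/")
#   ... log in through the browser ...
#   s = requests.Session()
#   s.cookies.update(cookies_for_requests(driver.get_cookies()))
```

Dropping the domain and path keeps the sketch short, but it means every cookie is sent to any host the session contacts, so the session should only be used against the one site.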

velxundussa