I am trying to crawl some Coursera webpages that are useful for review after a course ends, such as the syllabus and homework pages.

I am using wget, but I found that login is required. I tried the approaches from two posts (1, 2); neither works.

I also noticed that Coursera page URLs do not end in .html or .htm. Is there any way to get past the login and download Coursera webpages with wget?

user2262504

1 Answer

The Python package https://github.com/dgorissen/coursera-dl may be better suited to what you are asking for, although it requires Python rather than wget. The author notes using Python 2.7 and pip. The advantage of this package is that you can download everything related to a course in one run.

Note that you must accept the honor code for the Coursera class, the first time you open the class page, before this script will run correctly; this is noted on the main project page and in README.md. Unlike at least one similar project on github.com, this one is actively maintained, with the most recent update within the last 6 months.

I would strongly recommend trying one of the Python packages. In my own testing on Windows (results may differ with wget on another platform), wget itself continues to have an issue with the Coursera security certificate despite the inclusion of --no-check-certificate in both commands.

This testing was done with GNU Wget 1.14 built on mingw32, according to the version string. The same result was encountered with both v1 and v3 of the Coursera Login protocol.

wget (using Coursera Login v1, also from a comment below):

wget --save-cookies=cookies.txt --no-check-certificate --keep-session-cookies
--post-data="email=email@example.com&password=mypassword&webrequest=true" 
https://accounts.coursera.org/api/v1/login?

Resolving accounts.coursera.org (accounts.coursera.org)... 54.225.163.33, 107.20.232.186, 54.243.110.245
Connecting to accounts.coursera.org (accounts.coursera.org)|54.225.163.33|:443... connected.
WARNING: cannot verify accounts.coursera.org's certificate, issued by ...
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 400 Bad Request
ERROR 400: Bad Request.
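
Note that in the transcript above the certificate message is only a warning: the request was still sent, and the server answered 400 Bad Request. wget prints just the status line here, so the reason for the rejection is not visible. The following is a minimal Python sketch, not something I ran against Coursera, that shows one way to read the body of an error response, which sometimes explains what the server disliked. The helper names `describe_http_error` and `try_request` are hypothetical, and whether Coursera returns a useful body is an assumption.

```python
import urllib.error
import urllib.request


def describe_http_error(err, limit=500):
    """Summarize an HTTPError: status code, reason, and the start of the body."""
    # Read at most `limit` bytes so a large error page does not flood the console.
    body = err.read(limit).decode("utf-8", errors="replace")
    return "HTTP {} {}: {}".format(err.code, err.reason, body)


def try_request(request):
    """Send a request; return (True, body bytes) or (False, error summary)."""
    try:
        with urllib.request.urlopen(request) as resp:
            return True, resp.read()
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses land here; the response body is still readable.
        return False, describe_http_error(err)
```

Passing a `urllib.request.Request` built with the same URL and form data as the wget command to `try_request` would return the error summary instead of only the status line.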

wget Update (using Coursera Login v3):

Note that wget (tested on Windows) does not appear to work with Coursera Login v1 (see the comment below) or Coursera Login v3 (immediately below):

wget https://accounts.coursera.org/api/login/v3/login? --save-cookies cookies.txt
--keep-session-cookies --no-check-certificate --post-data
"email=email@example.com&password=mypassword&webrequest=true"


Resolving accounts.coursera.org (accounts.coursera.org)... 50.19.244.62, 107.20.145.110, 54.221.210.127
Connecting to accounts.coursera.org (accounts.coursera.org)|50.19.244.62|:443... connected.
WARNING: cannot verify accounts.coursera.org's certificate, issued by ...
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 400 Bad Request
ERROR 400: Bad Request.
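
For comparison, here is a rough Python equivalent of the wget commands above, using only the standard library: it posts the same form fields, keeps cookies in a wget-style cookies.txt, and then requests a page with the same session. This is a sketch under assumptions, not a confirmed working login. The URL is copied from the v1 command and may be outdated, the function names are hypothetical, and Coursera may expect additional headers or tokens that you would need to copy from the request shown in your browser's developer tools.

```python
import urllib.parse
import urllib.request
from http.cookiejar import MozillaCookieJar

# Taken from the wget command above; assumed, and possibly outdated.
LOGIN_URL = "https://accounts.coursera.org/api/v1/login"


def build_login_request(email, password, url=LOGIN_URL):
    """Build the POST request that mirrors wget's --post-data form."""
    form = urllib.parse.urlencode(
        {"email": email, "password": password, "webrequest": "true"}
    ).encode("utf-8")
    return urllib.request.Request(url, data=form, method="POST")


def make_session(cookie_file="cookies.txt"):
    """Return an opener that stores cookies in a wget-compatible cookies.txt jar."""
    jar = MozillaCookieJar(cookie_file)
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def login_and_fetch(email, password, page_url, cookie_file="cookies.txt"):
    """Log in, save cookies, then download one page with the same session."""
    opener, jar = make_session(cookie_file)
    with opener.open(build_login_request(email, password)) as resp:
        resp.read()
    # ignore_discard=True keeps session cookies, like --keep-session-cookies.
    jar.save(ignore_discard=True)
    with opener.open(page_url) as resp:
        return resp.read()
```

Unlike the wget commands, this sketch leaves certificate verification on, so a certificate problem would raise an error instead of printing a warning.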
localhost
  • Yes, I am currently using coursera-dl to download videos and slides. But this tool cannot download webpages, or at least I cannot find documentation on using it to download webpages. Can you help? – user2262504 May 14 '15 at 01:58
  • @user2262504 I will take a closer look into that as well. With this version (similar in name to a few others), lecture notes and quizzes, among others, can be downloaded, as per the description. I would also recommend confirming that the folder structure of the one you installed is similar to, or the same as, the listing in the package at the link above. – localhost May 14 '15 at 02:04
  • I have tried https://github.com/dgorissen/coursera-dl; it only downloads the index webpage and the lecture webpage, and both of these pages are "ooops... HTTP 404". – user2262504 May 14 '15 at 03:39
  • I checked with wget on Windows and obtained the following, did you check with this code and similar response? **Code:** `wget --save-cookies=cookies.txt --no-check-certificate --keep-session-cookies --post-data="email=email@example.com&password=mypassword&webrequest=true" https://accounts.coursera.org/api/v1/login?` **Response:** WARNING: cannot verify accounts.coursera.org's certificate, issued by... Unable to locally verify the issuer's authority. HTTP request sent, awaiting response... 400 Bad Request 2015-05-13 23:22:05 ERROR 400: Bad Request. – localhost May 14 '15 at 05:41
  • The API has been updated to https://www.coursera.org/api/login/v3. You can use Chrome developer tools to inspect the request. These two images are a demo: http://xuelangzf-github.qiniudn.com/20140903_login_url.png , http://xuelangzf-github.qiniudn.com/20140903_coursera_post.png – user2262504 May 14 '15 at 23:47