
It is trivial to extract the PDF URL from the following webpage.

https://www.osapublishing.org/boe/abstract.cfm?uri=boe-11-5-2745

But when I wget it, the output contains something like the following instead of a PDF file:

<p>OSA has implemented a process that requires you to enter the letters and/or numbers below before you can download this article.</p>

As the website uses the cookie cfid, it should be protected by ColdFusion. Does anybody know how to scrape such a webpage? Thanks.

https://cookiepedia.co.uk/cookies/CFID

EDIT: The wget solution offered by Sev Roberts does not work. I checked the Chrome devtools (in a new incognito window): many requests are sent after the first request to https://www.osapublishing.org/boe/abstract.cfm?uri=boe-11-5-2745. I guess the problem is that wget does not send those requests, so the subsequent wget (with cookies) of https://www.osapublishing.org/boe/viewmedia.cfm?uri=boe-11-5-2745&seq=0 does not work. Can anybody tell which of those extra requests are essential? Thanks.

user1424739
  • I don't think it's protected by ColdFusion. Please provide the command line that you are using with WGET. – James Moberg Aug 14 '20 at 22:19
  • It's not trivial, but if I tried hard enough, I could probably do it with ColdFusion code. – Dan Bracuk Aug 15 '20 at 03:26
  • Edit - Seems like they put a process in place to detect and prevent automated downloads like this. Doesn't sound CF specific. If it's not a violation of their TOS, my guess is it requires making one request to get a cookie value, extracting the cookie value, then submitting a second request with that cookie. – SOS Aug 15 '20 at 07:41
  • @SOS The problem is many requests are sent. How to figure out which of them are essential? – user1424739 Aug 15 '20 at 11:32
  • AFAIK, Trial and error. All part of the fun and adventures of screen scraping. No two sites are alike.. – SOS Aug 16 '20 at 05:15
  • Are you using ColdFusion to scrape another website? You mention "WGET" and that the site you are trying to scrape uses ColdFusion, but you didn't state whether you were using ColdFusion or not. If this protection is due to a WAF or session, you may need to use a JS-enabled browser (like headless Chrome). – James Moberg Aug 17 '20 at 16:54

1 Answer


There are several methods that sites use against this sort of scraping and direct linking or embedding. The basic old methods included:

  1. Checking the user's cookies: at a minimum, verifying that the user already has a session from a previous page on the site; some sites go further and look for specific cookie or session variables that prove a genuine path through the site.
  2. Checking the cgi.http_referer variable to see whether the user arrived from the expected source.
  3. Checking whether the cgi.http_user_agent looks like a known human browser - or checking that the user agent does not look like a known bot.

Other more intelligent methods of course exist, but in my experience, if you need more than the above then you're into the territory of requiring a captcha and/or requiring users to register and log in.

Obviously (2) and (3) are easily spoofed by setting the headers manually. For (1), if you're using cfhttp or its equivalent in another language, you need to ensure that cookies returned in the Set-Cookie header of the site's response are sent back in the headers of your subsequent request, using cfhttpparam. Various cfhttp wrappers and alternative libraries, such as Java wrappers that bypass the cfhttp layer, are available to do this. But if you want to understand a simple example of how this works, Ben Nadel has an old but good one here: https://www.bennadel.com/blog/725-maintaining-sessions-across-multiple-coldfusion-cfhttp-requests.htm
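
For illustration, here is a minimal, untested sketch of a single cfhttp request that spoofs (2) and (3) via headers and passes session cookies for (1). The capturedCfid and capturedCftoken variables are placeholders for values you would have captured from an earlier response's Set-Cookie header; the real cookie names and values may differ.

<cfhttp
    url="https://www.osapublishing.org/boe/viewmedia.cfm?uri=boe-11-5-2745&seq=0"
    method="get"
    getasbinary="yes"
    result="pdfResponse">
    <!--- (2) Pretend we arrived from the abstract page. --->
    <cfhttpparam type="header" name="Referer" value="https://www.osapublishing.org/boe/abstract.cfm?uri=boe-11-5-2745" />
    <!--- (3) Pretend to be an ordinary desktop browser. --->
    <cfhttpparam type="header" name="User-Agent" value="Mozilla/5.0 (Windows NT 10.0; Win64; x64)" />
    <!--- (1) Send back the session cookies captured earlier (placeholder variables). --->
    <cfhttpparam type="cookie" name="CFID" value="#capturedCfid#" />
    <cfhttpparam type="cookie" name="CFTOKEN" value="#capturedCftoken#" />
</cfhttp>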

With the PDF URL from the link in your question, a couple of minutes' tinkering in Chrome shows that if I lose the cookies from the previous page and keep the http_referer then I see the captcha challenge, but if I keep the cookies and lose the http_referer then I get straight through to the PDF. This confirms that they care about the cookies but not the referer.

Copy of Ben's example for SO completeness:

<cffunction
    name="GetResponseCookies"
    access="public"
    returntype="struct"
    output="false"
    hint="This parses the response of a CFHttp call and puts the cookies into a struct.">
 
    <!--- Define arguments. --->
    <cfargument
        name="Response"
        type="struct"
        required="true"
        hint="The response of a CFHttp call."
        />
    <!---
        Create the default struct in which we will hold
        the response cookies. This struct will contain structs
        and will be keyed on the name of the cookie to be set.
    --->
    <cfset LOCAL.Cookies = StructNew() />
    <!---
        Get a reference to the cookies that were returned
        from the page request. This will give us a numerically
        indexed struct of cookie strings (which we will have
        to parse out for values). BUT, check to make sure
        that cookies were even sent in the response. If they
        were not, then there is no work to be done.
    --->
    <cfif NOT StructKeyExists(
        ARGUMENTS.Response.ResponseHeader,
        "Set-Cookie"
        )>
        <!---
            No cookies were sent back in the response. Just
            return the empty cookies structure.
        --->
        <cfreturn LOCAL.Cookies />
    </cfif>
    <!---
        ASSERT: We know that cookies were returned in the page
        response and that they are available at the key,
        "Set-Cookie", of the response header.
    --->
    <!---
        Now that we know that the cookies were returned, get
        a reference to the struct as described above.
    --->
    <!---
        The cookies might be coming back as a struct or they
        might be coming back as a string. If there is only
        ONE cookie being returned, then it comes back as a
        string. If that is the case, then re-store it as a
        struct.
    --->
    <cfif IsSimpleValue(ARGUMENTS.Response.ResponseHeader[ "Set-Cookie" ])>
        <cfset LOCAL.ReturnedCookies = {} />
        <cfset LOCAL.ReturnedCookies[1] = ARGUMENTS.Response.ResponseHeader[ "Set-Cookie" ] />
    <cfelse>
        <cfset LOCAL.ReturnedCookies = ARGUMENTS.Response.ResponseHeader[ "Set-Cookie" ] />
    </cfif>
    <!--- Loop over the returned cookies struct. --->
    <cfloop
        item="LOCAL.CookieIndex"
        collection="#LOCAL.ReturnedCookies#">
        <!---
            As we loop through the cookie struct, get
            the cookie string we want to parse.
        --->
        <cfset LOCAL.CookieString = LOCAL.ReturnedCookies[ LOCAL.CookieIndex ] />
        <!---
            For each of these cookie strings, we are going to
            need to parse out the values. We can treat the
            cookie string as a semi-colon delimited list.
        --->
        <cfloop
            index="LOCAL.Index"
            from="1"
            to="#ListLen( LOCAL.CookieString, ';' )#"
            step="1">
            <!--- Get the name-value pair. --->
            <cfset LOCAL.Pair = ListGetAt(
                LOCAL.CookieString,
                LOCAL.Index,
                ";"
                ) />
            <!---
                Get the name as the first part of the pair
                separated by the equals sign.
            --->
            <cfset LOCAL.Name = ListFirst( LOCAL.Pair, "=" ) />
            <!---
                Check to see if we have a value part. Not all
                cookies send a value, which can throw
                off ColdFusion.
            --->
            <cfif (ListLen( LOCAL.Pair, "=" ) GT 1)>
                <!--- Grab the rest of the list. --->
                <cfset LOCAL.Value = ListRest( LOCAL.Pair, "=" ) />
            <cfelse>
                <!---
                    Since ColdFusion did not find more than one
                    value in the list, just get the empty string
                    as the value.
                --->
                <cfset LOCAL.Value = "" />
            </cfif>
            <!---
                Now that we have the name-value data values,
                we have to store them in the struct. If we are
                looking at the first part of the cookie string,
                this is going to be the name of the cookie and
                its struct index.
            --->
            <cfif (LOCAL.Index EQ 1)>
                <!---
                    Create a new struct with this cookie's name
                    as the key in the return cookie struct.
                --->
                <cfset LOCAL.Cookies[ LOCAL.Name ] = StructNew() />
                <!---
                    Now that we have the struct in place, let's
                    get a reference to it so that we can refer
                    to it in subsequent loops.
                --->
                <cfset LOCAL.Cookie = LOCAL.Cookies[ LOCAL.Name ] />
                <!--- Store the value of this cookie. --->
                <cfset LOCAL.Cookie.Value = LOCAL.Value />
                <!---
                    Now, this cookie might have more than just
                    the first name-value pair. Let's create an
                    additional attributes struct to hold those
                    values.
                --->
                <cfset LOCAL.Cookie.Attributes = StructNew() />
            <cfelse>
                <!---
                    For all subsequent calls, just store the
                    name-value pair into the established
                    cookie's attributes struct.
                --->
                <cfset LOCAL.Cookie.Attributes[ LOCAL.Name ] = LOCAL.Value />
            </cfif>
        </cfloop>
    </cfloop>
    <!--- Return the cookies. --->
    <cfreturn LOCAL.Cookies />
</cffunction>

Assuming you have a cfhttp response from the first page https://www.osapublishing.org/boe/abstract.cfm?uri=boe-11-5-2745, pass that response into the above function and hold its result in a variable named cookieStruct; you can then use this inside subsequent cfhttp requests:

<cfloop item="strCookie" collection="#cookieStruct#">
    <cfhttpparam type="COOKIE" name="#strCookie#" value="#cookieStruct[strCookie].Value#" />
</cfloop>
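
Putting it all together, an untested sketch of the full two-request flow might look like this; it assumes the GetResponseCookies function above is available on the page, and the output file name is just an example:

<!--- 1. Request the abstract page to establish a session and capture its cookies. --->
<cfhttp
    url="https://www.osapublishing.org/boe/abstract.cfm?uri=boe-11-5-2745"
    method="get"
    result="abstractResponse" />

<cfset cookieStruct = GetResponseCookies( abstractResponse ) />

<!--- 2. Request the PDF, sending back every cookie the first response set. --->
<cfhttp
    url="https://www.osapublishing.org/boe/viewmedia.cfm?uri=boe-11-5-2745&seq=0"
    method="get"
    getasbinary="yes"
    result="pdfResponse">
    <cfloop item="strCookie" collection="#cookieStruct#">
        <cfhttpparam type="COOKIE" name="#strCookie#" value="#cookieStruct[strCookie].Value#" />
    </cfloop>
</cfhttp>

<!--- 3. Save the binary response to disk (example file name). --->
<cffile
    action="write"
    file="#ExpandPath( './boe-11-5-2745.pdf' )#"
    output="#pdfResponse.FileContent#" />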

Edit: if you're using wget instead of cfhttp, you could try the approach from the answer to this question, but without posting a username and password, since you don't actually need a login form:

How to get past the login page with Wget?

e.g.

# Get a session.
wget --save-cookies cookies.txt \
     --keep-session-cookies \
     --delete-after \
     'https://www.osapublishing.org/boe/abstract.cfm?uri=boe-11-5-2745'

# Now grab the page or pages we care about.
# You may also need to add valid http_referer or http_user_agent headers
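# (untested) e.g. extra headers could be passed with wget options such as:
#   --header="Referer: https://www.osapublishing.org/boe/abstract.cfm?uri=boe-11-5-2745" \
#   --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)" \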
wget --load-cookies cookies.txt \
     'https://www.osapublishing.org/boe/viewmedia.cfm?uri=boe-11-5-2745&seq=0'

...although as others have pointed out, you may be violating the terms of service of the source, so I couldn't recommend actually doing this.

Sev Roberts
  • .. yeah, though if they are doing all that, the TOS probably doesn't allow screen scraping :-) – SOS Aug 17 '20 at 17:42
  • I don't use cf code for scraping and I don't understand cf code. I just want to send the raw HTTP requests for scraping. Could you provide the raw wget command (along with any other commands to figure out other parameters) so that I can try whether it works? Thanks. – user1424739 Aug 17 '20 at 18:46
  • Ah - I had mentally autocorrected `wget` to 'get' when I read the question. I don't use `wget` for anything other than simple download scripts for Docker. I could translate that CF into Java, C# or PHP if I really had to, but I wouldn't know how you go about programmatically extracting headers from one wget response to feed into another wget request. If you're doing this in shell scripts then you'd be best off looking up the `man` pages for `wget` to see whether it has a built-in option for persisting cookies, and if not, choose another command line browser that does... – Sev Roberts Aug 17 '20 at 19:14
  • I edited my answer to add a `wget` example, untested, but it ought to work for your case since they are using a 302 redirect, which `wget` can follow. It wouldn't work for a javascript redirect; for future ref if that was the case then you'd have to use the final destination of the redirect as the last wget URL, and hope that they are not using a dynamically-generated time-limited unique URL :-) – Sev Roberts Aug 17 '20 at 19:28
  • @SevRoberts I think the website does more to prevent a simple wget command with cookies. I think the key is to figure out the minimal set of HTTP requests that must be sent. Are you able to check which HTTP requests your ColdFusion code sends and keep only the minimal set (and the dependencies among them, as a later request may need information obtained from an earlier one) while still getting the PDF file downloaded? Only that way can one figure out the wget commands needed (figuring out the wget commands is trivial once the HTTP requests are known). – user1424739 Aug 21 '20 at 23:55