
I want to download all the PDF files available from this website, "https://register.awmf.org/de/start", and organize them into a folder. The code below doesn't find any PDF links, even though the page does link to PDF files, just not directly.

When I run the code, this message appears: No PDF links found on the page.

This is the code:

library(rvest)
library(stringr)
library(downloader) # For downloading the actual files

# 1. Define the main URL of the page you want to scrape
main_page_url <- "https://register.awmf.org/de/start" # <--- REPLACE with your target page URL

# 2. Read the HTML content of the main page
cat("Reading HTML from:", main_page_url, "\n")
webpage <- tryCatch({
  read_html(main_page_url)
}, error = function(e) {
  cat("Error reading main page:", e$message, "\n")
  return(NULL)
})

if (is.null(webpage)) {
  stop("Could not access the main webpage. Exiting.")
}

# 3. Extract all links (href attributes of <a> tags)
# You might need to inspect the webpage's HTML to find specific CSS selectors
# if you only want links from a certain section (e.g., a div with class "downloads")
cat("Extracting links...\n")
all_links <- webpage %>%
  html_elements("a") %>% # Select all <a> (anchor) tags
  html_attr("href")      # Extract the 'href' attribute from each <a> tag

# Remove any NULL or NA links
all_links <- all_links[!is.na(all_links)]

# 4. Filter for specific file types (e.g., PDFs)
# You can extend this regex to include .docx, .xlsx, .zip, etc.
pdf_links <- all_links[str_detect(all_links, "\\.pdf$")] # Links ending with .pdf

# 5. Construct full URLs (handle relative links)
# This is crucial! Many links are relative (e.g., /documents/file.pdf)
base_url <- "https://example.com" # <--- REPLACE with the base URL of the website
full_pdf_urls <- sapply(pdf_links, function(link) {
  if (str_starts(link, "http://") || str_starts(link, "https://") || str_starts(link, "ftp://")) {
    return(link) # It's already an absolute URL
  } else if (str_starts(link, "/")) {
    return(paste0(base_url, link)) # It's a root-relative URL
  } else {
    # This case might be more complex, e.g., current_path/file.pdf
    # For simplicity, we'll assume most downloadable files are root-relative or absolute.
    # You might need more sophisticated URL parsing if this is an issue.
    return(NA) # Mark as NA if it's not directly handleable
  }
})

# Remove any NA URLs that couldn't be resolved
full_pdf_urls <- full_pdf_urls[!is.na(full_pdf_urls)]

if (length(full_pdf_urls) == 0) {
  cat("No PDF links found on the page.\n")
} else {
  cat("Found", length(full_pdf_urls), "PDF links.\n")

  # 6. Define download directory
  download_dir <- "downloaded_pdfs"
  if (!dir.exists(download_dir)) {
    dir.create(download_dir)
  }

  # 7. Download each PDF file
  for (url in full_pdf_urls) {
    # Extract filename from URL (e.g., "file.pdf" from "http://example.com/path/file.pdf")
    filename <- basename(url)
    destination_path <- file.path(download_dir, filename)

    cat("Downloading:", url, "to", destination_path, "\n")

    tryCatch({
      download(url, destination_path, mode = "wb")
      cat("  Successfully downloaded:", filename, "\n")
    }, error = function(e) {
      cat("  Error downloading", filename, ":", e$message, "\n")
    })
  }

  cat("\nAll PDF download attempts completed.\n")
}

Ward Khedr

1 Answer


This question might be off-topic here; Stack Overflow might be a better place for it.

Anyway, I don't think your approach will work for the website you want to access. It appears the page in question is not static HTML; rather, it is probably rendered dynamically with JavaScript. As far as I know, the rvest package doesn't support JavaScript rendering, so it won't see dynamically generated content like the PDF links on that page.
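You can check this yourself by looking at what the static HTML actually contains. A quick sketch (using only the URL from your question); on a JavaScript-rendered page you would expect it to report very few anchors and no .pdf hrefs:

library(rvest)

static_page <- read_html("https://register.awmf.org/de/start")
hrefs <- static_page %>% html_elements("a") %>% html_attr("href")

length(hrefs)                                                 # anchors present in the raw HTML
sum(grepl("\\.pdf", hrefs, ignore.case = TRUE), na.rm = TRUE) # how many of them point at PDFs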

Instead you could try chromote, which can render the page so that JavaScript-generated elements (the PDF links in this case) are included.

There are also a few minor issues with your code:

  • the base_url is hardcoded to https://example.com rather than the site you are scraping, so relative links won't resolve correctly.
  • download() handles one file at a time and doesn't strip query strings, so filenames like file.pdf?version=1 can cause errors on some systems (e.g. Windows); see the sketch after this list.
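A minimal sketch of one way to deal with the query-string issue (the helper name clean_pdf_name is mine, purely for illustration):

clean_pdf_name <- function(url) {
  # basename() keeps everything after the last "/", including any query string,
  # so strip a trailing "?..." before using the result as a local filename
  sub("\\?.*$", "", basename(url))
}

clean_pdf_name("https://example.org/docs/guideline.pdf?version=1")
# [1] "guideline.pdf"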

One work-around is to use the chromote package to render the page in a headless browser; once the DOM is fully built, you can apply the same html_elements("a") extraction as in your existing code to gather the PDF links.
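Here is a minimal sketch of that idea. I haven't tuned it for this particular site: the Sys.sleep() is just a crude allowance for the client-side rendering to finish, and you may need a longer wait, a smarter wait strategy, or a more specific CSS selector than "a" depending on how the page behaves.

library(chromote)
library(rvest)

b <- ChromoteSession$new()

# Navigate and wait for the page's load event, then give the JavaScript
# some extra time to build the DOM (adjust as needed)
b$Page$navigate("https://register.awmf.org/de/start", wait_ = FALSE)
b$Page$loadEventFired()
Sys.sleep(5)

# Pull the rendered DOM out of the browser and hand it to rvest as usual
rendered_html <- b$Runtime$evaluate("document.documentElement.outerHTML")$result$value
page <- read_html(rendered_html)

hrefs <- page %>% html_elements("a") %>% html_attr("href")
pdf_links <- hrefs[!is.na(hrefs) & grepl("\\.pdf", hrefs, ignore.case = TRUE)]

b$close()

I believe recent versions of rvest also ship read_html_live(), which is built on chromote and gives you much the same thing with less ceremony.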

Robert Long