Hello, I have a homework assignment that doesn't make sense to me; the block below is what I was given. I have never done any scripting with web pages before, so I'm not sure how to get started. I could especially use pointers on Part 2, since I don't know how to tell whether a "file" is reachable or not.
I have tried putting "wget http://www.oracle.com/us/solutions/index.html" into a shell, and some info about the site comes up, but I'm not sure which of that information to use.
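If I understand the spec, "reachable" refers to the snapshot on disk rather than the live site, so wget's network output may be a red herring. For a single internal link, the check seems to reduce to a file-existence test. Here is a small Python sketch of that idea (the directory and link names are invented for illustration):

```python
# Sketch: checking whether an internal link's target exists in a local
# snapshot. The snapshot path and link below are made-up examples.
from pathlib import Path

snapshot = Path("/tmp/static_verifier_demo")        # hypothetical snapshot root
(snapshot / "us/solutions").mkdir(parents=True, exist_ok=True)
(snapshot / "us/solutions/index.html").touch()      # pretend page in the snapshot

link = "us/solutions/index.html"                    # a link found in some page
if (snapshot / link).is_file():
    print(f"valid internal link: {link}")
else:
    print(f"missing internal link: {link}")
```

Is that the right way to think about it, or does "reachable" involve actually fetching the pages?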
You have a website containing static pages, such as www.oracle.com, and you wish to verify the site.
Part 1: Static Verifier
Write an application called static_verifier that takes two command-line arguments: a directory and a base URI (i.e., Uniform Resource Identifier, such as http://www.oracle.com/us/solutions/index.html). The application will scan all .html files in the directory and its subdirectories for <a> (anchor) tags and <img> (image) tags to find linked files.
For each link, determine whether it points to an internal (this site) or external resource. If it is internal, verify whether the file exists in your snapshot. Output should consist of the file name, the missing internal links, the valid internal links, and the external links. Indent each section. Within each section, list the names alphabetically so they will be diff-compatible with baseline data.
Sample Output (from modified index.html):
Missing Internal Links
Valid Internal Links
Part 2: File Verifier
Write a program called file_verifier that takes the same arguments as static_verifier.
Verify that each file in the subtree is reachable, directly or indirectly, from the homepage (index.html). Print out the list of unreachable files in sorted order.
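Here is as far as I've gotten with Part 2 after some reading: it looks like a graph search, where pages are nodes and the <a>/<img> targets are edges, so a breadth-first search from index.html should mark everything reachable, and whatever is never visited is unreachable. The link extraction in the middle is the same parsing Part 1 needs. This is just a sketch with an invented tiny site, using only the standard library, so please correct me if I've misread the assignment:

```python
# Sketch of Part 2: breadth-first search from index.html over each page's
# internal links; files never visited are unreachable. The file layout
# below is invented for demonstration.
from collections import deque
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urlparse

class LinkCollector(HTMLParser):
    """Collect href/src targets from <a> and <img> tags (also useful for Part 1)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "img" and "src" in attrs:
            self.links.append(attrs["src"])

def unreachable_files(root):
    """Return the sorted files under root not reachable from index.html."""
    visited = set()
    queue = deque(["index.html"])
    while queue:
        rel = queue.popleft()
        if rel in visited or not (root / rel).is_file():
            continue
        visited.add(rel)
        if rel.endswith(".html"):
            parser = LinkCollector()
            parser.feed((root / rel).read_text())
            for link in parser.links:
                parsed = urlparse(link)
                if not parsed.netloc:               # no host part: internal link
                    queue.append(parsed.path.lstrip("/"))
    all_files = {str(p.relative_to(root)) for p in root.rglob("*") if p.is_file()}
    return sorted(all_files - visited)

# Tiny invented site: index links to a.html, orphan.html is linked from nowhere.
root = Path("/tmp/reach_demo")
root.mkdir(exist_ok=True)
(root / "index.html").write_text('<a href="/a.html">a</a>')
(root / "a.html").write_text("")
(root / "orphan.html").write_text("")
print(unreachable_files(root))   # prints ['orphan.html'] when the directory starts empty
```

Am I on the right track, or is there a simpler way to test reachability than building the whole visited set?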