#1
  1. Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2013
    Posts
    1
    Rep Power
    0

    How to tell if a file is reachable


    Hello, I have a homework assignment that doesn't make any sense to me. What I've been given is in the code block below. I have never done scripting with web pages before, and I'm not really sure how to get started. I could use some pointers on Part 2, since I don't know how to tell whether a "file" is reachable or not.
    I have tried running "wget http://www.oracle.com/us/solutions/index.html" in a shell, and some info about the site comes up, but I'm not sure which of that information to use.


    Code:
    You have a web site containing static pages, such as www.oracle.com and you wish to verify the site.
    
    Part 1: Static Verifier
    Write an application called static_verifier that takes two command-line arguments: a directory and a base URI (i.e. Uniform Resource Identifier, such as http://www.oracle.com/us/solutions/index.html). The application will scan all .html files in the directory and its subdirectories for <a> (anchor) tags and <img> (image) tags to find linked files.
    For each link, determine whether it points to an internal (this site) or external resource. If it is internal, verify whether the file exists in your snapshot. Output should consist of the file name, the missing internal links, the valid internal links, and the external links. Indent each section. Within each section, list the names alphabetically so they will be diff-compatible with baseline data.
    
    Sample Output (from modified index.html):
    
    data/index-broken.html
        Missing Internal Links
            bad_dijkstra.zip
            missing_single-dispatch.cc
            oldsite/index.html
            test_assignment-5.html
            test_assignment-5.html
        Valid Internal Links
            Run_***1
            allocator_skel.cc
            args.cc
            assignment-1.html
            assignment-2.html
            assignment-3.html
            lecture-01.html
            lecture-02.html
            lecture-03.html
            wordcount-btree-skel.cc
            wordcount-map.cc
        External Links
            http://catb.org/jargon/
            http://cis.stvincent.edu/html/tutorials/swd/index.html
            http://courses.washington.edu/css343/zander
            http://en.wikibooks.org/wiki/C%2B%2B
            http://en.wikibooks.org/wiki/More_C%2B%2B_Idioms
            http://en.wikipedia.org/wiki/Bash_%28Unix_shell%29
            http://en.wikipedia.org/wiki/Bourne_shell
            http://en.wikipedia.org/wiki/C_shell
            http://en.wikipedia.org/wiki/Exit_status
            http://en.wikipedia.org/wiki/Man_page
            http://en.wikipedia.org/wiki/Script_%28computing%29
            http://www.cs.sunysb.edu/~algorith/video-lectures/
            http://www.parashift.com/c++-faq-lite/index.html
            http://www.uwb.edu/css
            http://www.washington.edu/computing/unix/
            http://yosefk.com/c++fqa/
            https://catalyst.uw.edu/collectit/dropbox/morrisb9/25684
          
    Part 2
    
    Write a program called file_verifier that will take the 
    same arguments as static_verifier.
    
    Verify that each file in the subtree is reachable directly or 
    indirectly from the homepage (index.html). Print out
    the list of unreachable files in sorted order.
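
One way to read "reachable" in Part 2 is as a graph problem: every file in the snapshot is a node, every <a href> or <img src> that points inside the site is an edge, and a file is reachable if a breadth-first walk starting at index.html ever visits it. Nothing needs to be downloaded with wget, because the files are already in the directory. Below is a minimal sketch of that idea in Python (the assignment does not name a language, so that choice is an assumption). The names LinkParser, local_path and unreachable_files are made up for illustration. It also assumes that the base URI corresponds to the top of the snapshot directory, that a link ending in "/" means that directory's index.html, and that only files ending in .html are scanned for further links.

```python
import os
import sys
from collections import deque
from html.parser import HTMLParser
from urllib.parse import unquote, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect href values of <a> tags and src values of <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = {"a": "href", "img": "src"}.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.links.append(value)


def local_path(site_root, page, link):
    """Return the link target as a path relative to the snapshot, or None if external."""
    page_url = urljoin(site_root, page.replace(os.sep, "/"))
    parts = urlparse(urljoin(page_url, link))
    clean = f"{parts.scheme}://{parts.netloc}{parts.path}"  # drops ?query and #fragment
    if not clean.startswith(site_root):
        return None  # different host or outside the site root: external
    rel = unquote(clean[len(site_root):])
    if rel == "" or rel.endswith("/"):
        rel += "index.html"  # assumption: a directory link means its index.html
    return os.path.normpath(rel)


def unreachable_files(root, base_uri, home="index.html"):
    """Breadth-first walk from the homepage; return the files never visited, sorted."""
    site_root = urljoin(base_uri, ".")  # e.g. .../solutions/index.html -> .../solutions/
    all_files = {
        os.path.relpath(os.path.join(dirpath, name), root)
        for dirpath, _dirs, names in os.walk(root)
        for name in names
    }
    seen = {home} & all_files  # empty if the homepage itself is missing
    queue = deque(seen)
    while queue:
        page = queue.popleft()
        if not page.endswith(".html"):
            continue  # images and other files cannot link onward
        parser = LinkParser()
        with open(os.path.join(root, page), encoding="utf-8", errors="replace") as fh:
            parser.feed(fh.read())
        for link in parser.links:
            target = local_path(site_root, page, link)
            if target in all_files and target not in seen:
                seen.add(target)
                queue.append(target)
    return sorted(all_files - seen)


if __name__ == "__main__" and len(sys.argv) == 3:
    for name in unreachable_files(sys.argv[1], sys.argv[2]):
        print(name)
```

Saved under a name such as file_verifier.py (again hypothetical), it would be run as "python file_verifier.py data http://www.oracle.com/us/solutions/index.html" and would print one unreachable file per line. The same local_path check, returning None for anything outside the site root, is also a possible starting point for the internal/external split in Part 1.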
  2. #2
  3. Contributing User
    Devshed Intermediate (1500 - 1999 posts)

    Join Date
    Apr 2009
    Posts
    1,930
    Rep Power
    1225
    Cross posted on perlguru and stackoverflow.
