#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2012
    Posts
    13
    Rep Power
    0

    [JS/PHP] Efficient & Elegant way to scrape specific info off webpage?


    Hello,

    I'm guessing a REGEX is what I need, so I'm posting my question here. It involves a JavaScript script and a PHP script working together.

    The summary is, I want to grab specific entities of text off of specific webpages and insert them into a MySQL database.


    I need to make a JS script (for users to install in their browser via GreaseMonkey or Chrome) that will only work on specific URLs. This is a hobby project for a text-based browser game; I want to grab info off of pages and then organize it in a database for future analysis.

    I am already building the database as I know the specific info I want to collect.

    What I don't know is the most efficient and elegant way to grab specific pieces of text off a webpage and get them into a database.

    Would I have the JS script grab everything between the BODY tags and send it to the PHP script to be parsed for specific bits of info and THEN the PHP script also inserts it into the database?

    Or would I have the JS script parse the web page itself, and then send the specific bits of info, already separated by the JS, to the PHP script to be inserted into the DB?

    I'm assuming REGEX is what I want to use? But do I want to do the REGEX in the JS script, the PHP script, or would it need to be in both?


    I'm a newb, so I know enough to confuse myself, but not enough to know the elegant solution to this. I don't want the JS script to slow down the webpage as it's a game, so I hope someone can help me figure out an efficient way to do this or at least point me in the right direction.


    Thank you for your time, I do appreciate it.
  2. #2
  3. CSS & JS/DOM Adept
    Devshed Supreme Being (6500+ posts)

    Join Date
    Jul 2004
    Location
    USA (verifiably)
    Posts
    20,122
    Rep Power
    4258
    Welcome to DevShed Forums, Korben Dallas.

    RegEx is not very suitable for parsing HTML. Since you're using "user scripts" I would recommend that you take advantage of the browser's built-in DOM methods to extract the pieces of information you want. Another advantage that doing that gives you is that less data has to be transferred to the server.
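
    A minimal sketch of that idea, assuming the values of interest are the text of certain links on the page. The helper names `collectLinkTexts` and `buildPayload`, the `needle` parameter, and the `items` payload key are hypothetical; only `querySelectorAll`, `getAttribute`, and `textContent` are standard DOM.

```javascript
// Hypothetical helper: collect the text of every link whose href contains
// `needle`. `root` is anything with querySelectorAll (document or an element).
function collectLinkTexts(root, needle) {
  var links = root.querySelectorAll('a');
  var out = [];
  for (var i = 0; i < links.length; i++) {
    var href = links[i].getAttribute('href') || '';
    if (href.indexOf(needle) !== -1) {
      out.push(links[i].textContent.trim());
    }
  }
  return out;
}

// Only the extracted strings are serialized, not the whole page,
// so the request sent to the server stays small.
function buildPayload(root, needle) {
  return JSON.stringify({ items: collectLinkTexts(root, needle) });
}
```

    In a user script, `root` would be `document`, and the resulting string would be posted to the PHP script, which then only has to validate and insert it.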


    P.S. Big ba da boom!
    Spreading knowledge, one newbie at a time.

    Check out my blog. | Learn CSS. | PHP includes | X/HTML Validator | CSS validator | Common CSS Mistakes | Common JS Mistakes

    Remember people spend most of their time on other people's sites (so don't violate web design conventions).
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2012
    Posts
    13
    Rep Power
    0

    Mul-ti-passss.... YES, she KNOWS its a multipass!


    Thanks for the welcome and reply.


    Okay, after reading all those freakouts about REGEX, I am now convinced I don't need it for this.


    So maybe a mod can move this thread to the JavaScript section?


    Speedy, so I need a JavaScript file that will use the browser's DOM to pluck the bits of data I want off the page and feed them to my PHP script?

    What would this method be called? Is there a term for it that people already use, or should I just go Googling "javascript dom html parsing"?
  6. #4
  7. CSS & JS/DOM Adept
    Devshed Supreme Being (6500+ posts)

    Join Date
    Jul 2004
    Location
    USA (verifiably)
    Posts
    20,122
    Rep Power
    4258
    I've put in a request for one of the mods of this forum to move this thread for you.

    You might try "javascript DOM traversal OR manipulation tutorial". Adding "HTML" to the query won't really help because it's used so very much, and people wouldn't use the term "parsing" for this because the browser does that automatically.

    Here are the sites/articles I recommend for learning about the DOM:
    Rough Guide to the DOM
    JavaScript tutorial - W3C DOM introduction
    http://www.quirksmode.org/js/contents.html#dom
    http://www.brainjar.com/dhtml/intro/
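
    As a rough illustration of what "DOM traversal" means in practice, here is a small sketch that walks a node's children and gathers the text it finds. The function name `collectText` is hypothetical; `firstChild`, `nextSibling`, `nodeType`, and `nodeValue` are standard DOM properties.

```javascript
// Recursively walk a node's children and collect non-empty text.
// nodeType 3 is a text node, nodeType 1 is an element (standard DOM values).
function collectText(node, out) {
  out = out || [];
  for (var child = node.firstChild; child; child = child.nextSibling) {
    if (child.nodeType === 3) {
      // Strip leading and trailing whitespace from the text node.
      var text = child.nodeValue.replace(/^\s+|\s+$/g, '');
      if (text) out.push(text);
    } else if (child.nodeType === 1) {
      collectText(child, out);
    }
  }
  return out;
}
```

    Calling `collectText(someElement)` returns an array of the text fragments under that element, in document order.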
  8. #5
  9. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2009
    Posts
    284
    Rep Power
    255
    To be honest, I would highly recommend you skip learning the DOM and instead learn ECMAScript 3rd and 5th editions, along with the idea that different browser vendors interpret the language differently. Instead of learning the problem, which is mostly the DOM, just learn the elegant functional language underneath it: JavaScript.

    http://www.amazon.com/JavaScript-Goo.../dp/0596517742

    And while you're reading that, you can learn jQuery or YUI to get stuff working without having to know that much:
    http://jquery.com
    http://developer.yahoo.com/yui/

    Douglas Crockford suggests that programming languages have good parts and bad parts. He points out the best parts, the ones that make JavaScript a very productive, intuitive and robust programming language. He also points out the harmful and terrible parts of the ECMAScript 3rd edition in his book, JavaScript: The Good Parts (May 2008, Douglas Crockford).
    Last edited by s-p-n; April 21st, 2012 at 12:26 AM.
    - The Wise Guy
  10. #6
  11. CSS & JS/DOM Adept
    Devshed Supreme Being (6500+ posts)

    Join Date
    Jul 2004
    Location
    USA (verifiably)
    Posts
    20,122
    Rep Power
    4258
    Would you care to explain what, in your opinion, is wrong with the DOM?

    Libraries and frameworks are useful, but not taking the time to gain a foundational understanding of the underlying languages and APIs seems rather shortsighted to me. (The DOM standards make up an API.)
  12. #7
  13. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2009
    Posts
    284
    Rep Power
    255
    Since the standards provided by the W3C for the DOM API are so vague and difficult to read, major browser vendors such as Microsoft (with Internet Explorer) misinterpret the rules. For that reason, web developers are forced to use unnecessary control flow when working with the DOM. Most developers incorrectly take out their frustration on the browser vendors that saw the problem coming (such as Microsoft, which provided the <!--[if IE]> ... <![endif]--> conditional comments and fixed many of Netscape 4's DOM problems in IE6). When you start to get inside HTML markup, you'll find that the rules for HTML 4.01 are also hard for browser vendors to interpret, especially when companies ship proprietary implementations such as DirectX in IE6, and when all of the major browser vendors pre-implement CSS and ECMAScript features. For example, -moz-, -webkit-, -ms- and -o- for early CSS3 implementations.

    Most of the issues with the DOM are not problems with the DOM API itself, but problems carried over to the DOM by the various browser vendors. IE has a problem with memory leaks when you remove part of the DOM tree while a reference to it remains somewhere else. The leaked memory stays allocated even after refreshing the browser window.

    The DOM also has Java-like naming conventions, which are unnecessarily verbose in JavaScript. For example, we've all seen these lines:
    Code:
    window.HTMLDocument
    window.HTMLElement
    window.document.getElementById()
    window.document.getElementsByTagName()
    window.HTMLHtmlElement
    window.HTMLMetaElement
    window.HTMLStyleElement
    window.HTMLIsIndexElement
    window.document.
    So, that looks a lot like Java to me.
    Well, if I were to think of the XML skeleton of a DOM in a JavaScript object, I would think of it much differently.
    Code:
    html[0]: {
    	attributes: { .. }, 
    	children: { 
    		head[0]: { 
    			attributes: {}, 
    			children: {
    				title[0]: {
    					attributes: {}, 
    					children: "Hello, World!"
    				}, 
    				meta[0]: { .. }, 
    				style[0]: { .. }, 
    				link[0]: { .. } 
    			}
    		}, 
    		body[0]: { 
    			attributes: { onload: function(e) { .. } },  
    			children: {
    				div[0]: { .. },
    				div[1]: {
    					attributes: {
    						id: "foo",
    						class[0]: "bar",
    						class[1]: "baz"
    					},
    					children: {
    						div[0]: { .. }
    					}
    				}	
    			}
    		}
    	}
    }
    Another thing to note about the DOM API is that the attribute names are statically defined for each DOM element in the specifications. That kind of overhead is unnecessary. Instead, I argue the DOM should be created dynamically, based on the attributes used in the page and available in the browser. A limitation in ES3 makes it a syntax error to use a reserved word as an unquoted property name (fixed in ES5). The term "class" found on many HTML elements has been translated to 'className' in the DOM API rather than making use of JavaScript's square-bracket property reference syntax, object['class'].
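
    A short sketch of the naming issue, for anyone who has not run into it. The `attributes` object and the `classNames` helper are hypothetical illustrations; `className` is the real DOM property that holds the space-separated class list.

```javascript
// Bracket syntax accepts a reserved word as a property name, even in ES3.
// (ES5 and later also allow attributes.class.)
var attributes = { id: 'foo' };
attributes['class'] = 'bar baz';

// Hypothetical helper: split a className-style string into the list the
// post sketches as class[0], class[1].
function classNames(el) {
  var raw = (el.className || '').replace(/^\s+|\s+$/g, '');
  return raw === '' ? [] : raw.split(/\s+/);
}
```

    With a real element, `classNames(document.getElementById('foo'))` would give an array of that element's classes.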


    While the guys were over there working on the DOM, the CSS2 specifications were also being written. Rather than communicating and creating a unified web development experience, the two parties did not collaborate on the web standards. CSS could have borrowed JavaScript's object-notation syntax, which would look like this:
    Code:
    p {
        text.decoration: "underline";
    }
    The CSS team instead chose to use dashes (-), knowing that the dash is the subtraction operator in most programming languages. The DOM developers could have lived with dashes by using JavaScript's square-bracket syntax, but instead decided to change the naming. And you guessed it: rather than 'class' the term 'className' was used, and rather than '-', camelCase was used. There are also some exceptions: a few CSS2 property names could not simply be converted to camelCase because of name collisions (for example, float is exposed as cssFloat, because float is a reserved word in JavaScript).
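
    The renaming described above is mechanical enough to sketch in a few lines. The function name `cssToDomName` is hypothetical; the mapping itself (dashes to camelCase, plus `float` to `cssFloat`) is how `element.style` names properties.

```javascript
// Convert a CSS property name to the name used on element.style.
// `float` is special-cased: the DOM exposes it as `cssFloat`.
function cssToDomName(prop) {
  if (prop === 'float') return 'cssFloat';
  // Replace each "-x" with "X", e.g. "text-decoration" -> "textDecoration".
  return prop.replace(/-([a-z])/g, function (match, letter) {
    return letter.toUpperCase();
  });
}
```

    So `el.style[cssToDomName('text-decoration')] = 'underline'` sets the same property that the stylesheet calls text-decoration.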

    Instead of trying to become "Super Web Guy" and counter ALL of the vendor misinterpretations, which are justified by the poor specs for the web, realize the hard work has already been done. jQuery made the DOM what it could've been had the DOM and CSS2 guys gotten together and figured out the problem. Memory leaks found in IE browsers are often solved simply by using jQuery instead of the DOM. The IE6 event bubbling standard was replaced by the W3C when they wrote the DOM, causing web developers to incorrectly blame Microsoft for a problem it had originally solved from Netscape. jQuery can bubble-in or bubble-out in any browser, so the direction of event bubbling doesn't need to be rehacked by every web designer anymore.

    There is a major advantage to using a framework such as jQuery instead of learning bad programming practices introduced by so-called "experts" who, 13 years ago, ruined the web we have today. I think the biggest problem was that the DOM developers were Java developers, and most people in 1999 thought JavaScript was useless apart from a few benefits found in the DOM. Bless their hearts, they knew not what they standardized.

    Apologies for the long rant. I hope that answers your question about my views on the DOM, and why I suggest avoiding it altogether for the purpose of properly learning JavaScript, which does not include the DOM, since the DOM API is not part of ECMAScript.


    Note: This post is highly influenced by Douglas Crockford, so it's hard to claim any of this as my own research. Though I've cross-referenced, Crockford has done most of the work of digging into the standards committees and blaming them, finding the best possible practices, and moving forward in web development.
    Last edited by s-p-n; April 21st, 2012 at 12:30 PM.
    - The Wise Guy
  14. #8
  15. No Profile Picture
    Lost in code
    Devshed Supreme Being (6500+ posts)

    Join Date
    Dec 2004
    Posts
    8,301
    Rep Power
    7170
    While I fully agree that both the standards and the implementations are a mess, unfortunately it's not something you can just ignore. You need at least a basic understanding of the DOM, as it currently exists, in order to effectively use frameworks like jQuery.

    The simple fact is that the OP is not going to be able to solve their problem if they don't have a basic understanding of the DOM, even if they use a framework.
    PHP FAQ

    Originally Posted by Spad
    Ah USB, the only rectangular connector where you have to make 3 attempts before you get it the right way around
  16. #9
  17. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2009
    Posts
    284
    Rep Power
    255
    It is definitely true that for anyone who plans on using JavaScript in the browser (99.999% of everyone who plans on using JavaScript), the DOM is an essential part of the language and is to be learned and used.
    - The Wise Guy
  18. #10
  19. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2012
    Posts
    13
    Rep Power
    0
    I will dig into both the DOM links you guys provided and the book, which I picked up today. I appreciate all the discussion; it's very interesting, even if some of it has currently whooshed over my head.
  20. #11
  21. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2012
    Posts
    13
    Rep Power
    0

    Node.js and jsdom


    http://blog.nodejitsu.com/jsdom-jque...ines-on-nodejs

    Can jsdom + jQuery DOM scraper work on a site you have to first login to? Or is there a simpler way to collect info and stick it in a database?


    This is just a hobby project so I don't want it to be huge and convoluted and I am hoping to figure out a simple and elegant solution.


    This is a screen shot of the type of web page I want to grab text off of:
    http://i.imm.io/nlXM.png

    I want to be able to load this page and, via JS or jQuery, grab that list of players and their ships and insert it into a MySQL database on my own personal website to track, organize and analyze.

    This is an example of the source code on that page:

    Code:
    <TR><TD ALIGN=CENTER><IMG SRC=images/clear.gif HEIGHT=8 WIDTH=1 BORDER=0><BR>
    Starships in Grid<BR>
    <IMG SRC=images/clear.gif HEIGHT=5 WIDTH=1 BORDER=0><BR>
    <TABLE CELLPADDING=2 CELLSPACING=0 WIDTH=100% BORDER=0>
    <TR>
    <TD BGCOLOR=#151515 WIDTH=16><IMG SRC=images/user/54413_mini1.gif WIDTH=16 HEIGHT=16 BORDER=0></TD>
    <TD BGCOLOR=#151515 WIDTH=1><IMG SRC=images/clear.gif HEIGHT=1 WIDTH=1 BORDER=0></TD>
    <TD BGCOLOR=#151515><FONT COLOR=#9D9DA1><A HREF='index.php?go=ship_info&ship_id=79623' ONMOUSEOVER="window.status=''; return true">Boreas</A> (<A HREF='index.php?go=class_info&class_id=14' ONMOUSEOVER="window.status=''; return true">BTR</A>)</FONT></TD>
    <TD BGCOLOR=#151515 WIDTH=1><IMG SRC=images/clear.gif HEIGHT=1 WIDTH=1 BORDER=0></TD>
    <TD BGCOLOR=#151515><FONT COLOR=#9D9DA1>owned by Cpt. <A HREF='index.php?go=user_info&user_id=54413' ONMOUSEOVER="window.status=''; return true">Mindless</A> of the <A HREF='index.php?go=faction_info&faction_id=464' ONMOUSEOVER="window.status=''; return true">Black Beach Alliance</A>
    </FONT></TD>
    <TD BGCOLOR=#151515 WIDTH=1><IMG SRC=images/clear.gif HEIGHT=1 WIDTH=1 BORDER=0></TD>
    <TD BGCOLOR=#151515 ALIGN=RIGHT><FONT COLOR=#9D9DA1>Scan</FONT></TD>
    </TR>
    <TR>
    <TD BGCOLOR=#101010 WIDTH=16><IMG SRC=images/faction/464_mini12.gif WIDTH=16 HEIGHT=16 BORDER=0></TD>
    <TD BGCOLOR=#101010 WIDTH=1><IMG SRC=images/clear.gif HEIGHT=1 WIDTH=1 BORDER=0></TD>
    <TD BGCOLOR=#101010><FONT COLOR=#9D9DA1><A HREF='index.php?go=ship_info&ship_id=75207' ONMOUSEOVER="window.status=''; return true">SCSY Ventura</A> (<A HREF='index.php?go=class_info&class_id=14' ONMOUSEOVER="window.status=''; return true">BTR</A>)</FONT></TD>
    <TD BGCOLOR=#101010 WIDTH=1><IMG SRC=images/clear.gif HEIGHT=1 WIDTH=1 BORDER=0></TD>
    <TD BGCOLOR=#101010><FONT COLOR=#9D9DA1>owned by Cpt. <A HREF='index.php?go=user_info&user_id=181' ONMOUSEOVER="window.status=''; return true">Andym88</A> of the <A HREF='index.php?go=faction_info&faction_id=464' ONMOUSEOVER="window.status=''; return true">Black Beach Alliance</A>
    </FONT></TD>
    <TD BGCOLOR=#101010 WIDTH=1><IMG SRC=images/clear.gif HEIGHT=1 WIDTH=1 BORDER=0></TD>
    <TD BGCOLOR=#101010 ALIGN=RIGHT><FONT COLOR=#9D9DA1>Scan</FONT></TD>
    </TR>
    <TR>
    <TD BGCOLOR=#151515 WIDTH=16><IMG SRC=images/faction/464_mini12.gif WIDTH=16 HEIGHT=16 BORDER=0></TD>
    <TD BGCOLOR=#151515 WIDTH=1><IMG SRC=images/clear.gif HEIGHT=1 WIDTH=1 BORDER=0></TD>
    <TD BGCOLOR=#151515><FONT COLOR=#9D9DA1><A HREF='index.php?go=ship_info&ship_id=68408' ONMOUSEOVER="window.status=''; return true">GAR Pulseing Heart</A> (<A HREF='index.php?go=class_info&class_id=14' ONMOUSEOVER="window.status=''; return true">BTR</A>)</FONT></TD>
    <TD BGCOLOR=#151515 WIDTH=1><IMG SRC=images/clear.gif HEIGHT=1 WIDTH=1 BORDER=0></TD>
    <TD BGCOLOR=#151515><FONT COLOR=#9D9DA1>owned by Cpt. <A HREF='index.php?go=user_info&user_id=57070' ONMOUSEOVER="window.status=''; return true">Remus Cross</A> of the <A HREF='index.php?go=faction_info&faction_id=464' ONMOUSEOVER="window.status=''; return true">Black Beach Alliance</A>
    </FONT></TD>
    <TD BGCOLOR=#151515 WIDTH=1><IMG SRC=images/clear.gif HEIGHT=1 WIDTH=1 BORDER=0></TD>
    <TD BGCOLOR=#151515 ALIGN=RIGHT><FONT COLOR=#9D9DA1>Scan</FONT></TD>
    </TR>

    What would be the simplest method to grab that data and throw it into my own database?

    Thank you for your help.
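
    One possible shape for this, as a sketch only and assuming every ship row keeps the link pattern shown in the sample above (`go=ship_info`, `go=class_info`, `go=user_info`, `go=faction_info`). The names `idFromHref`, `LINK_KINDS`, `parseShipRow`, `parseShipRows`, and the record field names are hypothetical. A user script runs inside the page the player has already logged in to, so it can read those rows directly through the DOM.

```javascript
// Pull one numeric query parameter (e.g. ship_id) out of an href string.
function idFromHref(href, param) {
  var m = new RegExp('[?&]' + param + '=(\\d+)').exec(href);
  return m ? parseInt(m[1], 10) : null;
}

// Map the "go" value of each link to the record field it fills
// and the query parameter that carries its id.
var LINK_KINDS = {
  ship_info:    { name: 'ship',    idParam: 'ship_id' },
  class_info:   { name: 'class',   idParam: 'class_id' },
  user_info:    { name: 'captain', idParam: 'user_id' },
  faction_info: { name: 'faction', idParam: 'faction_id' }
};

// Build one record from a table row, or return null if it has no ship link.
function parseShipRow(row) {
  var links = row.getElementsByTagName('a');
  var record = {};
  for (var i = 0; i < links.length; i++) {
    var href = links[i].getAttribute('href') || '';
    var go = /[?&]go=(\w+)/.exec(href);
    var kind = go && LINK_KINDS.hasOwnProperty(go[1]) ? LINK_KINDS[go[1]] : null;
    if (kind) {
      record[kind.name] = links[i].textContent;
      record[kind.name + 'Id'] = idFromHref(href, kind.idParam);
    }
  }
  return record.ship ? record : null;
}

// Parse a list of rows, skipping rows that are not ship rows.
function parseShipRows(rows) {
  var out = [];
  for (var i = 0; i < rows.length; i++) {
    var rec = parseShipRow(rows[i]);
    if (rec) out.push(rec);
  }
  return out;
}
```

    The rows passed in should be the rows of the inner ship table only, since the outer row that wraps the whole table contains every link at once. The resulting array can be serialized with `JSON.stringify` and posted to a PHP script, which would validate the values and insert them into MySQL.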
