#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2011
    Posts
    7
    Rep Power
    0

    Parsing HTML page -- get fatal error on DOCTYPE


    I'm trying to parse a URI, but I get:
    [Fatal Error] beaconschedule.html:1:63: White spaces are required between publicId and systemId.
    I understand the cause of the error, but how do I get around it? I cannot change the HTML. Do I need to run some kind of HTML tidy tool before parsing?

    As an FYI this is the code:
    java Code:
    NodeList nl = null;
    try {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = (Document) builder.parse("http://www.ncdxf.org/beacon/beaconschedule.html");
        XPathFactory xPathfactory = XPathFactory.newInstance();
        XPath xpath = xPathfactory.newXPath();
        XPathExpression expr = xpath.compile("/html/body/p/table/tbody/tr[2]/td[2]/table/tbody/tr[3]");
        nl = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
    }
    catch (Exception e) {}

    int i = 0;
    for (i = 0; i < nl.getLength(); i++)
    {
        System.out.println(nl.item(i).toString());
    }
  2. #2
  3. Daniel Schildsky
    Devshed Intermediate (1500 - 1999 posts)

    Join Date
    Mar 2004
    Location
    KL, Malaysia.
    Posts
    1,555
    Rep Power
    1621

    Skipping DTD error


    You can implement your own ErrorHandler and set it on the DocumentBuilder before you start parsing the HTML file.

    java Code:
     
    import org.xml.sax.ErrorHandler;
    import org.xml.sax.SAXException;
    import org.xml.sax.SAXParseException;
     
    public class CustomisedErrorHandler implements ErrorHandler
    {
         public void error(SAXParseException e)
               throws SAXException{
             System.out.println("Error caught");
             // You should use a logger framework to log
             // errors instead. System.out.println is used here
             // for simplicity.
         }
     
         public void warning(SAXParseException e)
               throws SAXException{
              System.out.println("Warning caught");
              // Likewise, a logger framework should be used to log
              // warnings.
         }
     
         public void fatalError(SAXParseException e)
               throws SAXException{
              System.out.println("Fatal error caught");
              // Likewise, a logger framework should be used to log
              // fatal errors.
     
              // A typical ErrorHandler implementation should rethrow
              // the exception, but here you MUST NOT rethrow it.
         }
    }
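
To see what such a handler actually receives, here is a minimal sketch that records each report with its line and column instead of printing a fixed string. The names `RecordingErrorHandler`, `HandlerDemo`, `collectProblems` and the `SAMPLE` markup are made up for illustration. The sample is only a guess at the kind of DOCTYPE that triggers this error (a public ID with no system ID), not the real page.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical handler: stores every report so it can be inspected later.
class RecordingErrorHandler implements ErrorHandler {
    final List<String> messages = new ArrayList<>();

    private void record(String level, SAXParseException e) {
        messages.add(level + " at " + e.getLineNumber() + ":" + e.getColumnNumber()
                + " " + e.getMessage());
    }

    @Override public void warning(SAXParseException e) { record("warning", e); }
    @Override public void error(SAXParseException e) { record("error", e); }
    // Deliberately does not rethrow, as in the handler above.
    @Override public void fatalError(SAXParseException e) { record("fatal", e); }
}

class HandlerDemo {
    // Assumed sample input: a public ID with no system ID after it.
    static final String SAMPLE =
            "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\">"
            + "<html><body><p>beacon</p></body></html>";

    // Parses the markup and returns whatever the handler recorded.
    static List<String> collectProblems(String markup) {
        RecordingErrorHandler handler = new RecordingErrorHandler();
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(handler);
            builder.parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            // The parser may still abort after reporting a fatal error;
            // the details are already stored in the handler.
        }
        return handler.messages;
    }

    public static void main(String[] args) {
        collectProblems(SAMPLE).forEach(System.out::println);
    }
}
```

Each recorded entry carries the same line and column information that appears in the `[Fatal Error]` message, which makes it easier to log or to decide which problems can safely be ignored.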


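    Then set the following feature on the DocumentBuilderFactory so that the parser continues after a fatal error has been reported. The feature is defined by Apache Xerces, so a parser that is not Xerces-based may not recognise it.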
    java Code:
     
    NodeList nl = null;
    try {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Set the feature so that the parser continues after a
        // fatal error.
        try {
            factory.setFeature("http://apache.org/xml/features/continue-after-fatal-error",
                    true);
        }
        catch (ParserConfigurationException e) {
            System.err.println("could not set parser feature");
        }
     
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Set the custom error handler here.
        builder.setErrorHandler(new CustomisedErrorHandler());
     
        Document doc = builder.parse("http://www.ncdxf.org/beacon/beaconschedule.html");
        XPathFactory xPathfactory = XPathFactory.newInstance();
        XPath xpath = xPathfactory.newXPath();
        XPathExpression expr = xpath.compile("/html/body/p/table/tbody/tr[2]/td[2]/table/tbody/tr[3]");
        nl = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
    }
    catch (Exception e) {
        // Report the failure instead of swallowing it silently.
        e.printStackTrace();
    }
    // nl stays null if parsing or the XPath evaluation failed.
    for (int i = 0; nl != null && i < nl.getLength(); i++)
    {
        System.out.println(nl.item(i).toString());
    }
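
If the parser in use does not accept that feature, another option in the spirit of the "tidy before parsing" idea from the question is to drop the DOCTYPE declaration before handing the markup to the DocumentBuilder. The following is only a sketch under assumptions: `DoctypeStripper`, its methods and the `SAMPLE` string are hypothetical, the regular expression does not handle a DOCTYPE with an internal subset, and the approach helps only if the rest of the page is well-formed XML.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: removes the DOCTYPE so the XML parser never sees it.
class DoctypeStripper {
    // Matches a simple DOCTYPE declaration (no internal subset).
    private static final Pattern DOCTYPE =
            Pattern.compile("<!DOCTYPE[^>]*>", Pattern.CASE_INSENSITIVE);

    // Assumed sample input; the real page is not reproduced here.
    static final String SAMPLE =
            "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\">"
            + "<html><body><p>beacon</p></body></html>";

    static String stripDoctype(String markup) {
        return DOCTYPE.matcher(markup).replaceFirst("");
    }

    static Document parseWithoutDoctype(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(stripDoctype(markup))));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // The remaining markup was not well-formed XML, or no parser was available.
            throw new IllegalStateException("could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseWithoutDoctype(SAMPLE);
        // prints "beacon"
        System.out.println(doc.getElementsByTagName("p").item(0).getTextContent());
    }
}
```

For a live page the markup would first have to be downloaded into a String. The XPath code from the question can then be run against the returned Document unchanged.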
    When the programming world turns decent, the real world will turn upside down.
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2011
    Posts
    7
    Rep Power
    0
    Originally Posted by tvc3mye
    You can implement your own ErrorHandler and set it to the DocumentBuilder before you start parsing the HTML file.
    Thanks for such a detailed reply, and for showing me how to write error handlers in Java, which I hadn't tried before.
