#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2011
    Posts
    3
    Rep Power
    0

    Need help to extract <TITLE> from multiple files & directories


    Hi,
    This is my first post here and I know this is probably not the best place to have this thread but I am not sure in which language this would be best.

    I have an old legacy website with a large directory structure (should have been created using a SQL DB in all honesty) on my local drive.
    The html files all have numbers instead of names.

    I need to extract the <TITLE> for each HTML file along with its parent directory and place into a Spreadsheet

    so the structure would be as follows

    Directory name Page Title
    Directory name Page Title

    one entry per page for the whole site.

    any help would be appreciated as I don't fancy opening up each file and copy & pasting about 700 times

    I was sure I could write a DOS command to do it for me, but for the life of me I cannot figure it out. Although I am open to any suggestions on alternate languages.
    I have tried searching the site but to be honest my programming skills stopped at DOS commands about 20 years ago.

    Thanks
  2. #2
  3. Lord of the Dance
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Oct 2003
    Posts
    3,663
    Rep Power
    1945
    If you are in doubt, you can try look at the broader "idea"/problem. In your case i think the Software Design would be the best place.

    It is also a good thing to tell which system you are working with and how you want the program to be run.

    If it ia something that should run locally using Windows and MS Execel, i would say VB scripting could be an option, as it both have files access and "integration" to MS Excel
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2011
    Posts
    3
    Rep Power
    0
    Thanks, sorry I should have said I am using Windows 7
    I have Excel2002 so I would be capable of running a VB script.
    However I am unsure of how to implement it.
    The site is not being overhauled or changed, I just need a list of the page titles.
  6. #4
  7. JavaScript is not spelt java
    Devshed Novice (500 - 999 posts)

    Join Date
    Feb 2011
    Location
    Landan, England
    Posts
    743
    Rep Power
    165
    Open Excel and press Alt-F11 to open the VB Editor. Then choose Insert/ Module and copy the whole of the following code to it.

    Amend the path to your root folder. This will open each .html temporarily and extract it's title (and path) to a new worksheet.

    Once you've amended the path click withn the first Sub procedure and press F5 to run it. Added: If your nervous about running the whole thing press F8 to step through it until you're convinced. Then press F5 to run the rest of it.
    Code:
    Sub GetPathsAndTitles()
        Dim colFiles As New Collection
        Dim vFile As Variant
        Dim aCell As Range
        Dim tempWB As Workbook
        
        Application.ScreenUpdating = False
        RecursiveDir colFiles, "C:\Users\Andrew\Documents\My Web Sites\andy_web", "*.html", True
        Worksheets.Add
        Set aCell = ActiveCell
        For Each vFile In colFiles
            aCell.Value = vFile
            Set tempWB = Workbooks.Open(vFile)
            aCell.Offset(0, 1).Value = tempWB.BuiltinDocumentProperties("Title")
            tempWB.Close False
            Set aCell = aCell.Offset(1, 0)
            Next vFile
        ActiveSheet.UsedRange.EntireColumn.AutoFit
        
    End Sub
    
    'The 'Dir' function offers a convenient solution to finding files in a single folder,
    'however, due to Dir's internal implementation it can't be called recursively.
    
    'The 'RecursiveDir' function below allows us to find all files beneath a certain
    'point in the directory tree (or all .jpg files, for example). The code is based
    'on an example by Albert Kallall, adapted by Allen Browne.
    
    Public Function RecursiveDir(colFiles As Collection, _
                                 strFolder As String, _
                                 strFileSpec As String, _
                                 bIncludeSubfolders As Boolean)
    
        Dim strTemp As String
        Dim colFolders As New Collection
        Dim vFolderName As Variant
    
        'Add files in strFolder matching strFileSpec to colFiles
        strFolder = TrailingSlash(strFolder)
        strTemp = Dir(strFolder & strFileSpec)
        Do While strTemp <> vbNullString
            colFiles.Add strFolder & strTemp
            strTemp = Dir
        Loop
    
        If bIncludeSubfolders Then
            'Fill colFolders with list of subdirectories of strFolder
            strTemp = Dir(strFolder, vbDirectory)
            Do While strTemp <> vbNullString
                If (strTemp <> ".") And (strTemp <> "..") Then
                    If (GetAttr(strFolder & strTemp) And vbDirectory) <> 0 Then
                        colFolders.Add strTemp
                    End If
                End If
                strTemp = Dir
            Loop
    
            'Call RecursiveDir for each subfolder in colFolders
            For Each vFolderName In colFolders
                Call RecursiveDir(colFiles, strFolder & vFolderName, strFileSpec, True)
            Next vFolderName
        End If
    
    End Function
    
    
    Public Function TrailingSlash(strFolder As String) As String
        If Len(strFolder) > 0 Then
            If Right(strFolder, 1) = "\" Then
                TrailingSlash = strFolder
            Else
                TrailingSlash = strFolder & "\"
            End If
        End If
    End Function
    Last edited by AndrewSW; May 21st, 2011 at 02:44 PM.
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2011
    Posts
    3
    Rep Power
    0
    Hi, That looks exactly the kind of thing I'm after, however I have come across a problem, when stepping through the code with F8 it works fine with no errors, however the spreadsheet is not recording any data, all the sheets are empty.
    I run the whole code with F5 it runs for a while then returns an error code "run time error 53" File not found.

    I've checked and then double checked the path and all is fine
    Any suggestions?

    Thanks

    but this is absolutey what I am looking for
  10. #6
  11. JavaScript is not spelt java
    Devshed Novice (500 - 999 posts)

    Join Date
    Feb 2011
    Location
    Landan, England
    Posts
    743
    Rep Power
    165
    I ran it again from a large html directory. I had to press Ok a few times when it couldn't load attached css files, but if the files contain links to JavaScript files this could be more of an issue. Adding false to the following line tells it not to attempt to update links - this may help:
    Code:
    Set tempWB = Workbooks.Open(vFile,false)
    You could delete the line Application.ScreenUpdating = False so you could see that it's working, but it will then run very slowly.

    If you still can't get it working then you may have to look into usng file IO to open the html files as text files, and using a regular expression to find the Title. This is a little more complicated..

    Note: Having filenames beginning with numbers doesn't help
  12. #7
  13. JavaScript is not spelt java
    Devshed Novice (500 - 999 posts)

    Join Date
    Feb 2011
    Location
    Landan, England
    Posts
    743
    Rep Power
    165
    Try also
    Code:
    Set tempWB = Workbooks.Open(vFile,false,true)
    which opens the files as read-only.

    There is also a Microsoft download called DSO discussed within this page that would enable you to read document proeprties without opening the file, which I assume would work even with .html files.
  14. #8
  15. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2013
    Posts
    1
    Rep Power
    0

    I need your help!


    Hi AndrewSW, Thanks a lot for your code. It is really helpful. Now I was trying to change the ("Title") into ("Keywords") or ("Description"), and the content won't be showing there on the sheet.

    AS you know, My html meta tags are like:

    <title>GetPathsAndTitles</title>
    <meta name="keywords" content="Get Paths, Titles">
    <meta name="description" content="GetPathsAndTitles">

    Could you please kindly help me with this?

    Thank you in advance.

    Tom




    Originally Posted by AndrewSW
    Open Excel and press Alt-F11 to open the VB Editor. Then choose Insert/ Module and copy the whole of the following code to it.

    Amend the path to your root folder. This will open each .html temporarily and extract it's title (and path) to a new worksheet.

    Once you've amended the path click withn the first Sub procedure and press F5 to run it. Added: If your nervous about running the whole thing press F8 to step through it until you're convinced. Then press F5 to run the rest of it.
    Code:
    Sub GetPathsAndTitles()
        Dim colFiles As New Collection
        Dim vFile As Variant
        Dim aCell As Range
        Dim tempWB As Workbook
        
        Application.ScreenUpdating = False
        RecursiveDir colFiles, "C:\Users\Andrew\Documents\My Web Sites\andy_web", "*.html", True
        Worksheets.Add
        Set aCell = ActiveCell
        For Each vFile In colFiles
            aCell.Value = vFile
            Set tempWB = Workbooks.Open(vFile)
            aCell.Offset(0, 1).Value = tempWB.BuiltinDocumentProperties("Title")
            tempWB.Close False
            Set aCell = aCell.Offset(1, 0)
            Next vFile
        ActiveSheet.UsedRange.EntireColumn.AutoFit
        
    End Sub
    
    'The 'Dir' function offers a convenient solution to finding files in a single folder,
    'however, due to Dir's internal implementation it can't be called recursively.
    
    'The 'RecursiveDir' function below allows us to find all files beneath a certain
    'point in the directory tree (or all .jpg files, for example). The code is based
    'on an example by Albert Kallall, adapted by Allen Browne.
    
    Public Function RecursiveDir(colFiles As Collection, _
                                 strFolder As String, _
                                 strFileSpec As String, _
                                 bIncludeSubfolders As Boolean)
    
        Dim strTemp As String
        Dim colFolders As New Collection
        Dim vFolderName As Variant
    
        'Add files in strFolder matching strFileSpec to colFiles
        strFolder = TrailingSlash(strFolder)
        strTemp = Dir(strFolder & strFileSpec)
        Do While strTemp <> vbNullString
            colFiles.Add strFolder & strTemp
            strTemp = Dir
        Loop
    
        If bIncludeSubfolders Then
            'Fill colFolders with list of subdirectories of strFolder
            strTemp = Dir(strFolder, vbDirectory)
            Do While strTemp <> vbNullString
                If (strTemp <> ".") And (strTemp <> "..") Then
                    If (GetAttr(strFolder & strTemp) And vbDirectory) <> 0 Then
                        colFolders.Add strTemp
                    End If
                End If
                strTemp = Dir
            Loop
    
            'Call RecursiveDir for each subfolder in colFolders
            For Each vFolderName In colFolders
                Call RecursiveDir(colFiles, strFolder & vFolderName, strFileSpec, True)
            Next vFolderName
        End If
    
    End Function
    
    
    Public Function TrailingSlash(strFolder As String) As String
        If Len(strFolder) > 0 Then
            If Right(strFolder, 1) = "\" Then
                TrailingSlash = strFolder
            Else
                TrailingSlash = strFolder & "\"
            End If
        End If
    End Function

IMN logo majestic logo threadwatch logo seochat tools logo