Editing multiple HTML files: downloading sites, mass renaming files, and find and replace with regular expressions.

This post is born of the frustration of having to edit multiple HTML files at the same time. One boring aspect of any HTML/CSS developer’s job is having to convert a client’s old website to a new template. The manual way to do it is to open each document one by one, copy the content out, and paste it into a new document with the new template… there has to be a better way!

I have done this a few times now, and below are my tips for speedy editing of multiple HTML pages. It is all about thinking laterally.

Getting files from a website

If you need to get a client’s website files, and you either don’t have access to their FTP details, or all the data resides in a database and you don’t have a programmer to help, you will need to use a site stripper. My favourite is HTTrack, as it is free and does a very good job.

Set the web address to the page that you wish to start downloading from. Most often this is the root URL of the site.

In Set Options, click the Links tab and select the Get HTML files first! option.

If you don’t want to download the whole site, set the web address to the sub-URL you want to start downloading from. Watch the folder in which the files are downloading, and stop the download when you have all of the ones you want. There might be a better way than this, but I have never bothered to experiment with HTTrack, and this method gets me what I need.

Renaming files

Never ever ever spend time renaming files by hand! If you are renaming more than five files at a time, you are wasting your time. Rename-it is a lightweight file renaming program that lets you rename files en masse, including changing file extensions, e.g. .html files to .aspx. The interface is simple and self-explanatory.

Note: to edit file extensions click the Rename menu and select Filename with extension.
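If you would rather script the rename than use a GUI tool, the same bulk extension change can be sketched in a few lines of Python. This is only a rough sketch and not how Rename-it works internally; the function name rename_extensions and its dry_run flag are my own inventions for illustration.

```python
from pathlib import Path


def rename_extensions(folder, old_ext=".html", new_ext=".aspx", dry_run=True):
    """Rename every file under `folder` ending in old_ext so it ends in new_ext.

    Returns the list of (old_path, new_path) pairs. With dry_run=True
    (the default) nothing on disk is changed, so you can check the plan first.
    """
    planned = []
    # sorted() builds the full list before any file is renamed
    for path in sorted(Path(folder).rglob(f"*{old_ext}")):
        if not path.is_file():
            continue
        target = path.with_suffix(new_ext)
        if target.exists():
            continue  # skip rather than overwrite an existing file
        planned.append((path, target))
        if not dry_run:
            path.rename(target)
    return planned
```

Run it once with the default dry_run=True and look over the returned pairs, then again with dry_run=False to actually rename the files.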

Dreamweaver

Unfortunately I will be talking about Dreamweaver, which is not free, because most HTML developers (myself included) use it. The same principles apply in other code editing programs.

The good thing about editing multiple HTML pages is that they are often almost identical. This allows liberal use of Find/Replace on multiple files. The bad thing is that a plain Find/Replace relies on the pages being completely identical, which most aren’t. The first step is to work out what content needs to remain in the files you are updating, and then replace the code around that content. To get around the differing content in things such as title or meta tags, or navigation that is section specific, we can use two of Dreamweaver’s Find/Replace features: Specific Tag and Use regular expressions.

Note: To use them across multiple HTML files you will first need to define a site containing those files, open the Find/Replace dialogue, and set Find in: Selected Files in Site. Then, in the Files panel, select the folder containing the files that you want to edit en masse.

Specific Tag

If you need to strip content from tags, select tags by their attributes and reset those attributes, or just remove a particular tag, make liberal use of the Search: Specific Tag feature in the Dreamweaver Find/Replace dialogue.

Using regular expressions (regex)

The real power in search and replace lies in using regex to change content. The Adobe site has a great overview and tutorial about using regular expressions within Dreamweaver. To use regex, do a normal find/replace, but check the Use regular expression box.

My main trick for editing multiple pages with different headers and footers is to replace the whole header with the new page header in one go. This can be done with a simple regex that matches any run of characters, line breaks included:

[^]*

In the find input area, put the doctype before this code and the closing head tag after it, so that the pattern covers the whole header you want to replace:

<!DOCTYPE[^]*</head>

In the replace input area you can then put the header code for the new pages. Obviously this regex replaces everything in the header, including any meta data you might want to keep. An example that will keep the title, meta keywords, and description would be:

Find box

(<!DOCTYPE)([^]*)(<title>)([^]*)(</title>)([^]*)(<meta name="description"
content=")([^]*)(" />)([^]*)(<meta name="keywords" content=")([^]*)(" />)
([^]*)(</head>)

Replace box

<!DOCTYPE whatever your doctype is>
<title>$4</title><meta name="description" content="$8" />
<meta name="keywords" content="$12" />
</head>

The $x tokens in the replace box take the value of each subexpression (each pair of brackets) in the find box, numbered from left to right, and insert it where you have specified. Here $4 is the title text, $8 is the description, and $12 is the keywords.
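For anyone working outside Dreamweaver, here is a rough Python sketch of the same capture-and-reinsert idea. It is a translation under a few assumptions, not Dreamweaver’s syntax: Python’s re module has no [^] token, so the sketch uses .*? with re.DOTALL instead, and named groups stand in for $4, $8, and $12. The names HEAD_PATTERN, NEW_HEAD, and replace_head are made up for illustration, and the doctype in NEW_HEAD is just a placeholder for whatever your new template uses.

```python
import re

# Same tag order as the find box above: title, then description, then keywords.
# .*? with re.DOTALL plays the role of [^]* (any characters, newlines included).
HEAD_PATTERN = re.compile(
    r'<!DOCTYPE.*?<title>(?P<title>.*?)</title>'
    r'.*?<meta name="description"\s+content="(?P<description>[^"]*)"\s*/>'
    r'.*?<meta name="keywords"\s+content="(?P<keywords>[^"]*)"\s*/>'
    r'.*?</head>',
    re.DOTALL | re.IGNORECASE,
)

# Placeholder header for the new template; swap in your own markup.
NEW_HEAD = (
    "<!DOCTYPE html>\n"
    "<html>\n"
    "<head>\n"
    "<title>{title}</title>\n"
    '<meta name="description" content="{description}" />\n'
    '<meta name="keywords" content="{keywords}" />\n'
    "</head>"
)


def replace_head(html):
    """Swap the old header for NEW_HEAD, carrying over the captured values."""
    def build(match):
        # groupdict() holds the title, description, and keywords text
        return NEW_HEAD.format(**match.groupdict())

    # Pages that do not match the pattern are returned unchanged
    return HEAD_PATTERN.sub(build, html, count=1)
```

The attribute values are captured with [^"]* rather than a match-anything pattern, so each capture stops at the closing quote of its own attribute.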

Note: regular expressions are unforgiving. I always run a find (rather than a find/replace) first to make sure I am selecting the right files. And of course, back up your files before making mass changes, as you cannot undo anything.

Often one character is incorrect and the expression won’t return anything, which is very frustrating. It can take a bit of tweaking to get them to work well, but spending half an hour making sure that the order of meta and title tags is correct can save three hours cutting and pasting.
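The find-first and back-up-first habits carry over if you script this kind of change instead. Below is a minimal Python sketch, assuming the files are UTF-8 encoded (older sites often are not, so check first). The names HEADER, find_matches, and replace_with_backup are hypothetical, and the .bak suffix is just one way of keeping a copy.

```python
import re
import shutil
from pathlib import Path

# Whole-header pattern: everything from the doctype to the closing head tag
HEADER = re.compile(r"<!DOCTYPE.*?</head>", re.DOTALL | re.IGNORECASE)


def find_matches(folder, pattern=HEADER, glob="*.html"):
    """Dry run: list the files whose text matches, without changing anything."""
    hits = []
    for path in sorted(Path(folder).rglob(glob)):
        if pattern.search(path.read_text(encoding="utf-8")):
            hits.append(path)
    return hits


def replace_with_backup(paths, replacement, pattern=HEADER):
    """Copy each file to <name>.bak, then replace the first match in place."""
    for path in paths:
        shutil.copy2(path, path.with_name(path.name + ".bak"))
        text = path.read_text(encoding="utf-8")
        # A function replacement inserts the text literally, so backslashes
        # in the new header are not treated as regex escapes
        new_text = pattern.sub(lambda m: replacement, text, count=1)
        path.write_text(new_text, encoding="utf-8")
```

Call find_matches first and compare its result with the list of files you expected to change; only pass that list to replace_with_backup once the two agree.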

