In my last post titled Lesson Re-learned: Backups !, I admitted that I had committed the cardinal sin of making changes to my web site without doing a backup first (walking the tightrope without a net).
Luckily for me I had installed the WP Super Cache plugin, so all of my content actually still existed as static files, and being a bit of a hacker, I was able to throw together some code to effectively recover my posts.
The first step was to look at the files in the cache and figure out how much of the content was there. Interestingly, the cache does a lot of work to put all the files together, and it builds a tree that closely follows the structure of your posts.
Step 1 - download the files from my WordPress server onto my development box. Under my wp-content folder there is a cache folder that contains the supercache folder, which holds all the static pages generated by WP Super Cache.
Now you'll notice that this folder contains a ton of folders that aren't directly related to my regular domain name of www.accuweaver.com. I think that is partly because WP Super Cache tries to be really efficient, and also partly because my server actually serves some other domains that get redirected through my Apache server.
With a little investigation, it looked to me like the best candidate for recovery was the folder named blog.accuweaver.com (which happened to be the WordPress installation I'd deleted that had started this mess).
I downloaded all of these files and folders to my local Mac ~/Sites/accuweaver.com folder so that I could see what the pages look like and be certain that I had a backup copy of the content.
Step 2 - browse the contents of the folders and see what the structure looks like. By opening each folder locally, I was able to see that the posts appeared to pretty much all be there, and at each level of the tree, there was an index.html that was a static copy of what that particular URL would have looked like.
So for instance, loading the index.html file at the root of the blog.accuweaver.com folder gives me the home page that I originally had on my server.
Browsing through the folders, I see that at each level of the tree there are index.html files that represent what you would have seen at that URL. For instance, the file at blog.accuweaver.com/2008 is the yearly archives page, blog.accuweaver.com/2008/12 is the archives for December of 2008, and so on until you get to the actual post, which is a folder with the post name like blog.accuweaver.com/2008/12/27/another-great-reason-to-use-a-gps-when-cycling/.
Since all I'm interested in is the information at the bottom level, those will be the index.html files I'm looking for.
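As a rough illustration (not the code from the repository), picking out those bottom-level files could look like the sketch below. The class name LeafIndexFinder is hypothetical, and it assumes a post folder is one that has an index.html and no sub-folders, which may not hold for every cache layout.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: finds the index.html files at the bottom of the cache tree. */
final class LeafIndexFinder {

    /** Returns every index.html under root whose folder has no sub-folders. */
    static List<Path> findLeafIndexFiles(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isDirectory)
                    .filter(LeafIndexFinder::isLeaf)
                    .map(dir -> dir.resolve("index.html"))
                    .filter(Files::isRegularFile)
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    /** Assumption: a folder is a leaf when none of its children are directories. */
    private static boolean isLeaf(Path dir) {
        try (Stream<Path> children = Files.list(dir)) {
            return children.noneMatch(Files::isDirectory);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling LeafIndexFinder.findLeafIndexFiles on the downloaded blog.accuweaver.com folder would skip the yearly and monthly archive pages, because those folders still contain sub-folders.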
Step 4 - Create a WordPress export file from the post I just created. Since my plan is to upload the content from these index.html files, I will need to reformat them into something that WordPress will understand. So by exporting from WordPress, I will have an XML file that I can use to follow in my coding. And since prior to the import I only have one post, it should be pretty easy to recreate.
First I go to the WordPress Admin page under Tools/Export, then choose only the posts (since I don't really care about the other items):
Clicking the "Download Export File" gives me an XML file that will be the target for what I need to restore my posts.
Step 5 - test using the editor. Before I start automating this whole thing, I want to test if it will work at all, so I take my output file, and try to create a second post to restore using one of the index.html files.
From a brief investigation, it appears that what I need to do is to add a new "item" tag to the XML file with the content from one of those HTML pages, so I make a second "<item>" section and carefully edit the tags to match values from the index.html.
A couple of false starts where I discover that the post ID matters, and that I need to change a few other tags to make things right, and I now have confidence that I should be able to automate crafting of this file from my cache.
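To make the shape of that hand edit concrete, here is a minimal sketch of rendering one post as an item fragment. The class WxrItemWriter and its parameters are hypothetical, and the element names are the ones I understand the WordPress export format to use; a real export carries more tags than this (link, pubDate, guid and so on), so treat your own export file as the reference.

```java
/** Hypothetical sketch: renders one recovered post as an <item> fragment. */
final class WxrItemWriter {

    /** Escapes the characters that would break plain XML text. */
    static String escapeXml(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Wraps text in CDATA, splitting any "]]>" so the section cannot end early. */
    static String cdata(String text) {
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    /** postId must be unique in the file; reusing an ID was one of the false starts. */
    static String toItem(int postId, String title, String slug, String postDate, String html) {
        return String.join("\n",
                "<item>",
                "  <title>" + escapeXml(title) + "</title>",
                "  <content:encoded>" + cdata(html) + "</content:encoded>",
                "  <wp:post_id>" + postId + "</wp:post_id>",
                "  <wp:post_date>" + postDate + "</wp:post_date>",
                "  <wp:post_name>" + escapeXml(slug) + "</wp:post_name>",
                "  <wp:status>publish</wp:status>",
                "  <wp:post_type>post</wp:post_type>",
                "</item>");
    }
}
```

Building the fragment as text keeps the sketch short; an XML library would be the safer choice for anything beyond a one-off recovery.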
Step 6 - Start coding - now this probably would be really easy in PHP, since I could probably just borrow some code from the WP Super Cache, and just hack together something that works. However, since I just started a job where I'm going to be using Java again a lot, I figured this would be a good exercise in writing some Test Driven Development (TDD) code in Java.
So being a semi-good TDD citizen, I start a new project in NetBeans using Maven. I create my main class, with some methods that I think I want, and then generate the skeleton unit tests. As in all good TDD, I run the tests and watch them all fail (which of course they do, because the methods and tests are all stubbed out).
Next I start writing the unit tests to test my conditions. First I want a method that will return true if a folder has an index.html file in it. I create a small test data folder, and copy in a portion of my cache to the src/test/resources/data folder (making an input, output, and expected folder that I'll use later).
The test for this one is simple: point the routine at a folder with an index.html and I should get "true"; point it at another without an index.html and I should see "false". The only other trick is to make sure to use the Maven way of getting the relative file location, which is to use getResource relative to the class. I code a convenience method for this in my test:

    /**
     * Convenience method to get the full file system file name for testing
     *
     * @param fileName
     * @return Full path for file in the test folder ...
     */
    public static String getRelativeFileName(String fileName) {
        // Resolve against the test class itself (not Class.class) so the
        // lookup goes through the test class's own class loader
        return ConvertHTMLPostTest.class.getResource(fileName).getFile();
    }

Then I code up the test method:

    /**
     * Test of hasIndexFile method, of class ConvertHTMLPost.
     * @throws Exception
     */
    @Test
    public void testHasIndexFile() throws Exception {
        System.out.println("hasIndexFile");
        ConvertHTMLPost instance = new ConvertHTMLPost();
        boolean result = instance.hasIndexFile(getRelativeFileName("/data/input"));
        assertFalse(result);
        assertTrue(instance.hasIndexFile(getRelativeFileName("/data/input/2008/11")));
    }

Running my tests, they of course still fail (since I haven't written any code for hasIndexFile except to return false). So next I code the method to actually do the check:

    // Needs java.io.IOException, java.util.logging.Level, and from java.nio.file:
    // DirectoryStream, Files, FileSystems, LinkOption, Path
    /**
     * Check to see if the directory contains an index.html file ...
     *
     * @param dirName Directory to check
     * @return true if the directory contains a regular file named index.html
     * @throws IOException
     */
    public boolean hasIndexFile(String dirName) throws IOException {
        // try-with-resources closes the directory stream, even on the early return
        try (DirectoryStream<Path> dirStream =
                Files.newDirectoryStream(FileSystems.getDefault().getPath(dirName))) {
            for (Path path : dirStream) {
                if (Files.isRegularFile(path, LinkOption.NOFOLLOW_LINKS)
                        && path.getFileName().endsWith("index.html")) {
                    logger.log(Level.FINE, "Directory ''{0}'' has index.html", dirName);
                    return true;
                }
            }
        }
        logger.log(Level.FINE, "Directory ''{0}'' has no index.html", dirName);
        return false;
    }

Step 7 - normal code to completion - so I do the same for the rest of the methods I think I need: writing tests, refactoring, getting the tests to succeed, and finally I'm ready to output a test file. I start with just a couple of folders from the cache, and run through to see how close I am.
I upload the file that I output, and find a few more things that aren't formatted exactly right, or cause problems on the upload. For the most part I try to be a good TDD person and create tests that will fail for these, but a few (like having some extra empty lines), I just fix in the code.
Step 8 - full run - I run the code against the full cache, only to find that I get a bunch of item entries that are just an empty title. It turns out that a lot of the stuff in the cache isn't really in the format that I'm trying to parse. I add some TODOs to the code comments.
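One way to keep those empty entries out of the output is to skip any page that yields no usable title. The sketch below is only an illustration: TitleExtractor is a hypothetical name, and it assumes the title is read from the page's title tag, which may not be where the real code looks.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical guard: returns a title only when the page actually has one. */
final class TitleExtractor {

    // Assumption: the post title can be taken from the <title> element.
    private static final Pattern TITLE =
            Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static Optional<String> extractTitle(String html) {
        Matcher matcher = TITLE.matcher(html);
        if (!matcher.find()) {
            return Optional.empty();
        }
        String title = matcher.group(1).trim();
        return title.isEmpty() ? Optional.empty() : Optional.of(title);
    }
}
```

A caller can then write an item only when extractTitle returns a value and log the path of every page it skipped.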
Looking at the folders it becomes apparent that if I just delete the things that aren't in the part of the tree that is the date hierarchy, I get rid of that noise, so rather than correcting the code, I just eliminate that data from my input.
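The same clean-up could be done in code instead of by deleting folders. This is a minimal sketch under the assumption that every post lives at year/month/day/post-name, as in the example URL earlier; DatePathFilter is a hypothetical name.

```java
import java.util.regex.Pattern;

/** Hypothetical filter: accepts only paths inside the date hierarchy. */
final class DatePathFilter {

    // year/month/day/post-name, with an optional trailing slash
    private static final Pattern POST_PATH =
            Pattern.compile("\\d{4}/\\d{2}/\\d{2}/[^/]+/?");

    /** relativePath is the folder path relative to the root of the cache. */
    static boolean isPostPath(String relativePath) {
        // Normalize Windows separators so the same pattern works everywhere
        return POST_PATH.matcher(relativePath.replace('\\', '/')).matches();
    }
}
```

With this check, archive pages such as 2008/12 and anything outside the date tree are ignored without touching the input data.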
Final run gets me all of my posts imported into the blog.
Step 9 - review of the posts - Looking at the posts, I notice that the images are not showing up, and there are some broken URLs. I make a couple of small tweaks to the code to get all of the image URLs to be relative.
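A tweak along these lines can be sketched as below. It is not the code from the project: ImageUrlRewriter is a hypothetical name, and it assumes the images are referenced with absolute URLs under /wp-content/uploads/ on the old host.

```java
import java.util.regex.Pattern;

/** Hypothetical rewrite: turns absolute upload URLs into site-relative ones. */
final class ImageUrlRewriter {

    static String makeUploadsRelative(String html, String host) {
        // Drop the scheme and host only where the path continues into the uploads folder
        Pattern absolute = Pattern.compile(
                "https?://" + Pattern.quote(host) + "(?=/wp-content/uploads/)");
        return absolute.matcher(html).replaceAll("");
    }
}
```

Links to other pages on the same host are left alone, since the lookahead only matches URLs that point into the uploads folder.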
The images are an issue with the database tables I lost before: all of the images are still sitting in my wp-content/uploads folder on the server, so I just need a way to get WordPress to see them.
Step 10 - Add the "Add From Server" plugin - this plugin actually allows you to add the media (or other files) from your file system. A quick run through gives you all the media from the wp-content/uploads directory back into your Media library (and displaying properly on the posts).
Finally - so now all my posts are back in play. There are a number of things I haven't completed that would make this usable for others, and I did lose a few things in the translation:
- My categories were all gone (haven't looked at the cache to see if they'd be available in the index.html)
- Things like the "see more" breaks and summaries weren't included.
- Tags and links weren't all added.
- Some posts have extra stuff at the end.
So I put my code up on GitHub for anybody else who might find it useful, or might want to take this to the next level.
But better yet, just make sure to do a backup once in a while.
The code is available at https://github.com/accuweaver/WordPressRecover
The WP Super Cache is at http://wordpress.org/plugins/wp-super-cache/
And the Add From Server plugin is at http://wordpress.org/plugins/add-from-server/