Homebrew DR
Many of you guys out there have gathered a lot of digital information. This might be your private photo collections, home videos, scanned documents and your home administration. All sorts of data that sits somewhere on your computer(s) or NAS devices.
Most of you are aware of the risk of having it all stored on a single hard disk in your personal computer, whether it be a laptop or whatever workstation you fancy.
The storage bloggers I follow all seem to have a home NAS device, most likely set up with multiple drives and some form of RAID protection. RAID1, RAID5 and BeyondRAID are all very good options for home usage.
Closer to home, friends are also wising up to the risk of losing everything if it is all stored on a single drive (hopefully because I try to educate them on that risk). However, not everyone seems to be aware of the risk, nor are they willing to be educated. One person even told me his data was not at risk because the external USB drive wasn’t always connected and therefore could not break. Fortunately I was able to rid him of this misconception.
What am I going to blog about now?
Well, I have set up a way of protecting my own private and business information and have implemented a quick ‘n dirty way of document management.
The way I did this will be described later. Some might think it is a good solution, some might think otherwise. The point is, it works for me (and my family), and I am confident I can recover the data if something happens to either of the copies – the copy at home or the copy at the remote location.
Something about my data.
The private administration is something I could live without, but it would be nice to have it available. It took me some weeks of scanning documents from way back in 1996. I know, I keep too much. I was proud of how neatly I had it organized in folders and stuff, but searching for a document became quite a task, and the number of organizers was growing and eating a lot of cabinet space.
Next to the private information, I also had to organize my business information, which I have to be able to present to the local legislators, and therefore I need to have a proper way of archiving it. In the age of digital information I believe I should practice my profession and store all my documents on a server in an ordered and secure kind of way. I bought myself a shiny new all-in-one printer/scanner from HP which had to be able to scan from a document feeder, double-sided (duplex if you will) automatically. And it had to be scanned into a portable document format.
During a sale on a website I bought myself the HP7780. It has an extra paper tray, and was cheap compared to other all-in-one printer/scanners. It has an ethernet port, wireless and even bluetooth for printing from a cellphone. It is also able to save the duplex ADF scans directly into PDF format to either a removable media device, or a CIFS share. I just happened to have a CIFS share on my home server. An ideal device for my goal.
So, how does this work then?

My home server runs Ubuntu Server (it ran Debian for 9+ years before that) and holds a bunch of drives. It has two 120GB IDE drives in RAID1 to hold the operating system and all operating system related data. I also have four 1TB SATA drives in RAID5, which hold all my photos, scanned documents, business accounting data and multimedia files like MP3 and HD movies which I stream to my PS3 in HD format.
On this server I have set up CIFS shares. One of the shares points to the location where I keep my scanned documents. I have divided it into three sections: Private (P), Business (Z) and Kids (K). The Kids section is a place where I keep pictures and scanned files of stuff they made in school and the like.
The private section holds categories for every company or institution I exchange information with. These are mortgage firms, insurance firms, utility providers for electricity and all other companies that in one form or another have something to do with me living in a house with my wife and kids. Most of you probably know what I mean. Each category is represented by a folder name which makes sense to me.
The same applies to the section I devoted to my business. “Z” is the first letter of the Dutch word for business, which is “zakelijk”. I could have made it a “B”, but since we are Dutch… OK, whatever.
The folder structure looks like this (fictional names of course):
- sharename
  - private
    - mortgage-firm
    - electricity-provider
    - insurance-firm
    - the-list-goes-on…
  - business
    - accounting
    - contracts
    - the-list-goes-on…
  - kids
    - oldest-son
    - youngest-son
    - this-list-does-not-go-on…
- private
The tree structure in the example above gives you an idea how it is set up.
On the HP7780, I have configured some quick-button settings which hold options on how to scan a certain document (dpi, color/grayscale, file format) and where to put it. I have made two options per section: one for single-sided scans and one for duplex scans. The quick-buttons give me an easy way of telling the printer to scan a private document or a business document. The scanner places the document in the folder I specified in the quick-button configuration.
It scans all files into PDF for me (JPG is the other option). One downside that I have experienced so far is that I cannot do OCR on those files at the moment. The PDF files are actually just images stored in PDF, not text. You might want to use a different format if you want to do OCR. TIFF seems to work best, but the printer does not offer that format.
The printer names the files according to a counter mechanism. I can set a custom prefix, but other than that, the filename is not very helpful. Therefore, when I open the resulting files to see if they are readable, I also rename them to match a specific mask like this:
yyyy-mm-dd_firmname_description.pdf
- yyyy-mm-dd is self-explanatory. This enables me to sort the files chronologically. The date used is the date I received the particular letter, or the mail date in the letter.
- Firmname should be an indication of which institution or company this letter is from, or to (in case I send the mail).
- The rest of the filename should be a description of the contents.
- I used an underscore as a field separator. I’ll explain later why I did this.
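A minimal sketch of building a name that follows this mask. The helper names `slug` and `archive_name` are made up for illustration, and using hyphens inside a field (so the underscore only ever separates fields) is my assumption here, not necessarily how the real files are named.

```python
import re
from datetime import date


def slug(text: str) -> str:
    """Lower-case the text and turn every run of other characters into one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def archive_name(received: date, firm: str, description: str) -> str:
    """Build a file name following the yyyy-mm-dd_firmname_description.pdf mask."""
    return f"{received.isoformat()}_{slug(firm)}_{slug(description)}.pdf"


# Hypothetical letter, purely as an example.
print(archive_name(date(2009, 11, 3), "Insurance Firm", "Policy renewal 2010"))
# prints 2009-11-03_insurance-firm_policy-renewal-2010.pdf
```

Because the date comes first and is zero-padded, a plain alphabetical sort of the folder is also a chronological sort.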
After I rename the file to match the mask, I move it to the folder (aka category) it belongs to. Well, are we done now? No, of course not. I merely scanned and renamed the file and put it in a sensible place. I could be satisfied now, because I can easily retrieve or search for it. It is quite a bit better than the paper archive. It is RAID protected, so I could tolerate a disk failure. Not bad. The renaming might seem a bit cumbersome, but if you don’t backlog this too much, it isn’t really much work at all.
But, as a storage guy, I couldn’t possibly be satisfied yet. I need a way of making sure my data and 30GB of private and business files are recoverable in the event something should happen to my home server. I also want to make sure the files don’t get corrupted or overwritten accidentally. Therefore, a cron task that runs every hour marks all the new files read-only.
To make sure the files are going to be recoverable when disaster strikes, I made a perl script that uses rsync to synchronize a couple of folders (photos, file archives) to a server I lease at a hosting provider. I have leased a server there for a couple of years now, and am very satisfied with their service, uptime and of course price. Last month, I switched to a new server, with more capacity and a lower monthly fee. I now have 250GB of RAID1 protected external (you might even consider it cloud-like) storage to store my mail server data, website(s) and a copy of all my photos and file archives for € 30,- per month excluding taxes.
Since I lease the server itself, it is completely customizable to my own needs. I have a 2TB network transmit limit per month, which I never reach.
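For the curious, the read-only step could look roughly like the sketch below. This is not the actual cron task, just one way to do it in Python, and the function name `mark_read_only` is invented. It assumes that "new" simply means "still writable": anything under the folder that still has a write bit gets it cleared.

```python
import os
import stat
import tempfile

WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def mark_read_only(root: str) -> int:
    """Clear the write bits on every regular file under root; return how many changed."""
    changed = 0
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            if os.path.islink(path):
                continue  # chmod would follow the link, so leave links alone
            mode = stat.S_IMODE(os.stat(path).st_mode)
            if mode & WRITE_BITS:
                os.chmod(path, mode & ~WRITE_BITS)
                changed += 1
    return changed


if __name__ == "__main__":
    # Demo on a throwaway folder instead of a real archive.
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "scan0001.pdf"), "w") as handle:
            handle.write("placeholder")
        print(mark_read_only(root), "file(s) marked read-only")
```

Running it a second time changes nothing, so it is safe to run every hour. Keep in mind this only guards against accidental overwrites: the owner of a file can always make it writable again.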
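To give an idea of what such a sync script boils down to, here is a rough sketch in Python rather than perl. The paths, the `backup@example.org` login and the function names are placeholders I made up; only the rsync options themselves (`--archive`, `--compress`, `--rsh`) are real.

```python
import subprocess
from typing import Dict, List


def build_rsync_command(source: str, remote: str, dest: str) -> List[str]:
    """Assemble an rsync call that copies the contents of source to remote:dest over SSH."""
    return [
        "rsync",
        "--archive",   # recurse and preserve times, permissions and links
        "--compress",  # compress data in transit
        "--rsh=ssh",   # use SSH as the transport
        source.rstrip("/") + "/",  # trailing slash: copy the contents, not the folder itself
        f"{remote}:{dest}",
    ]


def sync(folders: Dict[str, str], remote: str) -> None:
    """Run one rsync per folder pair and stop at the first failure."""
    for source, dest in folders.items():
        subprocess.run(build_rsync_command(source, remote, dest), check=True)


# Only print the command here; calling sync() would need a real server.
print(" ".join(build_rsync_command("/srv/archive", "backup@example.org", "/backup/archive")))
```

Whether to add `--delete`, so that files removed at home also disappear on the remote side, is a choice this sketch leaves open.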
In comparison, if I had to subscribe to a web based backup service, I would have to spend more money on that, would have less flexibility, and would have the problem that most backup services don’t support Linux operating systems.
Phase 1 is complete.
Phase 2 is about managing these documents.
Next to backup, I wanted to make the file archive data more searchable and I wanted a form of document management. I tried several open source tools and Joomla modules to set this up, but the available products only filled part of the requirements I had. It had to be easy and searchable, and have several sections and categories.
Guess what. Joomla fills these requirements by means of its article database structure alone. It has sections, categories, and is searchable. I tried to exploit that structure and made another perl script that scans the Joomla content table for articles that match my file archive sections P and Z (but in full text of course). It also fetches all categories in the database and matches the categories I use in my file archive (each category is a company name, remember?). If I have a section or a category in my file archive which is not in the Joomla section or category list, it will be added. All files in my file archive are then compared to the articles in Joomla. If there is no article referring to a file in my archive, a new article is inserted in the content table, with references to the corresponding section and category.
The file’s yyyy-mm-dd part is used as the creation date of the article. This way I make sure the articles are in chronological order. The other parts of the file name are split on the underscore, which is why I chose it as the field separator. The words resulting from this split are used as keywords in the article and as the title of the article. Now I can also update articles if information has to be added, or in case information was exchanged with the company the document was from, or sent to. This makes it easy for us to track the history of certain items, like insurance policies.
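The split described above can be sketched in a few lines. Again this is Python, not the perl script itself, and the dictionary keys (`created`, `keywords`, `title`) are names I picked for illustration rather than Joomla column names.

```python
import os
from datetime import date


def article_fields(filename: str) -> dict:
    """Derive creation date, keywords and title from a yyyy-mm-dd_firmname_description.pdf name."""
    stem, _extension = os.path.splitext(os.path.basename(filename))
    date_part, *words = stem.split("_")
    return {
        "created": date.fromisoformat(date_part),  # raises ValueError on a malformed date
        "keywords": words,
        "title": " ".join(words),
    }


# Hypothetical file name, purely as an example.
print(article_fields("private/mortgage-firm/1996-05-14_mortgage-firm_annual_statement.pdf"))
```

A file that was never renamed will fail on the date, which is a handy way to spot scans that are still waiting for a proper name.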
If I were to remove a file (like a temporary file, contract or whatever…) the article will be archived in Joomla. It isn’t deleted, but it is also no longer visible to the front-end. Of course the articles are only accessible for registered users (so I like to believe), which is just me and my wife.
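The compare step comes down to two set operations. The sketch below shows only that logic and does not touch a Joomla database. The state strings `"published"` and `"archived"` and the function names are placeholders, and it assumes paths of the form section/category/filename.

```python
from typing import Dict, Set, Tuple


def section_and_category(path: str) -> Tuple[str, str]:
    """Take the first two folder names of a section/category/filename path."""
    section, category, _filename = path.split("/", 2)
    return section, category


def reconcile(archive_files: Set[str], articles: Dict[str, str]) -> Tuple[Set[str], Set[str]]:
    """Compare the files on disk with the files that existing articles refer to.

    articles maps a file path to the article state ("published" or "archived").
    Returns (files that need a new article, files whose article should be archived).
    """
    to_create = archive_files - set(articles)
    to_archive = {
        path
        for path, state in articles.items()
        if path not in archive_files and state != "archived"
    }
    return to_create, to_archive


# Hypothetical data: one new scan on disk, one article whose file was removed.
on_disk = {"business/contracts/2009-01-15_client_contract.pdf"}
known = {"private/insurance-firm/2008-03-02_insurance-firm_policy.pdf": "published"}
print(reconcile(on_disk, known))
print(section_and_category("business/contracts/2009-01-15_client_contract.pdf"))
```

Nothing is ever deleted in this scheme: a missing file only changes the state of its article, which matches the behaviour described above.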
This is only a quick ‘n dirty way to set up a DIY document management system, but it works for me, and even for my wife. She doesn’t have to learn a complex document management tool, which would only put her off using it.
In summary:
I place all my scanned documents in a sensible folder on my server. These folders get synchronized to a remote server every hour (could be any interval I choose of course).
On the remote server the files and folders get scanned by a perl script and Joomla articles are created referring to each individual file. Keywords are used for searching purposes, and files are ordered chronologically and by sections and categories.
Besides my file archive, my photos are also replicated to the remote server, but are only there for recoverability. If I wanted, I could add the Gallery2 package and import them into it. My mail server and websites are hosted on the remote server. The backups made by the mail server, the database backups and the website files are also replicated every hour, but from the remote machine to my home server. Of course I would very much like to be able to recover the websites and mail server as well.
I just wanted to share this with you, and maybe spark some ideas. I am sure there are better ways to do this, but this is a pretty darn good solution for home usage. Not many of you will have taken care of your photo archives or maybe even file archives like this. The remote leased server has plenty of capacity to last me another couple of years and for only € 30,- per month (excluding taxes), I think this is a fair price for my usage. You show me a backup service or online storage service that offers 250GB capacity for only € 30,- and I’ll buy you a case of beer.
I intentionally left out details of the hosting provider. I am not affiliated with them so I do not want to make any type of advertisement for them. If you would like more info on them, just ask me.
If you have any comments, suggestions or questions, please feel free to post them as a comment or contact me through Twitter http://twitter.com/ICoolen
Thanks to Nigel Poulton for helping me out with the post. We collaborated on this using Google Wave. Finally I could put it to good use.



