FileOrganizer

4/2/2018
Most open source projects start with an individual programmer scratching an itch, and FileOrganizer is no exception.

But first, a story: it is interesting to ponder how much digital files, or digital assets of all types, have become part of people's identities.  Not in the sense of a portfolio of essays or programs to be used for self-promotion, but in a deeper sense.  Personally, I went through many moves during high school, college, and my early 20s, and with each move I lost something valuable.  The biggest loss came in the summer of 1997, when I had stored all my worldly possessions in a friend's basement, which then flooded, ruining most of my books and clothes.

Nevertheless, I was able to keep redundant copies of some important files on floppy disks--it is hard to express how much a few hundred kilobytes of high school essays and Pascal code mean to me.  Aside from some childhood photos which my parents kept (and which are now part of my digital backups too), those files are the only things I have left from my youth, other than my memories.  I also have some plain text emails from college which I had the foresight to save (I remember our entire email quota was just 5 MB, compared to the 15 GB available for free on Gmail now).



[Image: DEC VT52 terminal]


I wish I had some of the first programs I wrote in middle school back in 1987, cutting my teeth on a green phosphor VT52 terminal in DEC VAX BASIC on a Romanian Coral PDP-11 clone, plotting simple mathematical functions with * symbols on an 80x24 grid (for lack of graphics capabilities).  Or the Franken program I wrote to create and solve 100 problems over Christmas break, which was my 6th-grade physics teacher's idea of an extra credit assignment.  Back then floppy disks weren't even sold in Romania, and any printouts I had have long been lost.  I do still remember the half dozen formulas I used in that program, starting with friction coefficient multiplied by normal force, on to torque calculations, then some problems involving pressure, and I believe some with potential and kinetic energy.  Not rocket science (the rocket equation would only be introduced to me in high school), but it is a nice trip down memory lane to remember my first projects.

[Image: PDP-11/40]

In high school I had several computers, starting with a Tandy CoCo2, which could not save programs at all, so the 3 or 4 games I wrote on it during summer weekends when I was still 14 were lost as soon as the computer was turned off.  Then came a Tandy 1000 HX, with no hard disk but a 720 KB floppy drive.  I got it in 1990 when it was marked down at Radio Shack, a couple of years after the fictional young Sheldon Cooper got his in the 5th episode of the series.  That is where I wrote all my high school essays (my setup was just like the one on the wiki page, but with no joystick), and also a Star Trek game in 10th grade.

[Image: Tandy 1000 HX]


After I graduated from college, my digital assets kept multiplying exponentially, first with photographs and music, then with TV recordings from a cable capture card, then with virtual machines starting in 2003 or so, and finally with DVDs.  Their size grew in inverse proportion to their emotional value for me, which doesn’t mean that they are worthless.


[Image: Diamond Rio 800 MP3 player]


The picture above is of an MP3 player I received as a “ship-it” award for Office XP on March 2, 2001.  Though its specs now seem puny compared to what is available nearly 2 decades later, it was nice of Microsoft to do it, and I am not one to look a gift horse in the mouth.  Being frugal, I probably wouldn't have bought an MP3 player for myself until years later, once the price went down to under $50.  I was very happy to use the Diamond Rio 800 (or was it a Rio 600?), with its 32MB of flash memory that could hold about 8 high quality MP3 files, or 20 low quality MP3 files, or 30 low bitrate WMA files in Microsoft’s proprietary format.  I chose the latter option since transferring files at USB 1.0 speeds wasn’t fun.  I used the little player heavily for a couple of years, until on a plane trip I tried to use the airline’s non-standard headphones with it.  I am guessing they had an unusually low impedance, because right then the player started making noises when playing music, as if the amplifier was shorted.  Or maybe it was just a coincidence, and some cheap caps lost some electrolyte at the same time (perhaps precipitated by the low pressure in the airplane cabin).

What’s certain is that although the hardware was disposed of over a decade ago, I will carry around the digital files associated with it for the rest of my life.  They hardly take up any space: my favorite 60 WMA files from 2001 take up just 60MB, or 0.001% of a 6TB external backup drive which now costs $100, so they account for about one tenth of one cent.  There is virtually no economic cost to storing them in perpetuity, moving them to a new external drive every few years (and keeping the old ones as redundant backups); the only cost is the mental hassle of adding a drop to the ocean of hundreds of thousands of files I have already accumulated.
This is exactly what the FileOrganizer tool endeavors to alleviate, though I will not go to the length of automating the process of determining whether a WMA file is no longer needed (because there is an MP3 file of the same song already).


I don’t think I am unique.  Over decades, people can accumulate a wide variety of heterogeneous files ranging in size from kilobytes to gigabytes (from text files to music, videos, and large virtual machines).  External storage is cheap but not very reliable, and important things should be kept in at least two, or even three, locations.

Another problem is that manually keeping track of things, unless one rigidly adheres to a strict hierarchical organization, can lead to the opposite problem: too many copies of files in separate places, which are then time-consuming to merge back together.  I must confess that since I was never very disciplined in general, my digital assets are normally pretty disorganized, with redundant copies spread among a dozen external drives, flash drives, SD memory cards, and various computers.
FileOrganizer endeavors to solve many of the problems in this area by doing the following:

  1. First, the program itself will be portable, with no need for an installer. It should be able to run right off the portable external drive (and thus from any computer the drive may be attached to). A simple Windows Forms program in C# would fit the bill, and it should run on any version of Windows made in the last decade.  I don't know if there are any users of Windows XP left, but since there is no need for a flashy UI, I will just keep it as WinForms.
  2. A plain text app.config settings file enables the user to define rules. For instance, if one has 3 large external drives: documents and binaries should be kept in triplicate; movies should be just on drives 1 and 2, TV serials just on drives 1 and 3, and music just on drives 2 and 3.
  3. Another key feature is the ability to organize files in a folder hierarchy based on keywords in file names and file extensions. Furthermore, if the target drive has user-created subfolders, the tool should see the files already there and not copy them again. That determination should be made based only on file size and a hash computed from the first and last dozen bytes; it can’t rely on file names (those can change), nor on timestamps. That way, if one has old archival CDs or DVDs, FileOrganizer can just scan them, regardless of how the files are organized on the source disk/drive, and only copy the ones that are truly different, not just different in name or timestamp.
  4. One related feature would be to flag duplicates already on the target drive. The tool wouldn’t delete them automatically, but it could generate a DeleteDuplicatesYYYY-MM-DD.bat batch file that the user could choose to run, after first checking the report of duplicate files. The report would be both in a plain text file and in an HTML file with hyperlinks.
  5. The tool will also output scan results and copy results in a _backupMetadata subfolder, allowing users to use tools like Windiff.exe to see substantial changes between any two dates. Also, text files with file listings are easily portable, and one could also make copies of them locally or online. TBD if one should be able to add tag words (that aren’t part of the filename or pathname already), to make searching for files much easier. In addition to text files, a SQLite DB may also be useful if tag words are desired.
  6. Some optional features may be added for particular file types, for instance ebooks.  PacktPub has a nice “free book of the day” promotion, so I have accumulated about 500 files by now, which have the ISBN as part of their title.  So for each book, there is an _ISBN_.pdf, _ISBN_.mobi, _ISBN_.epub, and often an _ISBN_.zip with code files.  One useful feature would be to look up the ISBN with a web service and rename them all to _ISBN_BookName.pdf etc.  That would allow automatic categorization into the desired hierarchy.  As item 3 above describes, if a matching PDF file with the same size and hash exists, do not create a duplicate PDF, but instead copy the whole set of files to the already selected target folder (renaming the existing PDF to ISBN_BookName.pdf).  If there is a SQLite DB, add an ebook record to it, including links to all formats, a synopsis, the rating on Amazon, a link to Amazon’s book reviews, and perhaps any other tag words or metadata from Amazon that might be useful.
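
A minimal sketch of the replication rules from item 2 follows. It is written in Python purely for brevity, not in the C# planned for the tool, and it hard-codes the rules instead of reading them from app.config. The category-to-drive mapping comes from the example in item 2; the extension lists, the S01E05-style episode pattern, and the function names are hypothetical and serve only as illustration.

```python
import os
import re

# Category -> drives that should hold a copy (from the example in item 2).
REPLICATION_RULES = {
    "documents": (1, 2, 3),
    "binaries": (1, 2, 3),
    "movies": (1, 2),
    "tv": (1, 3),
    "music": (2, 3),
}

# Hypothetical extension lists; a real settings file would define these.
EXTENSION_CATEGORIES = {
    ".txt": "documents", ".pdf": "documents", ".doc": "documents",
    ".exe": "binaries", ".zip": "binaries",
    ".mp3": "music", ".wma": "music",
    ".avi": "movies", ".mkv": "movies",
}

# Hypothetical keyword rule: video files named like "S01E05" count as TV serials.
TV_EPISODE = re.compile(r"s\d{1,2}e\d{1,2}", re.IGNORECASE)


def categorize(filename):
    """Return the category for a file name, or None if no rule matches."""
    stem, ext = os.path.splitext(filename)
    category = EXTENSION_CATEGORIES.get(ext.lower())
    if category == "movies" and TV_EPISODE.search(stem):
        return "tv"
    return category


def target_drives(filename):
    """Return the tuple of drive numbers that should hold this file."""
    return REPLICATION_RULES.get(categorize(filename), ())
```

With these assumed rules, `target_drives("Show.S01E05.mkv")` gives `(1, 3)`, while a file with an unrecognized extension gives an empty tuple and would be left alone.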

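The duplicate check from items 3 and 4 could be sketched roughly as below, again in Python and not in the planned C#. The file size and the first and last dozen bytes come from item 3; the choice of SHA-256 and the names `fingerprint` and `find_duplicates` are assumptions made for this illustration.

```python
import hashlib
import os
from collections import defaultdict

EDGE_BYTES = 12  # the "first and last dozen bytes" from item 3


def fingerprint(path):
    """Return (size, digest) built from the file size and its first/last bytes."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(EDGE_BYTES)
        # For files shorter than 2 * EDGE_BYTES the tail overlaps the head,
        # which is harmless: the result is still deterministic.
        f.seek(max(size - EDGE_BYTES, 0))
        tail = f.read(EDGE_BYTES)
    # SHA-256 is an assumption; any stable hash would do here.
    digest = hashlib.sha256(head + tail).hexdigest()
    return size, digest


def find_duplicates(root):
    """Group files under root by fingerprint, ignoring names and timestamps."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            groups[fingerprint(path)].append(path)
    # Keep only fingerprints shared by more than one file.
    return {key: sorted(paths) for key, paths in groups.items() if len(paths) > 1}
```

Reading only two dozen bytes per file keeps scanning fast even on slow optical media, at the price of possible false matches: two files of equal size that differ only in the middle would get the same fingerprint. That is one more reason to review the duplicate report before running any generated batch file.
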
This will be an ongoing project; the first few numbered list items will likely be implemented in the summer of 2018, with the rest depending on my spare time and motivation.  I know I will do what I need for myself to bring under control all the backups that multiplied like Trek Tribbles, but beyond that, it is not likely that I will invest too much time in this project.  I will probably create a GitHub open source project for it, should anyone be interested.

There are some possible features that I almost certainly will not have time to implement, but if anyone wants to, they would be welcome to, and I will incorporate them in the main program if it can be done gracefully.  One thing that comes to mind is music collections: aside from differing file names and path names, which the existing proposal already accounts for, copies of the same song saved as MP3 files can also have different sizes and hashes, which the size-and-hash comparison would not recognize as duplicates.


Other features I may find more valuable, e.g. automatic grouping of virtual machine backups.  But in that case, it is probably best to manually standardize on file names.  I keep them as 7z archives of VHD files, and those are pretty expensive to scan and categorize, so that won’t be done.





