Since there are so many smart people reading and writing here, I thought it might be worth a shot asking for a little help:
I've run into a nasty little problem at work with a tool that's just not up to the task it's needed for, so I was wondering if I could do it better myself instead.
Here's the basic idea: The program should crawl a hard drive, analyze the folder and file structure, run a couple of tests on everything (simple naming schemes and, more complicated, a few image checks incl. corruption checks) and correct all the mistakes it finds wherever possible. The final results should be exported to an Excel file.
Now, my first question is: Am I completely out of my mind, or is this possible for someone who's able to throw a few basic scripts together?
It is important here to be precise about what you want your script/program to accomplish. Richard Masters mentioned chkdsk, and it seems much of what you want is in fact done by chkdsk. Now I'm not intimately familiar with how chkdsk operates, but my impression is that it is the equivalent of its Unix/Linux cousin fsck - so I'll discuss the rest from my Unix/Linux background.
What fsck does is check the integrity of the filesystem structure - typically the administration that is hidden under the hood from the user, even the administrator. It walks down the filesystem tree from the root (top) directory, and checks that every file indeed has the right number of data blocks allocated to it (e.g., an 11K file should have 3 blocks of 4K), it checks that those data blocks are not allocated to two files at the same time (well, FAT can't do this by design), and it checks that data blocks which are marked as in use indeed belong to some file.
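The block count in that example is just a rounded-up division. A minimal sketch of the arithmetic (the function name blocks_needed is made up for illustration, and real filesystems add their own bookkeeping on top):

```shell
# blocks_needed SIZE_BYTES BLOCK_BYTES: blocks required to hold the file,
# i.e. SIZE divided by BLOCK, rounded up.
blocks_needed() {
    size_bytes=$1
    block_bytes=$2
    echo $(( (size_bytes + block_bytes - 1) / block_bytes ))
}

blocks_needed 11264 4096   # an 11K file on 4K blocks: prints 3
```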
If that's what you want, then the tool you want already exists in the form of Windows chkdsk or Linux fsck.
What it won't do, however, when it finds a file MYFILE.PDF, is check whether the contents of that file are indeed valid PDF; or, in the case of a file MYPIC.JPG, whether it's a valid JPG file. chkdsk couldn't even do that: when you make a filetype association in Windows, you say that files with extension PDF should be opened by some program called acroread, but you don't say with which program the contents can be checked. (And in Linux, additionally, there are no such strict filetype associations.)
If that's what you want, you're entirely dependent on what tools already exist. While it would be possible for a competent programmer to write a checker for JPG files, it is a major task and not something done in one or two lost afternoons. And then I'm using a readily accessible standard like JPG as an example; others would be less easily obtainable, and I won't even start on the complexity involved in, say, a DOC file and its myriad of different formats (Word 7, 95, 97, ...). And you'd have to write a separate checker for each file format you're interested in.
So, if you want to look into the contents of the files, your only bet is that someone else already did the work.
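To give an idea of how far a cheap test gets you: a JPG file starts with the bytes FF D8 and normally ends with FF D9, and a PDF starts with "%PDF-". The sketch below only looks at those signatures. It is not a validity check (a file can pass and still be garbage in the middle, and some perfectly viewable JPGs carry extra bytes after the end marker), but it does catch truncated files and files with the wrong extension. The function names are made up, and it assumes a head/tail that understands -c, as the GNU and BSD versions do.

```shell
# hex_of: read bytes on stdin, print them as one lowercase hex string.
hex_of() {
    od -An -tx1 | tr -d '[:space:]'
}

# looks_like_jpeg FILE: succeeds if FILE starts with FF D8 and ends with FF D9.
looks_like_jpeg() {
    [ "$(head -c 2 "$1" | hex_of)" = "ffd8" ] &&
    [ "$(tail -c 2 "$1" | hex_of)" = "ffd9" ]
}

# looks_like_pdf FILE: succeeds if FILE starts with "%PDF-" (25 50 44 46 2d).
looks_like_pdf() {
    [ "$(head -c 5 "$1" | hex_of)" = "255044462d" ]
}
```

Something like `looks_like_jpeg MYPIC.JPG || echo "MYPIC.JPG: suspicious"` would then flag the obvious cases; everything deeper still needs a real checker.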
Someone recommended C++ to make the thing run fast; I've never so much as looked at C or C++ though. Should I stick to something I know at least a little more about, or would it be worth the trouble?
Nonsense. Suppose you are interested in checking the integrity of PDF and JPG files, and you have hunted down tools checkpdf and checkjpg which can do that. Then, on Linux, I'd cobble together the following shell script:
Code:
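# Walk every regular file and hand it to the matching checker.
find / -type f | \
while IFS= read -r f; do    # IFS= and -r keep spaces and backslashes in names intact
    case "$f" in
        *.pdf) checkpdf "$f" ;;
        *.jpg) checkjpg "$f" ;;
    esac
done > checkreport.txt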
While you may not be able to read Linux shell scripts, it serves to demonstrate that scripting languages (whether Linux shell, Windows PowerShell, Perl, Python, ...) are geared to have primitives which make it easy to write such tasks. You'd have to write a lot more code in C or C++ to accomplish this, and it wouldn't give the speedup you'd expect. The real cost of the above is in the various checker programs.
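Since the results are supposed to end up in Excel, the same loop could write a CSV file, which Excel opens directly. This is only a sketch under a few assumptions: checkpdf and checkjpg are still the hypothetical tools from the script above, each checker is assumed to signal a bad file through a non-zero exit status, and the function name scan_tree is made up.

```shell
# checkpdf and checkjpg are hypothetical tool names; point these at real checkers.
: "${PDF_CHECKER:=checkpdf}"
: "${JPG_CHECKER:=checkjpg}"

# scan_tree DIR: print a CSV report (path,status) for the PDF/JPG files under DIR.
# Assumes each checker exits with status 0 when the file is fine.
scan_tree() {
    echo "path,status"
    find "$1" -type f | while IFS= read -r f; do
        case "$f" in
            *.pdf|*.PDF) checker=$PDF_CHECKER ;;
            *.jpg|*.JPG) checker=$JPG_CHECKER ;;
            *) continue ;;          # not a file type we have a checker for
        esac
        if "$checker" "$f" >/dev/null 2>&1; then
            status=ok
        else
            status=FAILED
        fi
        # Double any quotes and wrap the path so commas in names keep their column.
        escaped=$(printf '%s' "$f" | sed 's/"/""/g')
        printf '"%s",%s\n' "$escaped" "$status"
    done
}
```

Usage would be along the lines of `scan_tree /some/directory > checkreport.csv`. Note that a checker which is not installed at all also shows up as FAILED here, so it pays to try the tools by hand first.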
What tools and programs would I need? Is it possible to design a simple GUI for the entire thing, too?
Why do people want to have GUIs? If the program needs to check many files, at today's disk sizes, it could well run for a couple of hours. You might want to schedule it to run at 2AM, and then the GUI stands in the way.
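On Linux, such an unattended run is typically a small wrapper started by cron. A rough sketch, assuming the scan itself is available as a command; the function names and the script path in the crontab comment are made up for illustration:

```shell
# report_path DIR: name of today's report file inside DIR.
report_path() {
    printf '%s/checkreport-%s.txt\n' "$1" "$(date +%Y-%m-%d)"
}

# nightly_check SCAN_CMD REPORT_DIR: run SCAN_CMD without any interaction and
# keep its output, including error messages, in today's report file.
nightly_check() {
    mkdir -p "$2" || return 1
    "$1" > "$(report_path "$2")" 2>&1
}

# Example crontab entry (edit with `crontab -e`) that starts a script at 02:00
# every night; /usr/local/bin/nightly-check.sh is a hypothetical script that
# calls nightly_check:
#   0 2 * * * /usr/local/bin/nightly-check.sh
```

In the morning you just read the report; nothing ever needs to be clicked.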