
Programming / help needed to get started

Rasmus

Since there are so many smart people reading and writing here, I thought it might be worth a shot to ask for a little help:

I've run into a nasty little problem at work with a tool that's just not up to the task it's needed for, so I was wondering if I could do it better myself instead.

Here's the basic idea: The program should crawl a hard drive, analyze the folder and file structure, run a couple of tests on everything (simple naming schemes and, more complicated, a few image checks including corruption checks) and correct all found mistakes wherever possible. The final results should be exported to an Excel file.

Now, my first question is: Am I completely out of my mind, or is this possible for someone who's able to throw a few basic scripts together?

Someone recommended C++ to make the thing run fast, but I've never so much as looked at C or C++. Should I stick to something I know at least a little more about, or would it be worth the trouble?

What tools and programs would I need? Is it possible to design a simple GUI for the entire thing, too?


Thanks,

Rasmus.
 
It's not that you couldn't do it yourself, but depending partly on the tools you want to use, it could take quite a while. You mentioned you haven't touched C or C++, but you made it sound like you're familiar with another language; if that's the case, which one?

One other alternative (unless you're interested in the idea of programming) would be to get a free tool that does this already.

For example, Windows (and DOS) comes with chkdsk.

Someone else might have other suggestions.
 
Sounds like just the ticket for a good scripting language like Python, Ruby, or, god forbid, Perl. I doubt you want to generate an Excel file, but rather some sort of comma-separated value (csv) file that you then import into Excel.

I'm not sure how you would check for corrupted images, but you must have something in mind. Correcting the problems is another can of worms. Be careful you don't cause more trouble than you fix.

GUI? Why? Do you want to get the job done or do you want it to be pretty?

~~ Paul
 
First - I concur with the other posters.

Now, if you really have to code it yourself, then C/C++ would be my last choice, unless you were already proficient in C. A C-based executable would likely be much faster than, say, Python or Ruby, but in practice the vast majority of the program's time will be spent waiting for disk access; if the Python version takes 4 hours to evaluate a disk, the C++ version would probably take about 3 hours and 59 minutes.

If I had to code up something quick, I'd probably do it in C# (assuming that the .exe will be running on a Windows machine, which may not be a safe assumption) but only because I work fastest in C#, not because it's particularly well-suited to this task. If I wanted to broaden myself a bit and use a tool that's probably better suited, I'd use Python or Ruby. (I have some experience in Python, but Ruby looks interesting and I'd like to have an excuse to spend some time with it). Another interesting possibility, if it's going to run in Windows, would be PowerShell.

If there's an existing disk tool API that you want to use, your choice of language may be driven by what the API supports.

Absolutely concur with Paul that it's easier to create (and VASTLY easier to debug) a CSV file than an Excel file.

But unless there's something unique about your file structures or that image testing thing, I'd spend a few hours hunting for some existing tool before I'd start thinking about coding it myself. And I'm saying that as someone who codes recreationally.
 
Here's the basic idea: The program should crawl a hard drive, analyze the folder and file structure, run a couple of tests on everything (simple naming schemes and, more complicated, a few image checks including corruption checks) and correct all found mistakes wherever possible. The final results should be exported to an Excel file.

Now, my first question is: Am I completely out of my mind, or is this possible for someone who's able to throw a few basic scripts together?
It is important here to be precise about what you want your script/program to accomplish. Richard Masters mentioned chkdsk, and it seems much of what you want is in fact done by chkdsk. Now, I'm not intimately familiar with how chkdsk operates, but my impression is that it is essentially the equivalent of its Unix/Linux cousin fsck - so I'll discuss the rest from my Unix/Linux background.

What fsck does is check the integrity of the filesystem structure - typically the administration that is hidden under the hood from the user, even the administrator. It walks down the filesystem tree from the root (top) directory and checks that every file indeed has the right number of data blocks allocated to it (e.g., an 11K file should have 3 blocks of 4K), that those data blocks are not allocated to two files at the same time (well, FAT can't do this by design), and that data blocks which are marked as belonging to some file are indeed referenced by a file.

If that's what you want, then the tool you want already exists in the form of Windows chkdsk or Linux fsck.

What it won't do, however, is check whether a file MYFILE.PDF actually contains valid PDF, or whether a file MYPIC.JPG is a valid JPG. chkdsk couldn't even do that: when you make a filetype association in Windows, you say that files with the extension PDF should be opened by some program called acroread, but you don't say which program can check the contents. (And in Linux, additionally, there are no such strict filetype associations.)

If that's what you want, you're entirely dependent on what tools already exist. While it would be possible for a competent programmer to write a checker for JPG files, it is a major task and not something done in one or two spare afternoons. And that's using a readily accessible standard like JPG as an example; others would be less easily obtainable, and I won't even start on the complexity involved in, say, a DOC file and its myriad of different formats (Word 7, 95, 97, ...). And you'd have to write a separate checker for each file format you're interested in.

So, if you want to look into the contents of the files, your only bet is that someone else already did the work.

Someone recommended C++ to make the thing run fast, but I've never so much as looked at C or C++. Should I stick to something I know at least a little more about, or would it be worth the trouble?

Nonsense. Suppose you are interested in checking the integrity of PDF and JPG files, and you have hunted down tools checkpdf and checkjpg which can do that. Then, on Linux, I'd cobble together the following shell script:
Code:
find / -type f | \
while IFS= read -r f; do
    case "$f" in
    *.pdf) checkpdf "$f" ;;
    *.jpg) checkjpg "$f" ;;
    esac
done > checkreport.txt
While you may not be able to read Linux shell scripts, it serves to demonstrate that scripting languages (whether Linux shell, Windows PowerShell, Perl, Python, ...) provide primitives that make such tasks easy to write. You'd have to write a lot more code in C or C++ to accomplish this, and it wouldn't give the speedup you'd expect. The real cost of the above is in the various checker programs.
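
On the Windows side, a rough PowerShell equivalent might look like the sketch below; checkpdf and checkjpg are the same hypothetical checker tools as in the shell version, and the drive letter is just an example.
Code:
# Hypothetical checkpdf/checkjpg checkers, as in the shell script above.
Get-ChildItem -Path C:\ -Recurse -Include *.pdf,*.jpg -ErrorAction SilentlyContinue |
    ForEach-Object {
        switch ($_.Extension.ToLower()) {
            ".pdf" { checkpdf $_.FullName }
            ".jpg" { checkjpg $_.FullName }
        }
    } | Out-File checkreport.txt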

What tools and programs would I need? Is it possible to design a simple GUI for the entire thing, too?
Why do people want to have GUIs? If the program needs to check many files, at today's disk sizes it could well run for a couple of hours. You might want to schedule it to run at 2 AM, and then a GUI just stands in the way.
 
Let me say this about that...

The part about crawling the directory structure can be done in almost any language; I've used C/C++, Visual Basic and even assembler code, though not on a PC. That's the easy part. Manipulating the file structure itself can probably only be done in C/C++; I doubt the necessary structures and such are available in other languages. Speed is not the main thing here: accuracy is at the top of the list, and that means very thorough testing. You will find the program will probably be disk I/O bound in any case, and probably won't run the CPU at more than 10%. More runtime efficiency buys you nothing.

If you intend to mess with the file directory structure, you'd better know what you are doing - what structures and APIs are available and how to use them. If you muck it up you could easily make your target disk unusable without a reformat. You'll need care in setting up a testing regime that will keep you from doing just that. This seems like an awfully difficult and risky task to be doing with no experience in systems programming. Personally, I'd look long and hard for something already verified out there to do the job.

The first thing I would do is program the app to run and diagnose, but not manipulate the disk. You can then get a feel for how your program is running and the sorts of problems you can expect to encounter, before you take the plunge. Be conservative on this one, really conservative.
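
One way to keep that diagnose-only phase honest, if the corrections end up being file renames done from PowerShell, is the built-in -WhatIf switch: Rename-Item then only reports what it would do. A minimal sketch, with an invented .jpeg-to-.jpg rule and a made-up folder:
Code:
# -WhatIf prints the rename that would happen without touching the disk.
# Drop it only once the reported actions look right.
Get-ChildItem -Path C:\Data\*.jpeg |
    ForEach-Object {
        $newName = $_.Name -replace '\.jpeg$', '.jpg'
        Rename-Item -Path $_.FullName -NewName $newName -WhatIf
    }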
 
Why do people want to have GUIs? If the program needs to check many files, at today's disk sizes it could well run for a couple of hours. You might want to schedule it to run at 2 AM, and then a GUI just stands in the way.

I think people want to see the fruit of their labor, and a GUI is a pretty neat way to communicate it to others as well.

A static screen may not offer enough feedback.

If a GUI is important, I would go with C# or any .NET language (which includes C++).
 
And one more thing, Shadron is right on. :)
 
Here's the basic idea: The program should crawl a hard drive, analyze the folder and file structure ...

It's called CHKDSK on Windows machines and fsck on every Unix system I ever worked on.
 
The part about crawling the directory structure can be done in almost any language; I've used C/C++, Visual Basic and even assembler code, though not on a PC.
Agreed, but you won't beat the one-liner I gave above in shell script, or the corresponding one-liner in Perl.

That's the easy part. Manipulating the file structure itself can probably only be done in C/C++; I doubt the necessary structures and such are available in other languages.
You'd basically need two things: (1) the ability to read file contents in binary mode (as opposed to line-based text mode), and (2) the ability (and tools) to parse binary (non-text) content. Linux shell scripts can't do (1). Perl can do both, but I'd go for C/C++ (with lex and yacc) for such a task.

However, if I felt the urge to write a checker for file formats like PDF or GIF or JPG, it'd have to be a very big urge before I embarked on it, as it's a major project.
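
A far more modest check, which doesn't validate the format at all but does catch files that aren't what their extension claims, is to look at the magic bytes. A sketch in Windows PowerShell, covering JPEG only (JPEG files start with the two-byte SOI marker FF D8; the path is made up):
Code:
# Cheap sanity test only: read the first two bytes and compare them with the
# JPEG SOI marker (0xFF 0xD8). Not a substitute for a real format checker.
function Test-JpegHeader {
    param([string]$Path)
    $bytes = @(Get-Content -Path $Path -Encoding Byte -TotalCount 2)
    return ($bytes.Count -eq 2 -and $bytes[0] -eq 0xFF -and $bytes[1] -eq 0xD8)
}

Test-JpegHeader "C:\Pictures\example.jpg"   # hypothetical path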

Speed is not the main thing here: accuracy is at the top of the list, and that means very thorough testing. You will find the program will probably be disk I/O bound in any case, and probably won't run the CPU at more than 10%. More runtime efficiency buys you nothing.
Amen.

If you intend to mess with the file directory structure, you'd better know what you are doing - what structures and APIs are available and how to use them. If you muck it up you could easily make your target disk unusable without a reformat. You'll need care in setting up a testing regime that will keep you from doing just that. This seems like an awfully difficult and risky task to be doing with no experience in systems programming. Personally, I'd look long and hard for something already verified out there to do the job.
As long as you're only reading the disk, nothing major could happen, though?

In my previous job, I extended and maintained a small tool we used as a demonstration tool in the Unix system administration course; the tool showed how the Unix filesystem administers the data blocks associated with a file. I extended it for use with Veritas 4, ufs and ext2fs filesystems. In every case, I only needed the C header files to understand the filesystem structure and get it working :-).

Given how long it took for Linux to even get a read-only NTFS filesystem kernel module, I wouldn't translate this experience to the Windows world. ;)
 
Richard Master said:
I think people want to see the fruit of their labor, and a GUI is a pretty neat way to communicate it to others as well.

A static screen may not offer enough feedback.
Bah humbug. This is a job for a log file. I suppose if you first want to display the log file using weird fonts and funky colors, so be it. Just make sure I can select and copy the entire thing.

~~ Paul
 
It would be a tad more difficult, yes. ;) But there's complete documentation available on the ZFS filesystem structure, unlike NTFS.

If the task is to search for and repair file system damage, ZFS is the solution, since it does its own scrubbing. If the task is to repair damage caused by errant applications, then detailed filesystem knowledge should not be necessary.
 
I agree with this:

Suppose you are interested in checking the integrity of PDF and JPG files, and you have hunted down tools checkpdf and checkjpg which can do that. Then, on Linux, I'd cobble together the following shell script:
Code:
find / -type f | \
while IFS= read -r f; do
    case "$f" in
    *.pdf) checkpdf "$f" ;;
    *.jpg) checkjpg "$f" ;;
    esac
done > checkreport.txt
While you may not be able to read Linux shell scripts, it serves to demonstrate that scripting languages (whether Linux shell, Windows PowerShell, Perl, Python, ...) provide primitives that make such tasks easy to write. You'd have to write a lot more code in C or C++ to accomplish this, and it wouldn't give the speedup you'd expect. The real cost of the above is in the various checker programs.

To check the integrity of image files, ImageMagick comes to the rescue! Use the identify command-line utility:
http://www.imagemagick.org/script/identify.php
Just check for a non-zero exit status, which indicates some sort of error occurred.
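
From PowerShell, that exit-status check could look like the sketch below, assuming identify is installed and on the PATH (the image path is made up). Depending on the ImageMagick version, a plain identify may only read enough of the file to parse the header; identify -verbose forces a fuller decode and catches more corruption, at the cost of speed.
Code:
# Assumes ImageMagick's identify is installed and on the PATH.
$image = "C:\Pictures\example.jpg"          # made-up path
& identify $image 2>&1 | Out-Null
if ($LASTEXITCODE -ne 0) {
    Write-Output "$image looks corrupt (or is not an image at all)"
}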
 
If the task is to search for and repair file system damage, ZFS is the solution, since it does its own scrubbing. If the task is to repair damage caused by errant applications, then detailed filesystem knowledge should not be necessary.

I thought you were referring to my anecdote about the program I maintained.

If you're referring to the OP: with the Excel mention, I bet he's on Windows, so ZFS is not an option. CHKDSK is your friend then.
 
I don't think it's at all clear what you want to do. If you just want to scan through a certain folder and make sure that the filenames conform to a certain naming scheme, you can do that easily enough in most programming or scripting languages using regular expressions or whatever string parsing principles you're comfortable with.
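
As a sketch of that filename check in PowerShell (the naming scheme below, IMG_ plus four digits, is invented, since the actual scheme hasn't been given):
Code:
# The pattern is a placeholder; substitute the real naming scheme.
$pattern = '^IMG_\d{4}\.(jpg|png)$'
Get-ChildItem -Path C:\Data -Recurse |
    Where-Object { -not $_.PSIsContainer -and $_.Name -notmatch $pattern } |
    ForEach-Object { Write-Output ("Bad name: " + $_.FullName) }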

If, on the other hand, you want to actually analyze the NTFS filesystem itself, it's not going to be realistic. It would happen at such a low level that it requires intimate knowledge of both Windows and NTFS. C/C++ is the only choice then, since accessing native APIs in anything else is just too much trouble to be worth it. If you also want to repair it (a destructive action), you run the risk of invalidating the file system and losing data unless you know what you're doing. I suggest you run as fast as you can in the opposite direction.

I don't think it's clear what you mean by image either. Are you talking about actual picture files, or program images?

As for exporting the data, that is the easiest part. If you have Excel installed, it is completely COM-enabled and usable from most Windows-centric languages (in PowerShell's scripting language you would simply create the object with $excel = new-object -com Excel.Application and then add worksheets and items to it and finally display or save it). Or, like someone else suggests, you can export CSV, which can then be imported into Excel or anything else. PowerShell can also make GUIs easily enough, although it requires some familiarity with Windows Forms. If you were to use C/C++, the Win32 GUI API has a very steep learning curve and you are going to have to devote a fair bit of time to even understand how to create an empty window.
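
A minimal sketch of that COM route follows; the Excel.Application ProgID is real, while the cell contents and output path are made up for illustration.
Code:
# Requires Excel to be installed; writes a tiny two-row report and quits.
$excel = New-Object -ComObject Excel.Application
$excel.Visible = $false
$workbook = $excel.Workbooks.Add()
$sheet = $workbook.Worksheets.Item(1)
$sheet.Cells.Item(1, 1) = "Path"
$sheet.Cells.Item(1, 2) = "Result"
$sheet.Cells.Item(2, 1) = "C:\Data\example.jpg"
$sheet.Cells.Item(2, 2) = "OK"
$workbook.SaveAs("C:\Temp\checkreport.xls")
$excel.Quit()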
 
Hi guys, and thanks for all the replies. I'll try to clarify a couple of things and apologize for the lack of clarity earlier.

I am on a Windows XP machine. I learned some Pascal a million years ago, did very limited amounts of VBA a while back, and more recently just had to work with AutoIT. I am a little further from computer illiterate than I am from being a proper programmer, I guess.

- First, I didn't intend to do any system-level stuff regarding the file structure. I just need to make sure that the names of files and directories adhere to a specific naming scheme. Testing these is easy, and it should be just as easy to correct them. But I do like the idea of also running chkdsk etc.

- Secondly, if the tool works (and works better than what I have now) I might get my employer to use it. That would require at least a basic GUI. Other than that, I fully agree that getting the job done is more important than being pretty.

- I need to output Excel since that is what is being used to track stuff. (And I know I'll burn in hell for that alone, too.)

- Checking whether a file is an actual image file will probably be very difficult, but it would be worth it. (I have a very unreliable check at the moment, which tends to yield false negatives. Just opening those files takes enough time to justify looking into a better solution - never mind that I'll be able to sleep at night when things are done right ...) (And we are talking about images as in "pictures".)

If it's true that the main problem will be accessing the disk drives, then I think what little I know of AutoIT (plus the excellent pointer to ImageMagick) might be enough to get the job done. But since I'll be doing this on my own time, I might take the chance and play around with one of the languages you suggested.
 
If Pascal is your best language, you might try Borland's Delphi. It's basically Turbo Pascal for Windows, and Turbo Delphi is a free download.

If your Pascal is barely remembered, then I'd say go straight to one of the other languages (sorry, Borland).

For Excel vs. CSV... bear in mind that Excel opens CSV files just fine. They're not formatted (column width, number formats, etc.) but the data goes into the right cells. So it's not quite Excel vs. CSV, it's Excel .xls vs. Excel .csv. If you simply need to get the data into Excel once in a while, I'd go with CSV. If a lot of people are doing this all the time, then you're probably stuck with native .xls.

I'd still probably code it up for .csv first and add the .xls output after I got the rest working.
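
For the CSV-first route, a hedged PowerShell sketch (the rows and paths are invented); Export-Csv handles the quoting, and Excel opens the result directly:
Code:
# Build result objects however the checks produce them, then export once.
$rows = @(
    New-Object PSObject -Property @{ Path = "C:\Data\example.jpg"; Result = "OK" }
    New-Object PSObject -Property @{ Path = "C:\Data\broken.jpg"; Result = "suspect" }
)
$rows | Export-Csv -Path C:\Temp\checkreport.csv -NoTypeInformation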
 
I agree with what everyone has said about CSV vs. Excel, but if you really need an actual Excel spreadsheet with formatting, then I would do it by automating Excel rather than trying to figure out the file format. It's not very hard, at least with Microsoft's programming tools. You could try the Express editions of Visual Basic or Visual C# for free.
 
