Student Java Projects

Last updated by Roedy Green ©1998-1999 Canadian Mind Products.

Project Difficulty
A Tractable Artificial Intelligence Problem 7
Book Store Referral Applet 3
BusTel, exchange electronic business cards during a voice call. 2
Child Abuse Database 5
Colour Chart 0
Constructor Amanuensis 3
Defragger 5
Deleecher 2
Deleter, Delete Files With Wildcards 1
Delta Creator 4
Dissolve Designer 1
Domain Registry 4
Domain Search Engine 4
Dynamic Version Control 9
Fast Snail Mail 6
Foreign Language Trainer 3
Form Filler 2
GridBag Amanuensis 3
Hermit Crab Variable Length Record Files 1
HTML Glossary Presenter 4
HTML Table Reorganiser 1
HTML Table Sorter 1
HTML Tidier 2
Ideal lover database search 4
Infinite Disk, integrated file migration/backup 4
INI file tidier 2
Java File System 6
Java Powered TV Commercials 3
Java Source Code Beautifier 5
Java Source Code Presenter 3
Java Source Code SCID-style browser/editor 8
JavaDOC tools 3
Linkcop clone 1
LINT for Java 6
Localisation Tool 4
Mailreader/Newsreader 7
Native Class Amanuensis 2
Password Protector 1
Prebranded Software rental with auto updates 6
Regex Proofreader 5
Reimplement Standard classes 2
SAX File Transfer Protocol 3
SCID (Source Code In Database) Java browser/editor 8
SCROOM Web site avoider 3
Sort Visualiser 0
Stay In Touch Database 4
Struct reader Amanuensis 3
Submit-It Clone 2
Super Compressor 3
Systray Deleecher 2
TweakDUN Clone 1
Uncrackable Encryption 1
WebRing Controller 4

Introduction

Here are 49 ideas for student projects, mostly involving Java. Some ideas might be done by teams or a whole class. Some may develop into commercial projects. You are free to take these ideas and use them as you wish. I would appreciate hearing from you if you decide to implement any of these ideas. I would also like to hear any comments you would like me to add to the descriptions, or ideas you have for projects. If you have such project ideas on your own website, I would be happy to build links here.

I have discussed most of the projects many times at length over the last decades on www.bix.com. BIX has retained all postings in archives. For most projects I have additional materials I can provide you to give you hints on how to proceed.

A Tractable Artificial Intelligence Problem

This is a fairly simple artificial intelligence problem that mimics, in miniature, the problem computer programmers have extracting specifications from users and turning those specifications into code.

Have a look at The Learn To Count Applet, particularly the source code. It contains code to count out in words in various languages.

I wrote the class for each language by getting a set of example numbers and words from a native speaker, a phrasebook or by sampling Internet posts, e.g. 21 == "vingt et un" in French, then deducing the rules for converting numbers to words and writing the code. I would then print out a set of sample numbers using the rules (selecting the samples to ensure I covered all the rule variants), and check them with a native speaker. He would then give me some more examples to explain the exceptions to the rules he had forgotten about previously. I would then convert those exceptions into rules, code them, and go round again until the native speaker was happy. Sometimes the native speaker would tell me the rules directly. Usually, these were less useful and more ambiguous than two or three examples.

I discovered there were various patterns in the rules. Different languages had various combinations of the standard patterns.

The process was fairly easy, quite mechanical, and allowed for a fair bit of trial and error rather than demanding brute reasoning.

Could you automate the task of writing code for additional languages?

Could you write a program that, shown just some counting examples from a foreign language, could compose a program to spell any long out in words? The answer is yes, you could write such a program, no problem, though it would be "cheating". It would just require a bit of patience to manually write the code for each written human language, then recognize which language a set of examples comes from and cough up the previously optimised code.

What we want is something a little cleverer, that could work on a language it had never seen before, provided it was a human language. Again you could cheat, in a lesser way this time, by simply cataloging all the features of counting in all written human languages, writing custom code to recognize each feature, and cranking out code by turning on or off pieces of preformed boilerplate.

One way to attack the problem is to ask yourself, "How could I rapidly generate, generalise and optimise the cheating solution?" Another is to ask how you could solve the simpler problem of handling just the 20 or so languages the Learn To Count Applet already knows, or some subset of them.

So think of it as a contest. You will be handed a Java class that accepts a long, positive or negative, and returns a String spelling out the corresponding number in words in the target language. Your program will then, by probing the class for examples and making inferences, construct a program that reproduces the one inside the black box, converting numbers to words for the unknown language inside the class.
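The black box and a probing harness might be shaped like this. The interface and class names are my invention, not part of the contest; the memoising exploits the rule that only unique samplings count against you:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical shape of the black box handed to contestants. */
interface NumberSpeller {
    String spell(long n);   // e.g. spell(21) -> "vingt et un" for the French class
}

/** A contestant's prober; repeated queries for the same number are free, so memoise them. */
class Prober {
    private final NumberSpeller blackBox;
    private final Map<Long, String> samples = new HashMap<>();

    Prober(NumberSpeller blackBox) { this.blackBox = blackBox; }

    /** Probe the black box, counting only unique samplings. */
    String probe(long n) {
        return samples.computeIfAbsent(n, blackBox::spell);
    }

    int uniqueSamplings() { return samples.size(); }
}
```

Your inference engine would then study the cached samples, looking for the standard patterns, and emit a class with the same signature as the black box.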

You get points for:

  1. Elegance (mathematically simple and ingenious)
  2. Generality (ability to handle various curves thrown at it.)
  3. Compactness, speed and beauty of both the code itself and the code generated.
  4. How accurate generated code is.
  5. How few calls to the black box class you required to generate examples. Only unique samplings count; there is no penalty for asking for the same number more than once.
  6. If you want a slightly simpler problem, for fewer points, assume you are given a bit map of interesting examples to start, that cover at least one example of each of the quirks of the language. You may still need to probe other numbers to test your assumptions about the patterns behind those examples.
  7. You get bonus points if you can also generate code to go the other way: take a number accurately spelled out in words in the foreign language and convert it to a long.
  8. You also get bonus points for being the first to submit a long->words class with Java source for any actual human language to the Learn To Count class library. There is no point in great duplication of effort there.

Book Store Referral Applet

The intent of this Applet is to allow you to automatically specify and keep up to date several sources for a book you mention in a web page. Here is the problem that I and many web authors have. We want to make a reference to a book e.g. Frank van Gilluwe's excellent The Undocumented PC (ISBN 0-201-47950-8). If you click on that, you will be taken to www.Amazon.com. If you buy the book I will get a 15% commission.

However, that is not necessarily the best place for you to buy the book. If you look in the Java glossary under Book Stores you will find a list of quite a few online book stores. One may have the book more cheaply. One may be located closer to you. You may have a special discount arrangement with one. Another may be able to deliver it more quickly. You may have had a particularly bad or good experience with some store. Some stores may not accept the credit cards you use.

On my side, I may make even more lucrative arrangements with other bookstores, that carry a smaller selection of books. I don't have the time to keep track of which bookstores carry which books and which stores have the best prices on each book. I don't know from day to day which is the best place to buy each book. Further, there is no best place. It depends on too many factors.

I hope it is becoming clear what this Applet does. You click on the book, and it shows you just the bookstores that carry the book, with their price (including shipping, duty and taxes) and delivery time, sorted either by price or by estimated delivery time. It also tells you how fresh/reliable the information is (perhaps using icons) for each bookstore. It may further automatically filter out stores that don't handle one of your available cards, or use icons to show which ones are accepted.

Instead of embedding a reference to Amazon in my HTML, I embed a reference to the comparison shopper Applet, along with the book's ISBN as a parameter.

Applets are not allowed to go shopping, because the sandbox security mechanism stops them from talking to any server except the one they were loaded from. However they could ask the server for this information. A servlet or application on the server is allowed to go comparison shopping. It can keep this information reasonably up to date. It need not comparison shop more than once per day per book. It need not comparison shop about a book no one has asked about recently.

The scheme would let you go visit any of the book stores and then do your shopping and browsing there directly. The Applet's job is simply to direct the people to the most promising on-line bookstores for each book.

You can make money with this if you make commission arrangements with most or all the bookstores you refer to. Further you can rent out your Applet and possibly database server service as well to owners of other websites containing referrals to books. They would then get the referral commissions. They may simply use your Applet free, and you get the commissions or split them 50/50. The benefit to them is pleasing people browsing their website.

Your little engine can work given any ISBN. It need not be one you have on file. The comparison shopper on the server goes out looking while the customer waits. It might even provide estimated prices and availability when it can't get through right away to all the bookstores.

The ISBN consists of 9 digits plus one check digit (0..9 or X). To calculate the check digit you multiply the last digit of the 9-digit number by 2, the next-to-last by 3 and so on, and add the results. The number needed to bring this sum up to the next multiple of 11 is the check digit. If it is 10, the check digit is replaced by the letter 'X' (historically correct of course). In case the user mistypes an ISBN (e.g. by transposing two digits) you want to explain that to him, rather than mislead him by telling him no one has the book in stock. You can read up on the rules for calculating the expected check digit.

Here is an example of a valid ISBN: 0201479508. Work the check-digit arithmetic through by hand to confirm that it is valid.
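A sketch of that validation in Java (the class and method names are mine). An equivalent formulation of the rule above is that the weighted sum of all ten characters must be a multiple of 11:

```java
/** Validate a 10-character ISBN, dashes already stripped. */
public class IsbnCheck {
    static boolean isValidIsbn10(String isbn) {
        if (isbn.length() != 10) return false;
        int sum = 0;
        for (int i = 0; i < 9; i++) {
            char c = isbn.charAt(i);
            if (!Character.isDigit(c)) return false;
            sum += (c - '0') * (10 - i);   // first digit weighted 10 ... ninth weighted 2
        }
        char check = isbn.charAt(9);
        int checkValue = check == 'X' ? 10
                       : Character.isDigit(check) ? check - '0'
                       : -1;
        // valid when the weighted sum plus the check digit is a multiple of 11
        return checkValue >= 0 && (sum + checkValue) % 11 == 0;
    }

    public static void main(String[] args) {
        System.out.println(isValidIsbn10("0201479508"));   // the example above: true
        System.out.println(isValidIsbn10("0201479580"));   // transposed digits: false
    }
}
```

Note how transposing two digits changes the weighted sum, which is exactly why the scheme catches that class of typing mistake.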

ISBNs are sometimes written with dashes, breaking them up into variable length fields, much like class A, B, C IP addresses. For all practical purposes, you can ignore dashes and just remove them. However, if you are a perfectionist and want to display them properly you can read up on the rules. For code you can cannibalise to verify checkdigits and insert dashes, see the ISBN Amanuensis.

If you want to get even more clever, you can hook into the search engines of the various stores to determine the ISBN when the user does not know that, then use the ISBN to go comparison shopping.

There are some comparison shoppers out there already, such as mySimon, but this would be quicker since normally it would have recent prices already on file on a given narrow class of books e.g. Java.

How does this work under the covers? You need a very simple database of book information, possibly a Hashtable, BTree or SQL lookup by ISBN. This database is kept on the server. You can access it via JDBC, CGI or a servlet talking HTTP or raw sockets. At night the server goes shopping, getting fresh prices on the most popular items first, making sure to hop from store to store so as not to put an undue load on any one store. Eventually there may be a protocol for stores to notify you of price or availability changes so you don't need to pester them with nightly polls.

You want to avoid polling any more frequently than absolutely necessary to remain on good terms with the book stores. Reduce polling frequency if the store does not carry the book. Reduce polling frequency if the price has not changed in a long time. Reduce the polling frequency if no other books at the store have recently changed prices. Reduce polling frequency if the book is not popular.
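Those rules of thumb suggest an adaptive per-(store, book) polling schedule. A minimal sketch, in which the class name, field names and every threshold are pure invention:

```java
/** Sketch: adaptive polling interval for one (store, book) pair. */
class PollSchedule {
    static final long DAY = 24L * 60 * 60 * 1000;   // one day in milliseconds

    long interval = DAY;   // start by polling nightly

    /** Back off when a poll finds nothing interesting; reset when prices are moving. */
    void onPollResult(boolean carried, boolean priceChanged, boolean popular) {
        if (!carried || !priceChanged)
            interval = Math.min(interval * 2, 30 * DAY);   // double, capped at a month
        else
            interval = DAY;                                // active price: back to nightly
        if (!popular)
            interval = Math.min(interval * 2, 60 * DAY);   // unpopular books wait longer still
    }
}
```

The same idea extends to the other rules in the text, e.g. backing off the whole store when none of its books have changed price recently.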

If you want to get fancy, you can use cookies to track your customers, mainly so you know what country/state/province they live in to compute shipping charges and taxes accurately. See the list of country codes. For a simple version, you might leave taxes and shipping out. However, you need that information to properly comparison shop. It can be hard to worm it out of a book store owner before committing to purchase.

The hardest part of this project is gathering the price information in many formats from many stores. The interfaces you are dealing with were designed for humans, not computers. They may change at any time. Before you get heavily into that, prototype the idea with fake or hand calculated data. Smooth the rough edges off your user interface, simulate the delays and see if the idea is acceptably quick, and whether end users love or hate it. It may be a licence to print money or a black hole for it.

After you have all this working, you might do a thin client version as well, that does not require an Applet. It works by generating pure HTML at the server.

BusTel, exchange electronic business cards during a voice call

BusTel is a way of rapidly exchanging name/address/credit information, full duplex, during a voice call. This fits in with the "Common Phone" design contest (see BIX telecomm.tech/phone.co) which will implement the BusTel protocol in desk phones. You tap the heart/trust key to exchange names/addresses. You tap twice to exchange credit card info. You tap three times for direct debit exchange. This project has potential to generate big bucks. Basically it works by exchanging vCard packets full duplex in the middle of a voice call. There is an API for applications to extract the vCard information. The default version just displays it with a standard vCard viewer. I have a device for tying two phones back to back, simulating a telephone exchange, that would be useful in debugging such a project.

The other form of BusTel is to exchange encrypted vCards over the Internet for shopping cart use. You and the vendor exchange information about each other.

See vCard, JavaPhone in the Java Glossary.

Child Abuse Database

Tracks suspicious incidents, in themselves not conclusive, but which when viewed in totality show a picture of imminent danger. This is the brainchild of Alexander Investigations, run by Glen Morrison, an ex-cop and private investigator. By tying together reports based on numerous small details which indicate they may refer to the same person, you can build a more complete picture of a suspect.

Imagine computing a correlation matrix of every report against every other report by how many points of similarity they had. You might then be able to group reports into related families. It would greatly help an investigator if he or she had to look only at one family of reports at a time. The computer might then report the largest families as ones most worthy of deeper investigation.

If a very serious incident were reported, such as an attempted abduction, an investigator could ask to see all reports closely related to that crucial one. One of them might contain something very valuable like a license number or very detailed physical description. Perhaps one of the suspect's frequent haunts would come clear. From those aggregate reports, the investigator might be able to put together a picture of the suspect's movements over time, and use that to help gather more information.

The main intent is to deal with anonymous sexual predators, but the system might also work to report incidents of physically abusive parents. By seeing a large number of reports of a pattern of abuse from many different witnesses, authorities might be more willing to act in a case where the parents are good liars and pass off the results of abuse as accident proneness.

Confidentiality is very important. The whole project smacks of "Big Brother is watching you". It treads on moral thin ice. Most of the reports in the database will be completely innocent. A father may be reported as lurking at a playground when he was just waiting for his daughter to finish her swimming lesson. Ideally some group independent of the police would run the database. The database would never be stored on a machine with an Internet connection and it would need to be stored in an encrypted format. Since sexual child abusers may move from city to city, ideally there should be ways of consolidating databases or exchanging information between databases for different cities. You might permit reporting suspicious events via the Internet, but you would not want the main database online.

What sort of data might you potentially record about a suspicious incident? Here is a very rough guide of the sorts of fields that would be needed. To refine the fields and multiple choice value encodings, you need to try encoding a few hundred sample suspicious event reports.

Incident Report

Person Report

Clothing Report

Vehicle Report

Matching logic would be very similar to that used in an Abundance program I wrote for the Humane Society to match lost cats with their owners. For example, a match of red hair against blond is considered a partial match -- better than red against black, but not as good as red against red. Some sorts of match are more important than others. You can build up a points/weighting scheme. In the Abundance Cats program I did most of the calculations for cascaded weighting, normalising the weights at compile time, and moved as much as possible of the matrix calculation out of the inner loops. This made the matching computation very fast. The databases could potentially get very large. You will need to rely on SQL and various background batch processing. You use SQL for coarse match filtering, then weighted point scoring for fine matching.
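A minimal sketch of that weighted partial-match scoring. The field codes, similarity values and weights here are invented for illustration; the real ones would come from encoding a few hundred sample reports:

```java
/** Sketch of weighted partial-match scoring in the spirit of the Cats program. */
public class Matcher {
    // similarity of hair-colour codes: 0 = black, 1 = red, 2 = blond (invented values)
    static final double[][] HAIR_SIM = {
        { 1.0, 0.1, 0.1 },
        { 0.1, 1.0, 0.5 },   // red vs blond: a partial match, better than red vs black
        { 0.1, 0.5, 1.0 },
    };

    /** Weighted score of two reports; higher means more likely the same person. */
    static double score(int hairA, int hairB, int heightCmA, int heightCmB) {
        double hairWeight = 3.0, heightWeight = 2.0;   // hair colour matters more
        // heights within 30 cm get partial credit, scaled linearly
        double heightSim = Math.max(0.0, 1.0 - Math.abs(heightCmA - heightCmB) / 30.0);
        return hairWeight * HAIR_SIM[hairA][hairB] + heightWeight * heightSim;
    }
}
```

Computing this score for every pair of reports gives you the correlation matrix described above; the SQL coarse filter keeps the number of pairs you have to score manageable.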

Colour Chart

Have a look at the Netscape Colour Chart. Create something similar as a Java Applet so that you could test your browser to see how it distorts Java colours. There is no problem supporting all the Netscape/X11 colour names -- you can easily add them all as static final Color constants or as ints, so you could dispense with the "using name" columns of the table. Allow the following sorts: alphabetical by colour name, HSB, HBS, SHB, SBH, BHS, BSH, RGB, RBG, GRB, GBR, BRG, BGR. The Applet is a pretty toy that may also help sharpen intuition about how the various colour representations work. See RGB and HSB in the Java glossary. The sorting could be done most rapidly with RadixSort, available with source via the Java glossary.

For an added bonus, add columns that display the hue as the spectral frequency and wavelength.

Then do a colour chooser bean. You can choose the colour by keying: red/green/blue, hue/saturation/brightness, hex number, decimal number or selecting Netscape X11 colour name. As you change each value (or use the spinner controls), all the other ways of expressing the colour change to stay in step.
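java.awt.Color already has the conversions needed to keep the representations in step. A sketch of deriving a chart row from one RGB triple (the row layout is mine):

```java
import java.awt.Color;

/** Sketch: derive the hex and HSB columns of a chart row from an RGB triple. */
public class ColourChart {
    static String row(String name, int r, int g, int b) {
        // RGBtoHSB is pure arithmetic; no display is needed
        float[] hsb = Color.RGBtoHSB(r, g, b, null);
        return String.format("%-8s #%02X%02X%02X hsb %.3f %.3f %.3f",
                             name, r, g, b, hsb[0], hsb[1], hsb[2]);
    }

    public static void main(String[] args) {
        System.out.println(row("red", 255, 0, 0));
        System.out.println(row("tomato", 255, 99, 71));   // an X11 colour name
    }
}
```

The chooser bean works the same way in reverse: when the user edits one representation, recompute the others with RGBtoHSB/HSBtoRGB and refresh the fields.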

Constructor Amanuensis

The Constructor Amanuensis is a Java Program that helps you rapidly generate boring repetitive code to provide a variety of constructors to a class.

Why use it?

One kind of code I find exceedingly boring to write and maintain is the constructor that does nothing but copy a long list of parameters into the corresponding member variables. Then you have to write several variants where you leave out some or all of the fields.

How does this amanuensis work?

  1. You fill in the names of your class and the member variables, possibly by cut/paste or by just pasting the entire code of your class onto the Applet.
  2. You fill in the attributes of the fields by selecting from listboxes or filling in the blanks.
  3. You fill in 0, 1 or 2 lines of JavaDoc for each field.
  4. You click the CLASS button and it generates the skeleton class code for you.
  5. Beside each member variable is a checkbox. You can turn it on or off.
  6. You then click the CONSTRUCTOR button and it generates the code for a constructor using those variables as input.
  7. You can then change the checkboxes and punch out the variant constructors like cookie cutters.
  8. You then click the GET/SET button and it generates the code for all the dummy get set JavaBean property methods for each of the variables. You may throw some of these away later, or polish them up by changing the public scope, or adding safety checks.
  9. The amanuensis might save this template away, so that you could come back to it, modify the comments, variable names, add new variables, etc, and crank out everything again, without having to edit each piece of it individually.
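For a hypothetical two-field class, the sort of boilerplate the amanuensis would crank out might look like this (the class and its fields are invented for illustration):

```java
/** Hypothetical output of the amanuensis for a two-field class. */
public class Thing {
    private String name;
    private int count;

    /** full constructor, generated with all checkboxes on */
    public Thing(String name, int count) {
        this.name = name;
        this.count = count;
    }

    /** variant constructor, punched out with the count checkbox off */
    public Thing(String name) {
        this(name, 0);
    }

    /** generated dummy JavaBean property methods */
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public int getCount() { return count; }
    public void setCount(int count) { this.count = count; }
}
```

Tedious to type by hand for a class with a dozen fields, but entirely mechanical, which is what makes it a good target for code generation.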

Defragger

A defragger is a program that rearranges files on hard disk to speed up access. See defragger in the Java glossary. You would want the defragger to put files in order by last access date so that the rarely used files are near the center (end) of the disk and the most used ones are near the outer edge (beginning). You want each file in one contiguous lump. Ideally, you want to move frequently used system files like the directories, free space tables, and system swap file as well. Within the i-node/MFT table you want to defrag internally as well, putting the directories together and ordering by last access date, within that by directory. You can seek to optimise either file-open time or find-file scans, if you don't have indexes to search rapidly.

A defragger may also do other tidying such as consolidating free space fragments, removing deleted dummy directory entries, sorting directories, and collecting files from the same directory together when their access timestamps are on the same day.

There may also be quick variants of the defragger that just seek to quickly clean up the files with the most fragmentation.

The defragger comes in four pieces:

  1. The algorithm for how you shuffle the pieces of the files around. It is a bit like a game of sliding squares. You can't move a chunk into its final resting place if there is some other file in that slot at present.
  2. The display that entertains and hypnotises the user as you work. You display a coloured map of the disk showing its state and what you are doing. It would display groups of 256 pixels, each pixel representing a cluster, arranged in 16x16 squares. For small disks each cluster may be represented by more than one pixel. You might experiment with other groupings of pixels so that a contiguous file showed as a long thin line, or a blob.
  3. The platform-dependent Java code for accessing and changing the low-level disk structure.
  4. The platform-dependent native code for accessing and changing the low-level disk structure.
Most operating systems have some files you may not move. You have to simply work around them.

Because undebugged defraggers trash hard disks, you will want to write a defragger simulator that is platform independent. You may want to write platform dependent data-gatherers, and ask your friends to send you copies of their messy disk layouts - not the data, just the current file positions and fragmentation so you can test your program on real-world situations.

In later debugging stages, you will want a large hard disk with a free partition you can trash and recreate with a partition copier.

The simulator lets various people try coding the algorithm safely without trashing any disks or getting involved in the low level details of defragging. You can then plug in the most efficient defragger.

Your simulator's scorekeeper should reward doing i/o in large contiguous lumps. It should reward keeping disk arm motions to a minimum both in number and length of leap. It should, as best as possible, simulate true elapsed time.

How do you defrag when there are other tasks using the files as you work?

  1. You can use the operating system defrag interface such as the inept one provided in NT. The problem with NT is that it lets you move only one file at a time, though you can move a run of clusters from that file at a time so long as they are all being moved into a contiguous chunk of free space. You can read about how the NT official defrag interface works at Systems Internals. Moving small chunks at a time this way, shuttling the heads back and forth from source to destination, is very, very slow. It is quite incompatible with the algorithms I have described for fast defragging. Further, you can't afford to do a thorough defrag with such a limitation.

    Ideally, you should be able to specify a long list of moves to make, and NT would do them in one or more sweeps across the disk, handling perhaps dozens of files per sweep, depending on how much RAM it has available. The only restriction is that all the target clusters need to be free before you submit the list. You would not be allowed to do a list of moves that requires an internal ordering of the moves to free up space before the rest of the list of moves can complete. When you need to do that, you must break the list in two and submit them separately.

    If you watch Raxco PerfectDisk defragger's display as it uses the NT interface, it can drive you nuts. Nothing appears to be happening for 5 minutes at a stretch. It behaves like some bored barber endlessly picking away at your hair, cutting individual strands, shortening by never more than a millimeter per snip. Then, all of a sudden, there is a noticeable little burst of activity as it moves a large file, which can be done in reasonably efficient chunks. Then it goes back to its nit-picking.

  2. You can run under some other OS on some other partition so that the partition undergoing defragging is not active. e.g. you might run your defragger under a tiny JavaOS partition to defrag your Win9x FAT, OS/2 HPFS, NT NTFS and Linux ext-2 partitions in sequence.
  3. You do your defragging by copying from one partition to another empty one, perhaps on a spare hard disk.
  4. You run your program very early in the boot process before any other disk-using tasks have started. This is how Partition Magic deals with a similar problem.
What sort of algorithm should you use? Here is a suggestion. Sort all your files so that you have a table of every cluster on disk and where its final destination should be. This table can be compressed where there are contiguous clusters within a file. You need plenty of RAM and virtual RAM.
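That destination table compresses naturally into runs of contiguous clusters. A sketch, with the class, record and method names all mine:

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: compress a per-cluster destination table into contiguous runs. */
public class MoveTable {
    /** One run: len clusters starting at src all move to dst onward. */
    record Run(long src, long dst, long len) {}

    /** dest[i] is the destination cluster of cluster i, or -1 if it stays put. */
    static List<Run> compress(long[] dest) {
        List<Run> runs = new ArrayList<>();
        int i = 0;
        while (i < dest.length) {
            if (dest[i] < 0) { i++; continue; }
            int start = i;
            // extend the run while both source and destination stay contiguous
            while (i + 1 < dest.length && dest[i + 1] == dest[i] + 1) i++;
            runs.add(new Run(start, dest[start], i - start + 1));
            i++;
        }
        return runs;
    }
}
```

Each run is then a single "passenger" for the bus described below: one pickup, one drop-off, instead of one head movement per cluster.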

Imagine the disk arms are like a bus picking up passengers and dropping them off as it sweeps back and forth across the disk. The clusters being moved are like passengers. Here are some rules of thumb.

  1. Don't pick up any more passengers if the bus is full.
  2. Try to keep the bus full.
  3. Don't pick up a passenger unless the passenger's destination slot is currently free.
  4. Don't pick up a passenger unless the destination slot is in the direction you are currently headed.
  5. Sweep the disk smoothly like an elevator. Don't hop all over picking up and dropping off passengers the way most defraggers do in their passion to defrag the disk methodically starting at track 0.
  6. Prefer to pick up or drop off a line of passengers contiguous on disk.
  7. Prefer to pick up clusters belonging to badly fragmented files.
  8. Prefer to pick up clusters that belong to recently used files.
  9. Prefer to pick up passengers that are far from their destinations.
  10. If there are no passengers you can move directly to their destinations, free up some space by moving a busload of passengers to any free slots just past the end of where all the files will eventually be moved. Don't move them all the way to the end of the disk the way Norton's defragger does.
The conventional wisdom is that defraggers must be crashproof. If the power fails at any time during the defrag process, you must lose no data. Perfection is impossible, since you can always fail in the middle of a write to the directory or the free space table. However, you can aim for recovery if the failure occurs between writes. You must move files before you mark them moved in the directory and free space tables. For a moment they exist in two places. Only after the move is recorded in the directory may you reuse the old space. The problem is that crashproofing makes the defrag take longer. NT takes this to ridiculous extremes. It would take days to properly defrag a disk using the official one-cluster-at-a-time defrag interface. Ironically, this extra time exposes you to more potential for power failure.

You might try simulating the algorithms used in the commercial defraggers such as Norton SpeedDisk for Windows9X, and NT, Executive Software Diskeeper, Raxco PerfectDisk, or PowerQuest Partition Magic. Then if your algorithms are substantially faster, you might be able to persuade them to reissue their products using your algorithms.

Because of the way it is written, your defragger might have more general application, e.g. defragging the memory heap in a JVM to optimise virtual RAM swapping. You might use it inside proprietary firmware inside a hard disk to handle defragging/marthaing that was transparent to the operating system, thus creating a seemingly super-fast drive. It would appear to have almost 0 seek time. It might be implemented internally as two separate drives that spend their lives mirroring each other while simultaneously defragging to the other drive. By having two sets of heads, either on the same set of surfaces or on separate sets, you would get away with much less head motion. The software can then be as messy as it likes. The hardware will clean up after it. In fact, the irony will be, defragging with software will cause the system to temporarily slow down!

Deleter, Deletes Files with Wildcards

Programs like 4DOS and 4NT are painfully slow if you want to sweep your hard disk to remove junk files, e.g. temp?.*, temp.*, *.tmp, C:\temp\*.*. We need a program that will take a set of wildcards (positive and negative), and in one pass over the directories delete the described files. With the /S switch you request recursion over subdirectories. You should also be able to specify multiple directories with wildcards (positive and negative).
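One way to test many wildcards in a single pass over the directories is to translate each DOS-style pattern into a regex once, up front. A sketch (class and method names mine):

```java
import java.util.regex.Pattern;

/** Sketch: translate DOS-style wildcards into regexes for one-pass matching. */
public class Wildcards {
    static Pattern toPattern(String wild) {
        StringBuilder re = new StringBuilder();
        for (char c : wild.toCharArray()) {
            switch (c) {
                case '*': re.append(".*"); break;        // any run of characters
                case '?': re.append('.');  break;        // exactly one character
                default:  re.append(Pattern.quote(String.valueOf(c)));   // literal
            }
        }
        return Pattern.compile(re.toString(), Pattern.CASE_INSENSITIVE);
    }

    public static void main(String[] args) {
        System.out.println(toPattern("temp?.*").matcher("temp1.dat").matches());
    }
}
```

With all the positive and negative patterns precompiled, each file name encountered during the directory sweep is matched against the whole set before deciding whether to delete it.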

Delta Creator

Let us say you send somebody a file, then you change the original file. Instead of sending the whole new file, you can send just the changes. How do you detect the changes?

I suggest inventing a generic way to do this that works with files other than just text files. Current schemes are not very bright and end up creating delta files almost as big as the new file. They are not good at noticing that text has simply been reordered.

Here's how you might tackle it. For each file type you write a chunker. It divides the file into chunks. For a text file, it may divide at each newline character. For an MSWord file it may divide at each paragraph. For a FORTH BLK file, it might divide at each 1024 bytes. For a Btree file it chunks out each record...

Then you compute a hash on each chunk of the original file and index it in a Hashtable along with a reference offset and length into the old file. You don't put the text itself in the Hashtable.

To compose the delta file, you look at each chunk in the new file. You compute its hash, and look it up in your Hashtable. If you find a match, you make a note that the chunk can be had from offset/length of the old file. If you don't find a match, you have to insert the chunk from the new file into your delta file. Very often you will get a sequential run of matching chunks from the old file. In your notes, you can collapse these entries into one big offset/length reference.
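A toy version of that scheme for text files, using each line as a chunk. The COPY/INSERT op format is my invention, and real code would hash with something stronger than String.hashCode:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy sketch of the chunk-hash delta scheme, one line per chunk. */
public class DeltaSketch {
    /** Index hash -> position of the chunk in the old file (first occurrence wins). */
    static Map<Integer, Integer> index(List<String> oldChunks) {
        Map<Integer, Integer> map = new HashMap<>();
        for (int i = 0; i < oldChunks.size(); i++)
            map.putIfAbsent(oldChunks.get(i).hashCode(), i);
        return map;
    }

    /** Emit COPY references for chunks found in the old file, literals otherwise. */
    static List<String> delta(List<String> oldChunks, List<String> newChunks) {
        Map<Integer, Integer> map = index(oldChunks);
        List<String> ops = new ArrayList<>();
        for (String chunk : newChunks) {
            Integer at = map.get(chunk.hashCode());
            // compare the actual text too: hashes alone can collide
            if (at != null && oldChunks.get(at).equals(chunk))
                ops.add("COPY " + at);
            else
                ops.add("INSERT " + chunk);
        }
        return ops;
    }
}
```

Collapsing sequential runs of COPY operations into one big offset/length reference, as described above, is the obvious next refinement. Note that this scheme handles reordered text for free: a moved chunk still hashes to a match.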

Creating delta files of *.exe and *.com files requires some special handling. Experiment by adding a single instruction to a program and notice how that tiny change ripples and causes changes throughout the rest of the file. The idea is to send only the fact that one extra instruction was added, and somehow regenerate all the rippling side effects instead of transmitting them.

Once you have these low level tools in place, set up a scheme so that customers are automatically kept up to date from a master copy, using the Internet. The scheme should recover even if the customer accidentally deletes files, corrupts files, or restores out-of-date files.

Dissolve Designer

I saw a special effect for the first time on TV in a bank commercial. It was a dissolve from one scene to another. The image first dissolved into tiny coloured tiles (each the average colour of the pixels in that tile). The tiles independently scooted about to new positions where they formed a new image. During this time the tiles smoothly changed colour and the fatness of their diamond shape to morph into the new image. The transition was a sort of two-bladed cyclone pattern.

I thought a dissolve designer might be a fun toy for a newbie to write. The program may run slower than TV, but it does not have to work in real time; its output can be recorded and then displayed at full speed on TV.

The end user could control various parameters of the dissolve (e.g. tile size, intermediate colours, trajectories) and see the results on pairs of images he supplies. A fancier version would let the end user write simple functions to be included. You might parse these or compile them and load them via Class.forName, the way the LearnToCount program does.
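The Class.forName trick might look like the sketch below: the user compiles a class implementing a known interface, and the designer loads it by name at run time. The Trajectory interface and Wave class are invented names purely for illustration.

```java
public class PluginDemo {
    /** The contract a user-written dissolve function must implement. */
    public interface Trajectory {
        double position(double t);   // progress 0.0 .. 1.0 along the path
    }
    /** An example user-supplied plug-in: an ease-in curve. */
    public static class Wave implements Trajectory {
        public double position(double t) { return t * t; }
    }
    /** Load a user's compiled class by name and cast it to the interface. */
    public static Trajectory load(String className) {
        try {
            return (Trajectory) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
        } catch (Exception e) {
            throw new RuntimeException("cannot load plug-in " + className, e);
        }
    }
    public static void main(String[] args) {
        Trajectory t = load("PluginDemo$Wave");
        System.out.println(t.position(0.5));   // 0.25
    }
}
```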

Domain Registry

Domain registry exchange where all domain names for sale are listed. You can search for suitable names by many different criteria. The project supports itself by banner ads, broker subscriptions (ad-free), and possibly a small listing fee for a prime real-estate location. To be successful you will need to be perceived as the place to go where all names are listed, whether the owners pay a listing fee or not.

You need a way for owners to register their names online. You can get the owner details from the InterNIC database. You should send the owner an automated email asking permission to list. That way you can foil pranksters putting up other people's domain names for sale.

Part of this database would be who is ultimately in charge of assigning names for each TLD (Top Level Domain), what the yearly fees are, and whether the server must be located in the country of registry. Check with IANA to begin your researches.

Another part of the database would be brokers. You track which TLDs they handle, and what their fees are, and where on the planet they have their offices. After you get a fair sized database, you could start charging for a listing, or for a prime real estate listing. Check with RIPE and CoreNic to start your researches collecting brokers.

See the essay on domain names for further background.

Domain Search Engine

This is a database designed solely to help you find a website. By forcing all hits to point only to a home page, you prune out a huge amount of material, making it faster to find a website.

A variant of this is a product database. All hits in it point to a single definitive page on the vendor's website.

Dynamic Version Control

Current version control systems stink big time. PVCS drives you mad because the modules you want are always locked by someone else. CVS drives you mad because other people's changes constantly stop you from getting a clean compile. You have the problem of manual or automated merges of changes done based on superficial similarities in bits of text. The process is hideously error prone. Talking about the last stage of the project has branded me a lunatic. Here is my vision.

We will have programming teams scattered over the planet. They will simultaneously be working on the same code over a 24-hour high-speed Internet connection, ADSL or better. As you make each change to the source code, that change is broadcast and reflected to anyone who is simultaneously looking at that part of the program.

Changes are made atomically by syntactic element, not character by character. They are transactions to your database. The database always reflects a syntactically correct program, including correctly spelled variable names. It might not work, but it is syntactically correct at all times.

I now have to wave my hands a bit. Each programmer works logically on a transparent sheet of film overlaid on the program, called a job sheet (or job for short). He puts his changes on that film. Other programmers work on their own layers of film. Each layer of film represents a job -- a unit of work to make some complete change. A programmer may have several layers on the go at once, and several programmers may work on the same layer.

You can choose to hide or apply other programmers' layers to your work. You reveal them to see how upcoming changes will interfere with your work. You can hide them if the new layers don't quite work yet and you need to test your code.

There is a process whereby a layer is proven to work and gets incorporated into the base layer. It can be backed out, but basically once it is incorporated, you have to deal with any changes it has created.

Highlighting helps you see areas where your changes are conflicting with somebody else's both before and after that person's changes are committed. Since both of you can see the conflict arising, there is a good chance the two of you will have a conversation about how to create easily mergeable code.

Because the code is in a database, and every syntactic element in the program has a unique ID, there is less guesswork about what will happen when changes are merged. Further, both of you see the merged changes on your screen as you work. You can flip back and forth removing and adding other layers from your view without changing any code, just as you can remove or add layers for your compiles.

Compilation takes little time since the database is already pre-parsed. Even so, you may have some background speculative compilation process on the server, working on whatever piece seems most stable, so that when you are ready to run to debug, most of the compilation will already be complete. When the server has nothing better to do, it can do various optimisations and native code generation, just so it will be ready when you ask for them.

How could such a thing be implemented? I imagine it might work like this: You have a central repository of code that lives pretty much entirely in RAM. It may have some ability to store some of the lesser used information on disk, and to lazily back it up to disk. However, persistent store classes let the designer of the SCID pretty well pretend the entire database is RAM resident.

You have a parse tree. You submit transactions to a common change server queue to request changes to the parse tree. The parse tree is never in an inconsistent state. It is impossible for it to represent an invalid program. Transactions are checked before they are applied atomically. So you don't change the parse tree keystroke by keystroke, but node by node. You can reconstruct the parse tree by replaying various streams of transactions. Every transaction is timestamped, and stamped with author, job layer and reason for the change. A global rename, for example, could be done with a single transaction. It would be trivial since there would actually be only one copy of the name stored in the database. All other references would be numeric or by address.
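To make the single-copy-of-each-name idea concrete, here is a toy sketch: the tree would hold numeric symbol IDs, so a global rename is one journalled transaction against one table entry, visible to every reference at once. All class and field names here are illustrative, not a proposed API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RenameDemo {
    // id -> name: the ONE copy of each identifier in the whole database.
    static Map<Integer, String> symbols = new HashMap<>();
    // Journal of transactions: timestamp, author, job layer, what changed.
    static List<String> log = new ArrayList<>();

    /** A global rename: a single atomic, stamped transaction. */
    static void rename(int id, String newName, String author, String job) {
        log.add(System.currentTimeMillis() + " " + author + " " + job
                + " rename #" + id + " -> " + newName);
        symbols.put(id, newName);   // every node referencing #id now shows newName
    }
    public static void main(String[] args) {
        symbols.put(42, "cnt");
        rename(42, "rowCount", "alice", "job-7");
        System.out.println(symbols.get(42));   // rowCount
    }
}
```

Replaying the journal in timestamp order is exactly the "reconstruct the parse tree by replaying streams of transactions" idea from the text.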

When you have several remote sites, you can replicate the entire database at each site. You might have a centralised scheme or a peer-to-peer scheme. Centralised is better if the communication lines are very high speed and the server is very powerful compared with the workstations. In a centralised scheme, for speed, the server broadcasts copies of both the source code database and the compiled results to everyone so they have local copies for high speed access. Basically transactions are broadcast and all copies of the database update themselves.

If a database is corrupted, it is automatically regenerated from one of the remaining copies.

Transactions are serialised by timestamp, shared, and processed in standard order between all sites to keep all sites up to date. There has to be some mechanism for provisionally changing the local copy of the parse tree while waiting for the earlier-dated transactions to come from the other sites.

Fast Snail Mail

If you want to send something to Indonesia fast, you are out of luck. What you could do is send it electronically, have it printed there or copied to floppy, then delivered by courier to a non-connected Indonesian company. They could send stuff back by having it scanned (possibly OCRed) and transmitted the same way. With a decent scanner you could handle large documents or high resolution. All transmissions are compressed, digitally signed and encrypted.

You build an easy to use suite of software to handle electronic submission, routing and billing. You then encourage people throughout the world to set up interconnecting stations. It would be low overhead, and an extra source of income. For the most part the owner of the site would have to do absolutely nothing except put printouts in envelopes and apply the pre-printed labels. Submissions are handled completely electronically.

If the law will allow it, you could also send money by the same scheme. You might be able to bypass banking laws by buying and selling some token, such as shares in the company, or computer services. Banks charge about $50 to wire money to Indonesia, and they keep it for a couple of weeks before delivering it. They make you fill in pages of forms, and wait in lines for hours. This service could handle it for peanuts with no wait time. Unfortunately with a buy/sell kludge you could end up with huge fees in sales taxes.

Foreign Language Trainer

The purpose of this Applet/application is to teach someone to pronounce words in a foreign language correctly. I will assume that language is Esperanto, though it would also apply to other languages. You can start simply and work up.

Form Filler

This is designed as a plug-in to a browser such as Netscape. The plug-in looks at a form, usually one containing name, address etc., and fills it in reasonably intelligently from a set of canned fields configured into the browser. It would have heuristic logic to decide which data goes in which fields. Test it out at various shopping cart sites and various sites that don't take email, where they want you to fill in a form to register a complaint. Even if you get it only 80% right, you still save the user time typing.

Design a standard set of field names. Give your standard a fancy acronym. Bug websites who don't follow it, and send them a copy, explaining the advantages of standard field naming, and giving plenty of examples. Extend for other languages besides English. Make sure you can handle telephone numbers and addresses outside the USA.
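The heuristic logic might start as crude keyword matching of the form field's name against each canned value. The canned values and keyword lists below are invented examples, not a proposed standard.

```java
import java.util.HashMap;
import java.util.Map;

public class FieldGuesser {
    // canned value -> keywords that suggest a field wants that value
    static Map<String, String[]> keywords = new HashMap<>();
    static {
        keywords.put("Roedy Green", new String[]{"name", "fullname", "contact"});
        keywords.put("roedy@example.com", new String[]{"email", "e-mail", "mail"});
        keywords.put("Victoria", new String[]{"city", "town"});
    }
    /** Guess which canned value belongs in a form field, or null if no clue. */
    static String guess(String fieldName) {
        String f = fieldName.toLowerCase();
        for (Map.Entry<String, String[]> e : keywords.entrySet())
            for (String k : e.getValue())
                if (f.contains(k)) return e.getKey();
        return null;   // leave the field blank rather than guess wrongly
    }
    public static void main(String[] args) {
        System.out.println(guess("cust_email_addr"));  // roedy@example.com
        System.out.println(guess("shirt_size"));       // null
    }
}
```

A fancier version would score several candidate matches and also look at nearby label text, not just the field's internal name.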

HTML Glossary Presenter

It takes a glossary and imports it into an SQL database with crosslinks between entries. You might maintain it as a database and export HTML periodically for the Java challenged, or you might maintain it as HTML and periodically import it into the SQL database.

You use a Java program to browse it. When you are reading an entry, in the background it tries to guess what you will read next and reads links, and also the alphabetically previous and next entries. The Applet just hands the HTML record off to the browser to render.

If all goes well you should be able to browse almost as fast as if the entire database were already downloaded in RAM. A simpler version would simply take "See xxx,yyy" entries in the glossary and verify/build/rebuild HTML links for them, generating plausible, unique NAME targets.

GridBag Amanuensis

I have used the Symantec Visual Café 2.5a visual layout editor to generate source code for GridBag layouts. As you work, the display adjusts immediately to show you the effective results of the new constraints. It is quite a bit better than coding by hand and doing compile test cycles, but it still leaves a fair bit to be desired. Here are some additional features I would like, besides the ability to fill in GridBag constraints via fill-in-the-blanks.

Hermit Crab Variable Length Record Files

Hermit crab files allow you to efficiently manage variable length records that can grow and shrink. You can access records by an integer relative record number. Hermit crab files do not provide access by key, though you might build an in-RAM index or Hashtable by key fields that gives you the relative record number. They are called hermit crab files because as records grow, they abandon space "shells", which are then recycled for use by smaller records.

For details on how they work see this essay.
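The shell-recycling idea can be sketched in RAM with a free list keyed by shell size. The offsets here stand in for positions in the real file, and the allocation policy (smallest abandoned shell that is big enough, else grow the file) is just one plausible choice.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.TreeMap;

public class ShellAllocator {
    // shell size -> offsets of abandoned shells of exactly that size
    static TreeMap<Integer, Deque<Long>> free = new TreeMap<>();
    static long nextFree = 0;   // end-of-file high-water mark

    /** A record outgrew its slot: recycle the abandoned shell. */
    static void abandon(long offset, int size) {
        free.computeIfAbsent(size, s -> new ArrayDeque<>()).push(offset);
    }
    /** Find the smallest shell big enough; otherwise grow the file. */
    static long allocate(int size) {
        Map.Entry<Integer, Deque<Long>> e = free.ceilingEntry(size);
        if (e != null) {
            long off = e.getValue().pop();
            if (e.getValue().isEmpty()) free.remove(e.getKey());
            return off;
        }
        long off = nextFree;
        nextFree += size;
        return off;
    }
    public static void main(String[] args) {
        long a = allocate(100);   // fresh space at offset 0
        abandon(a, 100);          // record grew and moved out of its shell
        long b = allocate(60);    // a smaller record reuses the shell
        System.out.println(a + " " + b);   // 0 0
    }
}
```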

HTML Table Reorganiser

This lets you extract an HTML table and reorganise it, e.g. remove a column, transpose two columns, add an empty column, permute the columns. Basically you hand it a tuple:

1,9,3,8,0,0,2

which means: construct a new table composed of column 1 of the original, then 9, then 3, then 8, then a dummy, then another dummy, then column 2. Ignore the rest of the table.

The other thing it can do is transpose, convert the rows to columns and the columns to rows.
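Once the HTML rows are parsed into arrays, the tuple operation itself is tiny. In this sketch columns are 1-based and 0 means a dummy empty column, matching the example above.

```java
import java.util.ArrayList;
import java.util.List;

public class TableReorg {
    /** Build new rows by picking original columns per the tuple; 0 = dummy. */
    static List<String[]> reorg(List<String[]> rows, int[] tuple) {
        List<String[]> out = new ArrayList<>();
        for (String[] row : rows) {
            String[] r = new String[tuple.length];
            for (int i = 0; i < tuple.length; i++)
                r[i] = tuple[i] == 0 ? "" : row[tuple[i] - 1];
            out.add(r);
        }
        return out;
    }
    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[]{"a", "b", "c"});
        for (String[] r : reorg(rows, new int[]{3, 0, 1}))
            System.out.println(String.join("|", r));   // c||a
    }
}
```

Transposition is similarly mechanical once the cells are in arrays; the real work in this project is parsing and re-emitting the HTML.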

HTML Table Sorter

This lets you extract an HTML table and sort the rows based on the specified columns, e.g.

1,4

means sort all the rows using column 1 as the primary key and column 4 as the secondary tie-breaker key.
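With the rows parsed out of the HTML, the two-key sort is a chained Comparator. This sketch assumes plain string comparison and 1-based column numbers as in the tuple above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TableSort {
    /** Sort rows on the primary column, breaking ties on the secondary. */
    static void sortRows(List<String[]> rows, int primary, int secondary) {
        rows.sort(Comparator
                .<String[], String>comparing(r -> r[primary - 1])
                .thenComparing(r -> r[secondary - 1]));
    }
    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[]{"b", "2"});
        rows.add(new String[]{"a", "9"});
        rows.add(new String[]{"a", "1"});
        sortRows(rows, 1, 2);
        for (String[] r : rows) System.out.println(r[0] + r[1]);  // a1 a9 b2
    }
}
```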

After you have that working, tackle the problem of converting a long skinny single-row or single-column table into a grid, either by rows or by columns. Then convert it back again to either a single row or single column. You might even also have logic to count the elements, and take a square root to determine the grid size to make a square grid.

Then consider the problem of merging two tables, and selecting columns or rows from the combined table. For example you might want to take two single-column tables and merge them to create a single table with two columns. You might want to join the tables by matching a common key using an inner or outer join. Sounds like SQL, doesn't it? Consider implementing the whole reorg utility by importing tables into SQL, doing all your fancy stuff in SQL, then exporting tables. You need a way of stripping tags such as <b> so that SQL can sort columns. However you need to keep the tags around for when you export HTML tables again.

HTML Tidier

The tidier makes it easier to perform edits on the HTML source with a text editor. It might do things such as:

Ideal lover database search

Registry to help you find your ideal lover using principles from Hendrix, Keyes and Vialogos. You fill out a very detailed HTML questionnaire and get a list of the best matches. It is an exercise in inner and outer honesty. Scanned photos too. I have already been doing this manually for gay people.

Infinite Disk, integrated file migration/backup

You write a program that backs up rarely used files over the Internet, using an ADSL connection to a server. Whenever space gets low, it frees disk space of files that have been backed up. When the user goes to open those files, there is a pause while they are retrieved from the server.

This scheme also acts as backup. All files can be backed up. To speed backup, before they are sent to the server they can be super compressed, and only changes sent via deltas.

The backup server can conserve disk space by noticing that, for example, two customers both have identical copies of MS Word For Windows DLLs installed. The server only needs to keep one copy of each DLL. It has to be careful. Customers may have identically named files that are not identical.
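One way to keep only one copy is to key stored blobs by a digest of their content rather than by file name: identical DLLs from two customers then collapse to one entry, while identically named but different files stay separate. A really paranoid server would also compare bytes whenever two digests collide. This is a sketch, not a storage design.

```java
import java.security.MessageDigest;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

public class DedupStore {
    static Map<String, byte[]> blobs = new HashMap<>();  // digest -> one stored copy

    /** Store content under its digest; identical content is stored only once. */
    static String store(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            String key = Base64.getEncoder().encodeToString(md.digest(content));
            blobs.putIfAbsent(key, content);   // second identical copy is free
            return key;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
    public static void main(String[] args) {
        store("MSWORD.DLL bytes".getBytes());
        store("MSWORD.DLL bytes".getBytes());   // second customer, same content
        store("different bytes".getBytes());    // same name would not matter
        System.out.println(blobs.size());       // 2
    }
}
```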

This is a very old idea. Univac 1106 mainframes with far less than a megabyte of RAM used to integrate file migration with backup to tape.

INI File Tidier

Windows 9x/NT still use INI files in addition to the registry. A *.INI file has [sections], parameters and comments. The task of your program is to sort all the sections in alphabetical order, and within each section sort all the parameters in order. Any sections without any entries are deleted. There may be duplicate sections in the original. Merge them. When there are duplicate parameters in a section, take the first one and ignore the rest. Delete any parameters with null values. For readability, put a blank line between sections.

The only thing tricky about this is handling comments. Does a comment belong with the parameter/section above or below? If there is a blank line separating it from one but not the other, the answer is obvious. If the comment is the first or last line of the file, again the answer is obvious. If you can't resolve what to do with the comment you can handle the ambiguity in one of four configurable ways:

  1. Get the user to resolve the ambiguity by inserting a blank line, then rerunning your program.
  2. Assume it belongs to the line above.
  3. Assume it belongs to the line below.
  4. Assume it belongs to both, and duplicate the comment.
By configurable, I mean the user makes this choice ahead of time. You don't ask the user on every comment.

You could do the sorting and merging of duplicates in at least three ways:

  1. Create one giant array with cleverly constructed keys and sort it. Then merge.
  2. Create an array of arrays with variable length rows. Sort the rows then sort each column. Then merge. You may need two or more passes through the data to find out how big to make the rows and columns.
  3. Create a tree -- binary, BTree or n-ary. As you build the tree, keep it in order and merge duplicates.
Since INI files are not that large, you can use totally RAM-resident techniques, and you don't have to worry too much about the efficiency of your algorithm in terms of order(n).
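Approach 3 in miniature: a TreeMap of TreeMaps keeps sections and parameters sorted as they are inserted, merges duplicate sections, keeps only the first of duplicate parameters, and drops sections left empty. Comment handling, the genuinely tricky part, is deliberately left out of this sketch.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class IniTidy {
    static String tidy(List<String> lines) {
        TreeMap<String, TreeMap<String, String>> ini = new TreeMap<>();
        TreeMap<String, String> section = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith(";")) continue;  // skip comments
            if (line.startsWith("[")) {
                String name = line.substring(1, line.length() - 1);
                section = ini.computeIfAbsent(name, n -> new TreeMap<>()); // merge dups
            } else if (section != null && line.contains("=")) {
                String key = line.substring(0, line.indexOf('=')).trim();
                String val = line.substring(line.indexOf('=') + 1).trim();
                if (!val.isEmpty()) section.putIfAbsent(key, val); // first one wins
            }
        }
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, TreeMap<String, String>> s : ini.entrySet()) {
            if (s.getValue().isEmpty()) continue;   // drop empty sections
            out.append('[').append(s.getKey()).append("]\n");
            for (Map.Entry<String, String> p : s.getValue().entrySet())
                out.append(p.getKey()).append('=').append(p.getValue()).append('\n');
            out.append('\n');                       // blank line between sections
        }
        return out.toString();
    }
    public static void main(String[] args) {
        System.out.print(tidy(List.of("[b]", "z=1", "a=2", "[a]", "x=", "[b]", "z=9")));
    }
}
```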

Once you have that working, you can then create variants on a theme. Merge a small ini file of changes into a big one. Setting a parameter to nothing deletes the parameter. A section with no entries is deleted.

Compare two messy ini files and display the essential differences by creating a tidied, comment-less version of both and comparing the parse trees in RAM.

Get your friends to donate a variety of real-life messy INI files, the dirtier the better, for you to test your program on.

Once you have that working, create this variant. You create a template example ini file that contains all the currently legal sections and fields. You use that model to clean out a messy INI file that is full of junk fields left over from previous versions of the program.

See Sort in the Java glossary.

Java File System

One problem with Java is it lacks a portable file system. Because every file system is different on every platform, Java implements the lowest common denominator.

What you do is allocate a giant file, say 2 gigabytes. It looks like one file to the host OS, but to Java it will be a file system containing many files. If you do your work well, eventually it may be the only file system of some future OS. You invent methods to allocate space inside this file for subfiles. You want to create all the ordinary file services:

  1. Lookup file by fully qualified name.
  2. Lookup file by high order part of name.
  3. Lookup file by low order part of name.
  4. Lookup files by last access or modify date. This allows a program like AltaVista Discovery to rapidly discover the files that have changed and need reindexing. Recently deleted files need to be logged too so that indexers can drop them from the indexes. Perhaps privileged application programs should be allowed to hook themselves in and monitor all file status changes system wide.
  5. Enumerate files that match some filter composed of a number of positive and negative wildcards, or user-written.
  6. Lookup file by handle.
  7. read/write sequential and random.
  8. caching, and delayed write (lazy write) caching.
  9. associations -- by extension is a bit dangerous. A file should be permitted to have several associations.
  10. Which app installed/created this file? Is it user data, or part of an app package? URL where I can get an update of it if it is corrupt.
  11. Security -- owners, read-only, hidden, system, file signatures/keys/checksums.
  12. Dates -- last access, last update.
  13. locking.
  14. defragging. See the notes on the defragging project. The whole key to a fast file system is to organise your disk layout to minimise arm movement, i.e. to keep it defragged as you go.
  15. Marthaing -- keeping the most active files in the prime real estate near the outer edge (near the beginning of the disk) near the swap file and the directory, and the least active files, and the bulk of the disk space, in the inner tracks (near the end of the disk). This could be done on a per-cylinder, per-file or per-cluster basis.

    This file placement might presume a weekly reorg, or even better an ongoing low priority background reorg.

    When you start using a file, it gets moved by a background process to prime real estate near the MFT, swapfile and directories. Background processes move files that have not been used in a while to less desirable real estate further away, freeing up space, but leaving not too much free space in the desirable territory. The bulk of your free space should be out in the least desirable territory. Roughly speaking, files should be contiguous, and in order by last access date, and within that by directory, with the most recently accessed files in the more prime locations.

    The reorg process might be more integrated with ordinary I/O, so that whenever you write, that cluster gets moved to prime real estate as a side effect. You might do all your writes sequentially, interleaving writes from various files. If you had a two headed disk, you could dedicate one head to writes and one to reads. The write head would not move much. You write in a sequential circular squirrel cage pattern to a hot zone on disk. A background process moves the oldest data out of the hot zone so you don't wrap around and overwrite. You can think of it as a second level of disk caching. You have a primary cache in RAM, and a secondary level in prime disk real estate.

    CPUs are getting faster relative to hard disk. RAM for caching control information is getting cheaper. This means that, in future, it will pay to use more and more elaborate techniques to minimise disk head motions. You will be able to afford to do a lot of thinking to avoid a tiny bit of needless movement.

    This leaves the problem of a giant file that you update only rarely. Perhaps the positioning algorithm needs to take into account both frequency of use and last access date. Perhaps you need a decaying hit counter: you multiply previous hits by 0.9 and add today's hits to determine priority for prime real estate. (The actual decay algorithm would be more sophisticated.) You would also factor in size. A 1K file with 10 hits is much more deserving of prime real estate than a 100 MB file with 10 hits.

  16. Temporary files -- clean up of orphaned temporary files. Tracking who orphaned them.
  17. Getting more advanced, you might want an API for persistent objects. Traditional file systems are hierarchical. You need something so you can find stuff in a million different ways.
  18. Your filesystem would probably be designed with the idea of massive amounts of RAM being available, and would not worry quite so much about getting everything on rotating media immediately. You would only have to deal with graceful shutdown, perhaps with 10 minutes notice from your UPS. This would give you a big speed boost over traditional file and database systems that are overly paranoid.
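The decaying hit counter from point 15 might be sketched like this. The 0.9 decay factor comes straight from the text above; the per-kilobyte size weighting is my own guess at how size might be factored in.

```java
public class FilePriority {
    /** Yesterday's score decays by 0.9; today's hits are added. */
    static double decay(double previousScore, int todaysHits) {
        return previousScore * 0.9 + todaysHits;
    }
    /** Higher means more deserving of prime disk real estate:
     *  the same hit score counts for more on a small file. */
    static double priority(double score, long sizeBytes) {
        return score / Math.max(1, sizeBytes / 1024);   // per-kilobyte weighting
    }
    public static void main(String[] args) {
        double small = priority(decay(0, 10), 1024);             // 1K file, 10 hits
        double big = priority(decay(0, 10), 100L * 1024 * 1024); // 100 MB, 10 hits
        System.out.println(small > big);   // true: small hot file wins
    }
}
```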
You can start very simply: without a directory, requiring users to allocate files by absolute block numbers, and just ensuring you don't allocate the same block twice. You have been bitching that Microsoft and IBM could not write a file system to save their lives -- here's your chance to see if you can write a faster one. Once you prove your structures, maybe the MS, IBM or Linux folk will consider supporting your new partition type.

Java Powered TV Commercials

Does it drive you nuts when they play the exact same TV commercial over and over? In future, Java powered TV commercials will slightly vary the commercial each time it is played. This has several advantages. The commercial contains MPEG video, audio snippets, and text overlays that are combined into a full length commercial each time it is broadcast.

Your job as students is to create a prototype commercial, perhaps with RealPlayer video at 5 frames a second, or just stills, that demonstrates the concept, and can be used for more interesting Web page ads. The differences introduced need not be great. They may just be the top three takes of any one scene instead of the top one. They may just be differences in order, background music, voiceover, or titles with the same or almost the same video.

See random numbers in the Java glossary.

Java Source Code Beautifier

This tidies Java source code to the Sun specification. It need not be a generic, totally-configurable program. Just go for the Sun standard. The one possible exception would be allowing you to turn off K&R brace formatting and use the traditional format where braces align. A parser such as JavaCC or ANTLR might be useful. See parser in the Java glossary.

You might do some additional tidyings:

The problem is the added comments are bulky. A better solution is a SCID that lets you hop to the next whatever you want to see, and visually highlight important features of the language without using a lot of precious screen real estate.

You might tackle just a small subset of the problem, e.g. converting braces between same-line K&R style and line-by-themselves style. After you succeed, you might add another feature, then another.
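That brace-conversion subset might start as a simple line transformer that moves a trailing K&R brace onto its own line, aligned with the start of the statement. It naively ignores braces inside strings and comments, which a real beautifier would have to handle with a proper parser.

```java
import java.util.ArrayList;
import java.util.List;

public class BraceMover {
    /** Move a trailing "{" from a K&R line onto its own aligned line. */
    static List<String> unKandR(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.endsWith("{") && trimmed.length() > 1) {
                // Indent of the statement = whitespace before its first char.
                String indent = line.substring(0, line.indexOf(trimmed.charAt(0)));
                out.add(line.substring(0, line.lastIndexOf('{')).stripTrailing());
                out.add(indent + "{");
            } else {
                out.add(line);   // already on its own line, or no brace
            }
        }
        return out;
    }
    public static void main(String[] args) {
        for (String s : unKandR(List.of("  if ( done ) {", "  }")))
            System.out.println(s);
    }
}
```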

Java Source Code Presenter

This converts Java source into HTML for public presentation, much like Jonathan Revusky's JavaToHTML. It would colour various syntactic elements. It would use bold on a variable in the place it is defined. It would use variable sized () and {} depending on how deeply nested they are. You can see some examples of the sort of output I'm after by looking at the sample Java source code in various parts of the Java glossary such as the HSB or properties entry. It might use special markings to delimit methods and classes. It would render the HTML in JavaDOC comments (e.g. <b>length</b> &lt; 100) in human form (e.g. length < 100). For importing ordinary Java source code, a parser such as JavaCC or ANTLR might be useful. See parser in the Java glossary.

The tricky part is figuring out how to do decent variable left indenting in HTML. I know of three techniques:

  1. Use a dummy gif and stretch its width. This is slow, verbose, and does not look all that good in all browsers. Remember to use border="0" to avoid a line around the dummy gif. The advantage is you can also control vertical spacing by stretching the gif height.
  2. Use <ul> to indent and </ul> to unindent. This has the unwanted side effect of extra blank lines.
  3. Use <td colspan="n"></td> <td colspan="max-n">real stuff</td>. You can't control the indentation column widths with <TABLE CELLSPACING="30">, because it affects vertical spacing as well. You hide the tableness of the whole thing by using <TABLE BORDER="0">. This technique has the added advantage of properly indenting wrapped long lines that don't quite fit on the screen in some browsers. See the HTML used to create the waterfall entry in the Java glossary.

Java Source Code SCID-style browser/editor

SCID means Source Code In Database. This is an enormous project, but you could start small. The basic idea is you pre-parse your code and put it in a database. The problem is programs are getting huger and huger. We need tools to help you temporarily ignore most of a program so you can concentrate on your immediate needs. We need tools to rapidly navigate programs. We need tools to help you get a mental forest picture before delving into the tree detail.

I have been talking up the SCID idea since the early 70s. Mostly people have just hooted with derisive laughter. However, SCID-think is gradually catching on. The RADs, such as Visual Café and Inprise JBuilder, let you write code to control the properties of widgets on the screen by right clicking on visual elements to view the associated properties. You can tick off entries in pop-up listboxes and checkboxes or fill in the blanks. This is an important step away from thinking of programs strictly as linear streams of ASCII characters. Java Studio lets you view and write Java code by playing plumber -- visually connecting JavaBeans.

I think it is a case of the shoemaker's children having no shoes. Programmers, in creating source code as linear text files, do the equivalent of keeping their accounting books with a CP/M WordStar text editor. We would never dream of handing a customer such error-prone tools for manipulating such complicated cross-linked data as source code. If a customer had such data, we would offer a GUI-based data entry system with all sorts of point and click features, extreme data validation, and the ability to reuse that data, view it in many ways, and search it by any key.

Once you have your program pre-parsed, you can display the program in a variety of ways. Here are just a few examples:

JavaDOC tools

iDoc does a reasonable job of proofreading JavaDOC, but maintaining JavaDOC is very tedious. Here are some tools you might write.

Linkcop Clone

Linkcop is a program that wanders its way through your website checking internal links and optionally checking external ones.

Your clone would be a little smarter than the original.

When a link automatically takes you somewhere else, your clone should correct your original HTML to point directly to the new location. It has to be a bit clever. You don't want to replace valid URLs with 500 character CGI references that will change the next minute.

When a link is broken, your clone should try to fix it for you. For example if "http://oberon.ark.com/~Zeugma" has disappeared, it should try "http://www.zeugma.com" and "http://www.zeugma.org". Failing that it might be able to make some guesses by combing the URL names in some search engine results. It marks its corrections with a special GIF. Someone can then manually check these corrections out.

You should be able to give it a list of pages to check (or wildcards), a list of pages to avoid (or negative wildcards), and whether you want the indirectly-linked pages also checked (/I) and whether you want subdirectories checked (/S). So for example, you might want it just to do a quick check of all your offsite amazon.com links, checking internal links only enough to effect that, i.e. not checking any gif links, internal # links, or other offsite links. It takes a long time to manually research a dead link. You don't want to be told again about dead links you already know about.

You should be able to control it completely either from the command line or from a GUI.

Before declaring a link dead it should probe it several ways:

  1. First with HEAD. This is the most efficient since the website returns just the header, not the whole document.
  2. Next with GET. Some sites won't return just headers. You have to ask for the whole page. You can abort once the header portion arrives.
  3. Try polling again after all the other pages have been polled. This may give the site time to clear a temporary backlog or hardware problems.
  4. Optionally just put the non-responding sites in a list to be automatically rechecked on several successive days. Only if URLs stay dead, do you bother notifying the user about them. Of course, if a site is up, but there is no such document, there is not much point in retrying or delaying notification of the problem.
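Probes 1 and 2 might be sketched with HttpURLConnection: try HEAD first, and fall back to GET when the site refuses HEAD. The timeouts and the 405 (method not allowed) check are my own assumptions about sensible defaults.

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class LinkProbe {
    /** Returns the HTTP status, or -1 if the site cannot be reached at all. */
    static int probe(String url) {
        int status = request(url, "HEAD");
        if (status == -1 || status == 405)   // site unreachable or refuses HEAD
            status = request(url, "GET");    // ask for the page instead
        return status;
    }
    private static int request(String url, String method) {
        try {
            HttpURLConnection c = (HttpURLConnection) new URL(url).openConnection();
            c.setRequestMethod(method);
            c.setConnectTimeout(5000);
            c.setReadTimeout(5000);
            int code = c.getResponseCode();  // header arrives; body is never read
            c.disconnect();
            return code;
        } catch (Exception e) {
            return -1;                       // dead host, timeout, bad URL...
        }
    }
    public static void main(String[] args) {
        // Port 1 on localhost should have nothing listening.
        System.out.println(probe("http://localhost:1/") == -1);  // true
    }
}
```

Probes 3 and 4 would simply re-queue URLs that returned -1 and retry them later or on successive days.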

Lint for Java

LINT is a program for C/C++ that points out potential errors in source programs, things that are formally syntactically correct, but which might do something different from what the programmer intended. We need something similar for Java. Here are some things it might warn you about:

Localisation Tool

There are two basic approaches to localisation (preparing a program in different languages or for different countries).
  1. Hand the program over in its entirety to some speakers of the native language, and let them adjust it -- Strings, layouts, colour schemes etc. When you issue an update, you do the whole process essentially from scratch.
  2. Code every string in the program as a call to a function taking a meaningless integer string id that selects the localised string. The code is bulky and unreadable.
I propose a hybrid of the two approaches. For your application, you require a database of strings and a tool for both programmers and translators to update the database. The database essentially consists of various translations of each String, all tied together via a computer-assigned ID.

There are several special "languages":

  1. programmer: the incoherent English text with atrocious spelling originally written by the programmer.
  2. notes: where the programmers can put notes to help translators.
  3. questions: where the translators can leave questions for the programmers about the precise meaning of the strings that need translation.
There are additional fields: Programmers manipulate the database primarily by changing the Java source code. A custom parser periodically runs through the Java code finding new strings, and finding ones that could be discontinued.

Translators primarily manipulate the database with a GUI that lets them rapidly find non-up-to-date string translations in a given language. Various translators can take copies of the database, work separately and later merge their works.
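A minimal sketch of such a string database, with the special "programmer" language as the fallback (the class and method names here are invented, not part of any real tool):

```java
import java.util.HashMap;
import java.util.Map;

public class StringDb {
    // id -> (language -> translation); "programmer" holds the original text.
    private final Map<Integer, Map<String, String>> db = new HashMap<>();

    public void put(int id, String language, String text) {
        db.computeIfAbsent(id, k -> new HashMap<>()).put(language, text);
    }

    /** Look up a string by id, falling back to the programmer's original wording. */
    public String get(int id, String language) {
        Map<String, String> translations = db.getOrDefault(id, new HashMap<>());
        return translations.getOrDefault(language, translations.get("programmer"));
    }
}
```

A real tool would layer the translators' GUI and the merge logic on a proper database; this in-memory map just shows the id/language keying.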

When it comes time to ship a localised application there are several possible techniques, all of which work off the same database of Strings.

Special problems:

Mailreader/Newsreader

The problems with existing mailreader/newsreaders are: In the new scheme, all mail is compressed, digitally signed and encrypted. All newsposts are compressed and digitally signed. Even completely anonymous posts are signed with a special anonymous digital signature, and even noms-de-plume are digitally signed. All transmission is 8-bit. The whole business of exchanging and verifying keys, encryption and decryption, compression and decompression is totally automatic and transparent. The user is totally unaware of it.

You can track a piece of mail just the way you can track a parcel, to discover if it has been received at the recipient's ISP, at the recipient's computer and at the recipient. You don't have to do anything active to track mail. You just see little status icons changing. If a piece of mail is overdue, it will show up in a way that attracts your attention.

You can enclose programs with your email or posts. They don't actually necessarily get sent until the recipient asks for them. With one click the recipient can install that program.

Messages can contain all manner of HTML features. All browsers of this style of EMAIL can handle that, so there is no need to track whether the recipient can accept HTML.

Like ICQ, someone cannot send you mail without your prior permission. They can't send you mail because they don't have your public key to encrypt the mail. You receive ONLY encrypted mail. They can only send you a tiny request-permission-to-send message.

Periodically changing your public key, and sending the new key only to those you want to continue receiving mail from, effectively cuts off any pests you may have unwisely given your key to. The basic idea is that no one can send you stuff you don't want. It does not even get into your ISP's mail server. Spammers and other pests cannot spend your datacommunications resources or your time.

Your address book automatically updates. When you move, change ISPs etc, everyone in your address book authorised to send you mail gets informed, and their address books automatically update. Eventually the scheme would be used to automatically inform all the magazines you subscribe to of your new mailing address.

Whatever information you choose to broadcast automatically stays up to date in other people's address books, including possibly personal details like height, weight, children's names, credit card numbers or bank account numbers. You would disclose different amounts of this information to different people. Once you mark it disclosable, it would stay up to date automatically in everyone else's address book.

Attribution (quoting and tracking who said what) would be handled technically by not embedding quotes in messages. Instead a reference to the original message and an offset/length would be embedded in the message. When the viewer expands this, it would look like an ordinary quotation, but with guaranteed accuracy and a guaranteed accurate author. There is no way to put words in another's mouth or ascribe them to the wrong person.
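The quote-by-reference idea might be prototyped like this (a toy in-memory store; a real reader would fetch the referenced message from the message base):

```java
import java.util.HashMap;
import java.util.Map;

public class QuoteStore {
    private final Map<String, String> messages = new HashMap<>();

    public void store(String messageId, String body) {
        messages.put(messageId, body);
    }

    /** Expand a quote given as (message id, offset, length) into the exact original words.
        The quoted text cannot be altered because it is never copied into the quoting message. */
    public String expand(String messageId, int offset, int length) {
        String body = messages.get(messageId);
        return body.substring(offset, offset + length);
    }
}
```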

You can file your mail and newsgroups, tagging them, possibly automatically, with many possible keys. You don't need a strictly hierarchical system. There is an SQL database lookup so you can retrieve messages without having to linearly search the entire message base.

Native Class Amanuensis

Let us say that you wanted to write a native class to access the volume serial number of a drive. This is just a couple of lines of assembler. However, to glue that code into Java would require a major intellectual effort. You would have to learn JNI and Microsoft RNI, and all the steps. 95% of what you read would be irrelevant to your tiny task. By the time you did another such project, it would all be a haze, and you would have to start your learning from scratch.

I suggest writing an amanuensis that generates all the glue code, control files, headers etc., and a set of instructions on just what needs to be done manually.

You tell the Amanuensis the name of your native class, the methods and signatures, the native platforms you want to support, and it generates all the necessary bubblegum, including some C and assembler skeleton programs that just access the parameters, create a dummy return value and return. They are heavily commented, so that most C or assembler programmers would be able to just look at those comments, flesh out the real code and go, without any understanding of JNI/RNI.

To start, pick just one platform, say Win95 and just JNI, and generate only the C skeleton. Design your classes so that you or others can plug in delegates later to handle other variants.
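The Java side of the generated glue might look like this; the class, method and library names are all hypothetical stand-ins for what the amanuensis would emit, alongside the commented C skeleton:

```java
// Hypothetical Java-side stub the amanuensis would generate.
public class VolumeInfo {
    /** Implemented in the generated C skeleton via JNI. */
    public static native int getVolumeSerialNumber(String drive);

    static {
        try {
            System.loadLibrary("volumeinfo"); // library name is an assumption
        } catch (UnsatisfiedLinkError e) {
            // No native library on this platform; calls to the native method will fail.
        }
    }
}
```

The matching C file would contain an empty function body with comments telling the C programmer exactly which parameter is which.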

Password Protector

This is a sort of software wallet for passwords. I have dozens of passwords and CD-keys. There is no way I can remember them all. It is not wise to have them all recorded in a flat file anyone could snoop at, or on a piece of paper in a desk drawer. This program remembers your passwords, and stores them in a secure encrypted way.

The program has two types of entry: regular, and secure. In regular mode the passwords/CD-keys are displayed. You never have to type blind. In secure mode, password/CD-keys are never displayed on the screen, even when you are typing them. You have to type passwords twice to be sure you typed correctly. Every field has its own radio button to determine type.

It works like this. You make up a new entry: give it a label, say "Sun Developer Site Access". Then you can create additional fields and name them, e.g. login, password, and fill in the blanks. You can select the most common types of field by radio button: login-id, password, CD-KEY. Beside each field is a radio button for selecting display/hide i.e. regular/secure.

You can click on an entry and have it sucked up into the clipboard for pasting elsewhere. Ideally you would co-ordinate with Clipmate, or use a hot key, or do something very clever so that you could have a PowerPaste that pastes each field of the entry in succession.

The program encrypts the entire password file, and makes sure no unencrypted copy of it is ever left unwiped on disk. You gain access to the file with a common pass phrase. If you forget the passphrase there is an optional hint, hopefully composed cleverly to help you remember, but not enough to give a snoop any assistance.

Note that the passphrase or its HashCode is never recorded anywhere in the file. To ensure snoops cannot scavenge, remember to wipe (clear to 0) any array or temporary file containing unencrypted data or the passphrase before exiting, or after a time delay of inactivity.
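The wiping rule above is easy to follow if secrets live in arrays rather than immutable Strings; a minimal sketch:

```java
import java.util.Arrays;

/** Clear any array that held the passphrase or unencrypted data
    as soon as it is no longer needed, so snoops cannot scavenge it. */
public class Wipe {
    public static void wipe(char[] secret) {
        Arrays.fill(secret, '\0');
    }

    public static void wipe(byte[] secret) {
        Arrays.fill(secret, (byte) 0);
    }
}
```

This is also why the passphrase should be read into a char[] rather than a String: a String cannot be overwritten.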

How do you do the encryption? I propose four possible techniques. You might like to think up others:

  1. Weak encryption: Take the HashCode of the passphrase and use it to seed the random number generator. Use the random generator to generate a stream of bytes. XOR these with the file to either encode or decode it. See Random Numbers in the gotchas essay.
  2. Stronger encryption: Bone up on the PGP (Pretty Good Privacy) private/public key algorithm. Use the PGP public key to encrypt the file and the private one to decrypt it. There may be an API for the code in PGP itself. Do some digging.
  3. Read up the "security" entry in the Java glossary. Use one of the more advanced encryption techniques such as Blowfish, whatever you can find code for.
  4. Check out JCE, the Java Cryptographic Extension API spec from SUN, with implementation for USA from SUN and one for the rest of the world from ABA.
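Technique 1 might look like this. As the name says, it is weak encryption, shown for study only; the same call both encrypts and decrypts:

```java
import java.util.Random;

public class WeakCrypt {
    /** XOR the data with a pseudo-random byte stream seeded from the
        passphrase's hash code. WEAK: trivially breakable, for study only. */
    public static byte[] crypt(String passphrase, byte[] data) {
        Random stream = new Random(passphrase.hashCode());
        byte[] out = new byte[data.length];
        for (int i = 0; i < data.length; i++) {
            out[i] = (byte) (data[i] ^ stream.nextInt(256));
        }
        return out;
    }
}
```

Because XOR is its own inverse, applying crypt() twice with the same passphrase restores the original file.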
The program would have a simple integrated backup/restore that does nothing more elaborate than copy the encrypted file to or from floppy.

If you have access to a Java powered hand-held unit, you might develop this as an application for it. Ideally it could send the codes into apps via a special keyboard intercept attachment or an IR beam. It might even form the basis of a general lock system to handle all IR activated locks. You could invent a cheap IR lock that sends a challenge phrase that needs to be encrypted and sent back. The handheld unit contains all the public and private keys necessary to deal with all the locks in your life. It might even be useful to simulate something like a car remote to prevent children playing with the remote from accidentally starting the car in a closed space and gassing the family to death.

Prebranded Software rental with auto updates

When you buy software online, you don't have to type in your name, address and shipping information every time. The idea is to create a scheme so attractive and so cheap that all vendors will adopt it. See my essay on Internet Futures for more information on how automatic updating works.

Software arrives prebranded (exactly where the branding occurs is immaterial; the important point is that the user takes no manual action to brand). Normally software will arrive in two parts -- a large, publicly and freely accessible file of encrypted program and data files, and a private custom-built small key file containing branding information, a key to unlock the application, and crucial parts of the application code. The key file would be encrypted with the customer's public key.
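The key-file sealing step could be sketched with the standard javax.crypto API. The field contents below are invented, and a key file bigger than one RSA block would in practice be sealed with a hybrid scheme; this only illustrates the public-key idea:

```java
import javax.crypto.Cipher;

public class KeyFileCrypt {
    /** Seal a small branding/key file with the customer's RSA public key;
        only the customer's private key can open it. */
    public static byte[] seal(byte[] keyFile, java.security.PublicKey customerKey) throws Exception {
        Cipher rsa = Cipher.getInstance("RSA");
        rsa.init(Cipher.ENCRYPT_MODE, customerKey);
        return rsa.doFinal(keyFile);
    }

    public static byte[] open(byte[] sealed, java.security.PrivateKey privateKey) throws Exception {
        Cipher rsa = Cipher.getInstance("RSA");
        rsa.init(Cipher.DECRYPT_MODE, privateKey);
        return rsa.doFinal(sealed);
    }
}
```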

If anything should go wrong with the download, the software, without any manual assistance, can restart it. The software is automatically installed once it arrives without asking any questions. If any special questions are needed, you answer them before you even start the download. Once you hit "PURCHASE", you can go for coffee, and come back to a fully installed package.

Updates automatically arrive and are automatically applied unless you explicitly freeze the system. Presume a 24-hour ADSL Internet connection.

There is a wallet scheme (gas tank), so you can rent programs by the hour, month etc. There should be flexible payment schemes, to also include support, purchased à la carte or in bundles. There should also be provision to pay by usage -- e.g. documents printed, created, scanned, etc.

The common wallet means there is no need for frequent financial transactions. The wallet sends information to a central clearing house periodically about your usage. You get one entry a month to your credit card, even if you have rented/purchased dozens of packages.

You can monitor the gas consumption of any program at any time. Your wallet will warn you if you have inadvertently signed up with a guzzler.

Assume vendors will want to trash their competitors' installations. Assume people will try to rent programs without paying, or pirate purchasable software. Figure out how to foil them, in a way that has zero inconvenience for the legit user.

Consider how unique electronic serial numbers on CPUs might make your job easier. Consider how an I/O board with a defective RAM chip on it might serve as a unique id. Consider how a CPU with a private/public encryption key could be used.

Eventually the scheme might be used for all manner of electronic commerce such as paying bills, buying toys, wiring people money overseas, paying for an infinite jukebox, getting stuff printed or typeset on a fancy nearby printer in the basement of your apartment building or at the nearby instaprint shop, music channels, digital TV, digital movies...

Advantages to Renting Rather Than Buying Software

  1. Software sales means the vendor is primarily concerned about sucking in first time buyers with flashy but useless featuritis.
  2. Software rental means the vendor is primarily concerned with keeping his existing customers happy by providing tools that work smoothly and efficiently.
  3. Technical support is more reliable and less expensive since every customer automatically has the latest version.
  4. With rental, I can afford to have thousands of programs on tap that I only rarely use.
  5. Software rental with automatic updates makes piracy much harder. The constantly updating programs are a moving target, and there is always a danger the pirate will expose himself if products automatically go out seeking updates. Every program can be booby trapped in a million ways to stop working if it is not frequently updated. The booby traps can change every month. Public key encryption techniques can be used that will make piracy almost impossible. It will get even more difficult for pirates when everyone has a 24-hour connection to the Internet. Less piracy means a bigger market. A bigger market means more competition. More competition means lower prices for non-pirates.
  6. With rental, I can thoroughly try out programs over a long time without having to commit to them.
  7. I can cheaply drop a poor vendor at any time and switch to a competitor. This gives me clout with the vendor.
  8. Delivering and installing updates is 100% the vendor's problem. I don't have to waste any of my precious time with it. If he mucks up, I get another vendor.
  9. Delivering software and updates electronically is much cheaper than using retail distribution channels. Eventually this will mean lower prices.
  10. With piracy all but eliminated, and try before you buy very easy to administer, and automated updates reducing the support burden, smaller software development companies will be able to compete more effectively with the bigger companies that currently rely on expensive print advertising and mass retail outlets. This means more competition, higher quality products and lower prices.
  11. Without any paper work or computer busywork you could install 40 extra copies of an application to help you through a rush job.
  12. Payment can be very flexible, based on usage, time, flat rates, bulk discounts etc. You don't have to plan far in advance the way you do with purchases. You can make adjustments on seconds' notice.
  13. The vendor has a steady source of rental income to pay for upgrades and bug fixes. He does not need to provide them out of the goodness of his heart. If he does not provide them, he stops getting paid. Currently, vendors that take the money and run are financially most rewarded.
  14. Consider the analogy of health club memberships. You get a much better service if you pay by the month than if you pay all up front for a lifetime membership.

Regex Proofreader

If you don't know what regular expressions are, have a look in the Java glossary. Basically they are search/replace patterns.

If you have ever used Funduc Search/Replace or SlickEdit or any program using regular expressions, you know what a beast they can be to proofread. You are often quite amazed by what they do to your files when you turn them loose.

The main problem is quoting. Does < mean the literal character < or is it a command? In Funduc Search/Replace it is a literal in the search argument, but a command in the replace argument. Arrgh! Every implementation of regexes uses a slightly different set of commands and command characters.

If a character is reserved as a command, when you want to use it literally, you must precede it with a \. This turns expressions into unreadable nightmares like this:

A very simple regex proofreader simply colour codes each letter as a literal or as a command. For example in Funduc search expressions the following characters need to be preceded with \ when used as literals: and the following characters in replace expressions need to be preceded with \: You must not precede any other literals with \.

You might display command characters in red and literal characters in blue, instead of like this:

like this:

A slightly cleverer proofreader would let you hide the \ characters (except those preceded by another \) and just display the raw colour-coded literals. You could think of this as unquoting. The unquoted regex expressions would then look much more like the actual strings they are intended to match. The expression may then look like this:

What then would you do with tab=\t, cr=\r and newline=\n? You might cook up some special glyphs to represent them more literally, much the way MS Word can be persuaded to make visible the spaces, line and paragraph ends with special symbols, like this:

. You might just use the letters t, r and n but display them in green. If you can view them that way, why not type them that way? Instead of worrying about which letters need to be quoted, just use ctrl-R to load your "pen" with red ink, ctrl-B to load it with blue ink, and ctrl-G to load it with green ink -- or tap one of three coloured inkwells with your mouse, as if dipping an old goose quill pen. Just type your literals naturally and behind the scenes, your Applet will insert the \ characters as needed. If you typed an invalid command in red-command mode, it would just beep at you. You could even use the Enter key to more directly represent \n or the \r\n pair.

Everything so far is pretty easy. You don't need to know anything much about regex expressions to write such a simple proofreader. The only problem is Java does not support rich text -- multiple font colours in a TextArea. Unfortunately you can't handle the colours by generating HTML <FONT COLOR= commands and having something render it. You need to handle your colours at a lower level -- like a Canvas.

For a more advanced proofreader, things get a little tougher. The user should be able to submit some sample strings and see what the regex would do to them. Do they match the pattern? What do they get converted to? You have to simulate Funduc Search/Replace, SlickEdit or whatever other regex engine you are proofing for. If you dig about the web, you may find source code to do this for you that you could cannibalise. You have one advantage over the authors of the commercial regex programs: your parsing and scanning code can be very slow and no one will mind. The user should be able to maintain little libraries of test strings that he can quickly test out. The before and after views should use colour to highlight the changes.

Once you have that working, you can now try something even tougher, generate a sample set of strings to feed to the regex that exercise each part of the regex expression. Some should match, some should not. By examining the results on that test set of strings, the author of the regex expressions should have a pretty good idea of all the things that regex will do when it is turned loose in the real world on files.

If your regex engine is sufficiently fast, you might consider turning your proofreader into a full blown search/replace engine, that can run scripts of commands or accept them one at a time via a GUI. You might also consider writing your own text editor with this as the crown jewel.

If you want to tackle a simplified version of this project, try creating a proofreader for Java source code string literals, so that instead of seeing "C:\\MyDir\\Myfile.txt" you would see "C:\MyDir\Myfile.txt", and "abc\ndef" would display with a real line break instead of the \n escape.
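The simplified version reduces to an unescaper plus a display widget. Here is a sketch of the unescaper, handling only the commonest escapes:

```java
public class LiteralView {
    /** Render the body of a Java String literal the way the runtime will see it.
        Unknown escapes are passed through untouched. */
    public static String unescape(String literal) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < literal.length(); i++) {
            char c = literal.charAt(i);
            if (c == '\\' && i + 1 < literal.length()) {
                char next = literal.charAt(++i);
                switch (next) {
                    case 'n':  out.append('\n'); break;
                    case 'r':  out.append('\r'); break;
                    case 't':  out.append('\t'); break;
                    case '\\': out.append('\\'); break;
                    case '"':  out.append('"');  break;
                    default:   out.append('\\').append(next); // leave it alone
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```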

Reimplement Standard Classes

This idea came from Mark Thornton. Reimplement java.lang.ThreadLocal so that synchronization only occurs when initially creating a value instance and when cleaning up old values (associated with dead threads).
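One way to approximate that goal with the modern concurrency utilities is sketched below: reads are lock-free, and locking happens only when a thread's value is first created or during cleanup. This is an illustrative sketch, not a drop-in replacement for java.lang.ThreadLocal:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class CheapThreadLocal<T> {
    private final ConcurrentHashMap<Thread, T> values = new ConcurrentHashMap<>();
    private final Supplier<T> initial;

    public CheapThreadLocal(Supplier<T> initial) {
        this.initial = initial;
    }

    public T get() {
        // computeIfAbsent locks only while inserting a missing entry;
        // subsequent gets by the same thread are lock-free reads
        return values.computeIfAbsent(Thread.currentThread(), t -> initial.get());
    }

    /** Clean up values associated with dead threads. */
    public void cleanUp() {
        values.keySet().removeIf(t -> !t.isAlive());
    }
}
```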

Or more generally reimplement just about any standard class for greater efficiency, while of course retaining the same public interface. Try running a source code debugger tracing the actions of any standard class. Often the needless overhead will become clear. Profilers will also help.

Read Bill Wilkinson's Deja News posts from comp.lang.java.* over the last year on the subject of serialisation. Reimplement serialisation in a way that would please Bill.

Devise a test suite to prove your classes behave the same as Sun's, or that clarify any deliberate differences, and to benchmark performance. Post on comp.lang.java.programmer and send copies to Sun.

SAX File Transfer Protocol

SAX is a more efficient file transfer protocol than HTTP or FTP, which are based on TCP/IP. TCP/IP spends half its time acking every packet, which wastes bandwidth and slows things down. SAX is not an acronym. It was originally coined as just a mix of FAX and SEX, since it was intended to replace inefficient FAXing of documents with a sexy new protocol. Here are the key features of the SAX protocol. I will leave it up to you to fill in the details. The protocol itself is much simpler than other protocols:
  1. You can think of SAX as a sliding window protocol, like Kermit or Zmodem, where the window is the entire file being transferred.
  2. SAX does not ack individual packets. It only acks every 60 seconds or so as an are-you-there? I'm-alive heartbeat and when file transfer is complete. Either end can ask the other at any time "Are you still alive?" or "What is your view of the work still to be done?"
  3. Packets are fixed length. They carry with them the record number where they fit in the file. You multiply the record number by packet data length to get the offset where they fit.
  4. Every SAX packet is self-identifying - where it fits into the file being transferred. Packets may arrive out of order, or be repeated without any damage. Good packets are never discarded the way they are in most other protocols.
  5. Sender and receiver maintain bitmaps of which parts of the file they think have been successfully sent. When a sender sends a packet, it marks that chunk done. When the receiver receives a packet successfully it marks that chunk done. When the sender receives an ack, it marks unreceived chunks as undone. The file transfer is complete when the receiver acks with an empty to do list.
  6. An ack is a list of all the parts of the file that have not been received yet. The file transfer is over when the receiver's ack says all has been received, that there is no more work to do. The ack consists of a complete list of the parts of the file that still need to be transferred according to that end, given as a list of ranges and individual chunk numbers.
  7. The SAX protocol must be built atop IP or UDP, not TCP/IP sockets, to avoid the automatic acks on every packet. Have a look at the java.net.DatagramPacket and java.net.DatagramSocket classes.
  8. Any packets received with bad checksums are just ignored. The Datagram classes should handle creating and verifying checksums and discarding bad packets for you.
  9. SAX is restartable in event of disconnect. The receiver tells the sender which bits of the file have not arrived yet. The file transfer starts the same way. The receiver keeps its bit map in the event of disconnect. The server rebuilds its bit map from the list of unfinished work the receiver sends it to restart the transfer.
  10. If the file is not already compressed, it is compressed prior to starting the transfer. Compressed files have a jar, cab or zip signature in the first few bytes. If the file used some unknown compression, SAX would recompress/decompress it, which would do no harm other than waste a few CPU cycles.
  11. SAX can also be used to transmit a stream of data, but it requires that a log be kept back to the last checkpoint.
  12. Pacing, that is avoiding sending packets so fast that many are lost, or too slowly, is not strictly part of the protocol. The sender can be clever, if it wants, by requesting an ack from the receiver. If it notices a pattern of lost intermediate packets, it can lower its sending rate. If it sees a pattern of perfect transmission, it can raise it. Perfect pacing speed is not necessary. It is free to experiment on the fly with different sending rates to see what works best. It may optimise for speed or efficiency (ratio of packets received to sent) or some blend. It may take into account its own load from other customers.
  13. Packets in the pipeline cause a complication. When the sender requests an ack, the receiver sends back its state. However several packets could be in the pipeline on their way to the receiver. When the sender gets the ack, it will look as if those packets had been lost. The protocol per se has nothing to say about this problem. However, how can the sender deal with it?
  14. If your acks don't get a response within X seconds you retry a few times, then give up and start a new connection. For heartbeat acks, where you are primarily just making sure the other end is still alive, you keep sending data packets while you await the ack. The sender might send heartbeats even when for some reason it was too busy to send packets, just to keep the channel alive. It might send heartbeats at a higher packet priority status.
  15. You might consider piggybacking your ack requests on an outgoing data packet.
  16. You will probably want to implement this as a connection object with two associated threads, one to handle sending packets, ack requests and acks, and another thread to handle receiving them. You might as well make both ends symmetrical, capable of sending a file in either direction.
  17. Have each connection object deal with transferring only one file. If you need to send more than one file, use multiple connection objects, that may or may not run in parallel.
  18. How big should the packets be?
  19. How does the process end? You don't want to get into an endless ack the ack sequence. In the very worst case, you can recover from anything pathological when the receiver restarts the transfer.

    The sender will be the first to discover it has sent all the packets, (simply by maintaining an unsent packet count). Then it sends its status along with a request for status. If it does not hear back, it repeats this and eventually gives up, then throws away its bit map. Normally it will either get a request back for some missing packets, or an all-done status in which case it can shut down the channel.

    The receiver will eventually notice that it has received all packets. It sends its status then shuts down, and throws away its bit map. It does not really matter if that final status packet got through. The sender will give up soon enough.

    If the receiver is expecting packets and does not hear anything, it can prod the sender with a status packet, telling it of the packets it is still waiting for. If it still hears nothing, after several attempts, it shuts down the channel, but keeps its bit map, and restarts the entire process, perhaps after some long delay. If at any time the channel goes dead, the sender just gives up and forgets the whole thing, and the receiver must restart, sending its status describing the chunks it still wants.

    Because of the way the protocol works, in theory the receiver might be able to request just a sampling of the file. It would fake a restart, specifying just the chunks it wanted. To the sender it would look like a restart with a few missing packets.

  20. SAX might implement a total file hash as a double check on the individual packet checksums, but it would not recheck each packet with a second checksum. The file level checksum would have to be defined in such a way that it could be computed as the chunks arrived in higgledy-piggledy order. Either end could request a checksum of the other and any range of chunks. This feature could be used by a clever strategy routine to narrow down where the damage is and repair a damaged chunk that was corrupted in transmission but fluked out and passed the packet checksum test.
  21. You may find it simpler to organise the protocol as several intercommunicating threads, each with a single purpose. E.g. one thread may just send packets. Another might just receive them. Another might keeps track of whether the other end is still alive. Another ensures the other end knows we are still alive. You may find the logic of the protocol is thus simplified.
  22. Once you get the raw transfer going, add this wrinkle. With each file send a serialised descriptor object. The object contains fields such as:
  23. SAX dates back to 1985 when I hired a guy to implement a SAX-based EMAIL system based on dial up phone lines and the X.25 packet net. I was trying to figure out how you communicate over long distance telephone satellite links or packet nets where you have a very long ack turn-around delay. I was also concerned about how you communicate with someone in Africa in the field using primitive non-error correcting modems and noisy local phone lines. SAX seems a good fit with the Internet and its long ack times caused by packets pinging from hop to hop like balls in a pin ball machine. Implementing SAX atop UDP should be even simpler than implementing it directly via modem since you don't have the problem of resynching after an error to find the beginning of the next packet, and you don't have the problem of reserving magic control characters that you must somehow quote when they occur in data.
  24. FTP has a problem if you want to maintain three mirror sites. You must upload a file three times over your narrow-band modem connection to the Internet. You should need to upload the file just once to one of the servers, and then orchestrate a direct server to server transfer to copy to the other two servers. All you should need is the necessary read/download and write/upload privileges at the servers. Ideally, this would happen transparently. You would just tell SAX you wanted a file sent to three places, and it would figure out the fastest way to do it. FTP can do this now, but only if you have the rarely given shell script Telnet privilege. FTP, in theory, can even handle such a third party transfer within the protocol itself, but the feature is rarely implemented. SAX would not require shell privilege, so it could be used by ordinary folk. On a cold night, years from now, you might even think about how you could use multicast packets to send to more than one site all at once.
  25. Eventually you want to encapsulate SAX so that it is as built-in as FTP. Instead of saying "ftp://" in your HTML you should be able to say "sax://" to download a file.
  26. FTP has a problem with conflict between uploaders and downloaders. If somebody is downloading a file off my website, I can't upload a new version of it. I have to wait until he is finished downloading. If the file is popular, I may have to wait until 3 AM when there is a tiny window of opportunity when no-one is downloading it. Similarly, when I am in the process of uploading a file, no one can download it. SAX should be clever about this. When you upload a file that already exists, you upload to a temporary file, so you don't block people from using the old version. When the transfer is complete, you can do a rename (if the OS is clever enough to allow renames of open files), or go into a wait loop until the file is free to do the rename. You need a much smaller time window in which to do the rename than you need to upload the entire file.
  27. Writing a smart file uploader in Java (that used renames and wait loops to avoid conflicts as described in the previous item) could be a little project all on its own, using ordinary FTP instead of SAX. You would use the ftpClient class methods. The advantage is, you would not need to convince your servers to install the SAX protocol. The disadvantage is it would not be as fast.
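The rename-with-retry idea from the two items above could be sketched like this, using plain local file I/O rather than any FTP class. The class name SafeReplace and its signature are hypothetical, not part of any existing library:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class SafeReplace {
    /** Write the new contents to a temp file first, so downloaders of the
     *  old version are never blocked, then swap the temp file in with a
     *  rename, retrying until the target is free. Returns true on success. */
    public static boolean replace(File target, byte[] newContents, int maxTries) {
        try {
            File temp = new File(target.getParent(), target.getName() + ".tmp");
            FileOutputStream out = new FileOutputStream(temp);
            out.write(newContents);     // the long transfer happens here
            out.close();
            for (int tries = 0; tries < maxTries; tries++) {
                // The rename needs only a brief window when no-one holds the file,
                // far smaller than the window needed to upload the whole file.
                if (!target.exists() || target.delete()) {
                    if (temp.renameTo(target)) return true;
                }
                Thread.sleep(1000);     // wait for downloaders to finish
            }
            temp.delete();              // give up; the old version stays intact
            return false;
        } catch (IOException e) {
            return false;
        } catch (InterruptedException e) {
            return false;
        }
    }
}
```

The same shape would work server side in a SAX implementation; only the byte source changes.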
  28. The SAX protocol is very simple, or will be, once it is precisely nailed down. However, the strategy engine behind SAX can be as simple or sophisticated as you want. Therein lies the challenge that allows for true competition between SAX implementors. There are some pieces in the SAX protocol that don't appear to be used, such as the receiver being able to request the sender's bitmap status. These are for potential use by more sophisticated strategies and for troubleshooting/debugging.
Chris <chrisl@hamptons.com> has already got a prototype SAX protocol working.

SCROOM Netscape Navigator Plug-In

Have you ever been to a website that treated you badly? You vow never to visit that site again, but you keep finding yourself there because of hidden or misleading links to it on other sites.

SCROOM comes to the rescue. You give it a list of sites, both by name and by IP dotted quad that you never want to visit again. Whenever you accidentally click to go there, or a site tries to send you there, the request is blocked. The mechanism is very similar to how a Net Nanny would protect children from adult sites, but it has a quite different purpose. I'm told a commercial product called NetGranny has some logic like this in it. It might serve as a model.
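The core of SCROOM is just a blocklist lookup consulted before every request. A minimal sketch follows; the class name Scroom and the sample banned entries are hypothetical:

```java
import java.util.HashSet;
import java.util.Set;

public class Scroom {
    // Sites to avoid, listed both by domain name and by IP dotted quad.
    static final Set<String> banned = new HashSet<String>();
    static {
        banned.add("www.rudesite.example");   // hypothetical banned domain
        banned.add("10.1.2.3");               // same site by dotted quad
    }

    /** Returns true if a request to this host should be blocked,
     *  whether the user clicked a link or a site tried a redirect. */
    public static boolean blocked(String host) {
        return banned.contains(host.toLowerCase());
    }
}
```

A real plug-in would also have to resolve names to IPs so a banned site can't sneak in under an alias.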

Struct reader Amanuensis

Another boring task is writing code to read and write C structures using LEDataInputStream and DataInputStream. See the FileIO Amanuensis for the general idea. You would either:
  1. Feed it a C/C++ declaration and have it generate a set of Java class declarations, plus the read and write methods to process one struct.
  2. Check off the types and names of the various fields, creating a binary record descriptor. Use this to generate a C++ struct, the set of Java class declarations, and the read and write methods to process one struct.
  3. Use reflection to discover the structure of the Java class which may have embedded references. You need to create a fixed length record, so you must presume each reference field is not null, and you also need caps on the lengths of String fields.
Even if you can't get it perfect, and just throw up your hands at the problem of references and variable length Strings, you can still save a lot of the bull work.

A more efficient variant would generate code to read entire records into a byte array, then generate get/put accessor functions to read/write the various fields. The accessor would have to construct the binary value by shift and logical or of the bytes at the corresponding offsets in the byte array, using logic similar to that in LEDataInputStream and LEDataOutputStream.
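The shift-and-or accessor logic described above might look like this for a little-endian 32-bit field. The class name RecordAccess is hypothetical; the technique mirrors what LEDataInputStream does internally:

```java
public class RecordAccess {
    /** Read a 32-bit little-endian int at the given offset in the record.
     *  Each byte is masked to 0xff so sign extension can't corrupt the value. */
    public static int getIntLE(byte[] rec, int off) {
        return (rec[off] & 0xff)
             | (rec[off + 1] & 0xff) << 8
             | (rec[off + 2] & 0xff) << 16
             | (rec[off + 3] & 0xff) << 24;
    }

    /** Write a 32-bit little-endian int at the given offset. */
    public static void putIntLE(byte[] rec, int off, int v) {
        rec[off]     = (byte) v;
        rec[off + 1] = (byte) (v >> 8);
        rec[off + 2] = (byte) (v >> 16);
        rec[off + 3] = (byte) (v >> 24);
    }
}
```

The generated amanuensis code would emit one such get/put pair per field, with the offsets baked in.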

Super Compressor

Ordinary ZIP (Lempel-Ziv-Welch) compression depends on finding repetition and, roughly speaking, creating abbreviations for common words. This approach does not do well on small files. Here are some ideas for super compression.
  1. Imagine you had a dictionary of 64,000 words. You numbered them 1 to 64,000. Before you compress a text file, you look up each word and replace the word and its following space by a 16-bit binary number -- the word's index in the dictionary. You create specialized dictionaries for compressing/decompressing different kinds of files.
  2. Imagine the state of a compression engine just after it had finished compressing the entire JDK 1.2 Java source code. What if you preloaded that state just before compressing any Java source code file. You would then have a set of common Java abbreviations all ready to go. There would be no penalty for compressing a short file. You need to create loadable states for various types of files and distribute them along with the compression/decompression engine.
  3. Imagine you had fixed length records to compress. If you transposed the fields and records so that all the names, then all the addresses, then all the postal codes, then all the phone numbers... came in order, the compression engine could do much better because the repetition is clearer. Further you might recognize certain field types such as YYYY/MM/DD and replace them with 16-bit days since 1860 or some such.
  4. Write a compressor that uses information about compressing other files in the archive. This gets tricky. If you delete an item, you may have to adjust the remaining entries.
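Idea 1, the preloaded dictionary, can be sketched with a toy codec. The class name DictCodec and the five-word dictionary are hypothetical; a real one would hold up to 64,000 common words, each replaced (with its trailing space) by its 16-bit index:

```java
import java.util.HashMap;
import java.util.Map;

public class DictCodec {
    // Toy dictionary; a specialised one would be built per file type,
    // e.g. a Java-source dictionary full of keywords and common identifiers.
    static final String[] DICT = { "public", "static", "void", "class", "return" };
    static final Map<String, Integer> INDEX = new HashMap<String, Integer>();
    static {
        for (int i = 0; i < DICT.length; i++) INDEX.put(DICT[i], Integer.valueOf(i));
    }

    /** Encode a word as its 16-bit dictionary index, or -1 if unknown.
     *  Unknown words would be emitted literally with an escape code. */
    public static int encode(String word) {
        Integer i = INDEX.get(word);
        return i == null ? -1 : i.intValue();
    }

    /** Decode a dictionary index back to its word. */
    public static String decode(int code) {
        return DICT[code];
    }
}
```

Both ends must ship the identical dictionary, which is exactly the "loadable state" idea in point 2.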

Sort Visualiser

This is a tool for teaching people how various sorts work. It lets you watch them work. It also lets you see how well they work on already sorted data, or data that is sorted in reverse order. People who like watching Norton SpeedDisk will enjoy the hypnotic effect of this program.

You start with an array 400 elements long. You initialise it so that array[i] == i.

Then you scramble the array. You can do this by generating random pairs of integers and exchanging those two array elements. Do this 10,000 times. See Random Numbers in the Java glossary to see how to generate random numbers 0..399.

Then you display the array in a 400x400 square grid of pixels. If array[4] == 10 you plot a dot at x=4 y=10.
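The setup just described, sorted initialisation, 10,000 random exchanges, and the x/y plot mapping, can be sketched like this. The class name SortVisual is hypothetical; a fixed seed is used here only so results are reproducible:

```java
import java.util.Random;

public class SortVisual {
    static final int N = 400;

    /** Build the array so that array[i] == i, then scramble it with
     *  10,000 random exchanges. A scrambled array still holds exactly
     *  the values 0..399, just in shuffled positions. */
    public static int[] scrambled(long seed) {
        int[] a = new int[N];
        for (int i = 0; i < N; i++) a[i] = i;   // sorted start: a diagonal line
        Random rand = new Random(seed);
        for (int k = 0; k < 10000; k++) {
            int i = rand.nextInt(N);            // random index 0..399
            int j = rand.nextInt(N);
            int t = a[i]; a[i] = a[j]; a[j] = t;
        }
        return a;
    }
    // To plot: for each x in 0..399, draw a dot at ( x, a[x] )
    // in the 400x400 grid, so a sorted array shows as a diagonal.
}
```

When the sort finishes, every dot lies on the diagonal, which is what makes the animation satisfying to watch.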

There is source for RadixSort, HeapSort, QuickSort and ShellSort available to download.

You insert a call to your drawing routine inside the sort. The sort must call it every time it moves an element of the array. It might be easier to insert the call on every compare instead. Some sorts may make a copy of the array, or in some other way challenge you to extract the needed data to plot. You might even have to understand how the sort works!

Then you start your sort. Now you can sit back and watch the points skitter about like tiny soldiers racing into order in a straight diagonal line as the sort progresses.

If it turns out you need speed, instead of repainting the entire grid each time, compare the old and new array and notice the differences. Then adjust just the pixels that need it.

Make it an Applet so that people can play with it on the web. They should be able to choose which sort algorithm to use. You may need to insert a variable delay into your plot routine if the sort is too fast to see.

You might write some el-cheapo sorts that don't work well on a large number of elements to demonstrate why they should not be used. Write an insertion sort (one at a time it inserts elements, sliding all those above to make room) or a bubble sort that percolates the biggest element to the end via a series of exchanges. On each pass it ignores the previous biggest elements, and finds the next biggest. Write a scanner sort that finds the smallest element then copies it to another list, and either marks the original void, or slides elements above it down. Then it repeats.
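One of those el-cheapo sorts, instrumented with a callback hook for the visualiser, might look like this. The BubbleSort class and its MoveListener interface are hypothetical names for this sketch:

```java
public class BubbleSort {
    /** Hook the visualiser supplies so it can repaint after each move. */
    public interface MoveListener {
        void moved(int[] a);
    }

    /** Bubble sort: percolate the biggest element to the end via
     *  exchanges; each pass ignores the already-placed tail. Reports
     *  every exchange, which is exactly what the visualiser wants. */
    public static void sort(int[] a, MoveListener listener) {
        for (int top = a.length - 1; top > 0; top--) {
            for (int i = 0; i < top; i++) {
                if (a[i] > a[i + 1]) {
                    int t = a[i]; a[i] = a[i + 1]; a[i + 1] = t;
                    if (listener != null) listener.moved(a);  // repaint hook
                }
            }
        }
    }
}
```

The insertion and scanner sorts would take the same listener, so the visualiser treats all sorts uniformly.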

If that was not enough challenge, add this wrinkle. For people with big screens allow the size of the grid to be as big as the screen, or maybe even bigger and allow them to pan around on it. You might even display your scrambling process to give the users something to watch while you scramble the data.

Play with colour, encoding extra information into the dots using hue, saturation and brightness. See HSB in the Java glossary. You might use warmness of the colour to encode how close the dot is to its home position, or how frequently or how recently it has been moved.

There is a similar Applet in the Sun Tutorials. It uses lines instead of dots, which may be even easier to see. It also uses threads, so that you can run several sorts simultaneously, in a horse race. It also contains code for a bubble sort and the slightly improved bi-directional bubble sort. The bi-directional sort bubbles the biggest element to the top, then the smallest element to the bottom, then the next biggest element to the top etc., alternating directions.

Another thing your harness could do is experimentally test whether sorts are stable. You can't prove a sort is stable that way, but you can prove it is unstable. By stable I mean it does not disturb existing order in the event of a tie in keys. You can turn an unstable sort into a stable one by using position in the array as a minor key.
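The experimental stability check could work by tagging each element with its original position, sorting on the key alone, and then verifying that tied keys kept their original order. A minimal sketch, with the hypothetical class name StabilityCheck:

```java
public class StabilityCheck {
    /** Each row is { key, originalPosition }. After sorting by key only,
     *  a stable sort must leave original positions ascending within any
     *  run of equal keys; one violation proves the sort unstable. */
    public static boolean looksStable(int[][] sorted) {
        for (int i = 1; i < sorted.length; i++) {
            if (sorted[i][0] == sorted[i - 1][0]
                    && sorted[i][1] < sorted[i - 1][1]) {
                return false;   // a tie came out reordered
            }
        }
        return true;
    }
}
```

Passing this check on one input proves nothing, as noted above, but a single failure is conclusive.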

Stay In Touch Database

You might have one of these for your grad class so you can all stay in touch over the years. It is simply an online database where you can store such things as name, address, phone, email, website, photo then and now, and public digital keys. You might also store free form text about family, spouse, job, what's happening with you.

It might be public or only open to the participants. You can use the list for contacting for reunions or seeing who you know who might be involved in a business service you need.

It could have many uses: church and club registries, printing form snail-mail letters, sending bulk email. The key is it can be updated or searched online by those registered. It would have very strong edit checking to ensure no invalid phone numbers, postal codes, state/provinces etc. were entered.

Its main function is rounding everyone up for 10 year reunions. It has some status fields to help in that regard. Has the person been contacted? Are they coming? Who knows them who might be able to contact them? Who are they bringing? What food have they agreed to bring? Can they provide transportation? Do they need transportation?

People could provide information, but not make it generally public. You could send someone a short message even if they were unwilling to give out email or physical address.

I provide a service like this, all handled manually, for the Living Lovers of Planet Earth, people who were involved with the teachings of the late Ken Keyes.

It would most likely be implemented either with Java Applets running JDBC, or Servlets talking to a Java-less browser.

Submit-It Clone

There are various services such as Submit-It, Register-It and WebPromote that will submit your home page to a number of search engines for incorporation. They may charge $40 or more for the service. You might create a free version that runs as a Java application so people can do the work themselves. Basically it simulates a browser, and goes from search engine to search engine filling in forms. If anyone is interested, I have some sample code that fills in a form and submits it. The main work is finding all the search engines, researching the forms, and making sure the program does not submit an obsolete form. You need to maintain the program any time a form changes. A clever version may analyse generic forms and figure out how to fill them in based on its knowledge of individual fields. That way all you have to do is add code for new field types. The program automatically responds to changes in field order, or removal or addition of fields.

Systray Deleecher

This project is not particularly well suited to Java, but more to C++. However, you might tackle it with a combination of Java and C++ or Java and MASM to get some practice writing native methods via JNI.

In the beginning, when Microsoft created Windows, and saw that it was good, Microsoft made a rule. If you want a program to automatically start up, just put an icon for it in the STARTUP folder. Every vendor and his dog crammed the STARTUP folder with their useless little doodads. People simply deleted them to stop the apps from autostarting. Every once in a while, users would accidentally delete something they should not have. There was no easy way to recreate it.

Microsoft itself and later other vendors then started getting clever, finding obscure ways to make their programs automatically start -- ways that users could not defeat. Users had then to put up with all kinds of useless programs running all the time cluttering the systray or task bar, chewing up RAM, eating CPU cycles, and doing precious little useful to earn their keep.

I call these annoying autostarting programs leeches. Steve Crook coined the term as an improvement on my original term worms, which already had an established meaning. Your job is to create a deleecher that defeats all known strategies for autostarting. Your program should be able to remove any individual leech on request. When your program removes them, it should record how the leech used to be hooked into autostart, so it can be restored later at any time with just a mouse click.

I have recently learned there is a shareware deleecher already written that defangs nine different techniques for autostarting. It also allows you to temporarily disable autostarts without removing them entirely. It is called Startup Manager.

TweakDUN Clone

TweakDUN is a utility that looks at your WINDOWS\HOSTS. file in Windows9x or your winnt\system32\drivers\etc\hosts. file in NT and refreshes the IPs in it by looking them up via DNS. Your clone ensures that the entries for the IPs in your quick DNS look-up hosts. file are accurate. A hosts. file might look something like this:

IP domain comment
127.0.0.1 localhost #SKIP
# search engines
205.180.85.126 www.altavista.com
206.132.244.95 www.dogpile.com
198.3.98.99 www.excite.com
209.185.151.128 www.hotbot.com
209.224.233.192 www.InterNIC.com
209.67.228.51 www.lycos.com
# bix
199.93.4.67 x25.bix.com # alternate
199.93.4.127 bix.com # usual

To be compatible with TweakDUN, you should skip any entries marked with #SKIP, like the localhost line in the sample above.

Unlike the real TweakDUN, your clone should not behave like a bull in a china shop and remove comments or comment lines -- text to the right of a #. This utility may be written in Java but only has application on Windows9x/NT. There may be equivalent files on other operating systems. You might tidy the file so that all the domain names line up nicely. One simple way to do that is to put leading zeros on each field of the IP address to pad it out to 3 digits, though most people would probably prefer you not do that.
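The per-line refresh logic, including the #SKIP and comment-preserving rules above, could be sketched like this with the standard InetAddress DNS lookup. The class name HostsRefresh is hypothetical:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class HostsRefresh {
    /** Given one hosts-file line, return it with the IP refreshed via DNS.
     *  Blank lines, comment lines and lines marked #SKIP pass through
     *  untouched, and trailing comments are preserved, per the spec above. */
    public static String refresh(String line) {
        String trimmed = line.trim();
        if (trimmed.length() == 0 || trimmed.startsWith("#")
                || line.indexOf("#SKIP") >= 0) {
            return line;
        }
        int hash = line.indexOf('#');                 // keep any trailing comment
        String comment = hash >= 0 ? " " + line.substring(hash) : "";
        String body = hash >= 0 ? line.substring(0, hash) : line;
        String[] fields = body.trim().split("\\s+");  // [ ip, domain ]
        String domain = fields[1];
        try {
            String ip = InetAddress.getByName(domain).getHostAddress();
            return ip + " " + domain + comment;
        } catch (UnknownHostException e) {
            return line;                              // leave stale entry as-is
        }
    }
}
```

A full clone would read the file, map this over every line, and write it back, keeping the alignment tidy.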

Uncrackable Encryption

Uncrackable encryption for the masses. It works transparently with the email package. It uses CDs or DVDs distributed ahead of time to those communicating. CDs are recorded with truly random numbers generated by monitoring radioactive decay. It is an OTP (One Time Pad) system using a simple XOR of message and key, using a key equal in length to the message, and never reusing a key. See the encipher utility documentation for more detail on how the scheme works.
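The XOR core of the one-time pad is tiny; the hard parts are key generation and never reusing key bytes. A minimal sketch, with the hypothetical class name OneTimePad:

```java
public class OneTimePad {
    /** XOR the message with an equal-length slice of the pad, starting at
     *  keyOffset. Because XOR is its own inverse, the same call both
     *  encrypts and decrypts. The keyOffset must advance past every byte
     *  used, and those pad bytes must never be used again. */
    public static byte[] xor(byte[] message, byte[] key, int keyOffset) {
        byte[] out = new byte[message.length];
        for (int i = 0; i < message.length; i++) {
            out[i] = (byte) (message[i] ^ key[keyOffset + i]);
        }
        return out;
    }
}
```

In the real system the key slice would come off the pre-distributed CD, and both parties would track the shared offset.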

A more advanced version of the project requires you to build a peripheral that is a true random number source. This would be useful for creating one-time pads, and also in various simulations. There are several ways it could work.

In all cases, your problem is generating randomness fast enough, and making sure your clients don't sample so frequently they lose randomness.

Once you have this mechanism in place, you can build an uncrackably secure webphone. You use the standard Java microphone API to convert sound into numbers, compress them a little, encrypt the numbers and send them over the web in real time. On the other end you reverse the process, and play the sound. You logically have two channels, one in each direction. Keep in mind that snoops can still place hidden microphones at either end, or snoop on EMF radiation from your computer. It is best to use a laptop which emits fewer clues. However, snoops can't do anything simply by packet sniffing or intercepting and replacing packets. There is a slight delay as packets wend their way through the Internet, but not that different from a long distance call to Indonesia from Canada.

WebRing controller

WebRings are groups of like-minded websites. You can browse from site to site in a circle by clicking a button. You can go forward or back. You can jump to a random site. You can see a list of all the sites, and jump directly to any one of them. www.WebRing.org organises such rings for free.

There is a central controller CGI app running on a server that figures out where to jump to next when a given button is pressed. When the user hits next, prev, random or list, (usually on a custom CGI graphic map) the server forwards the user to the appropriate site in the circle. There are two major problems with WebRings as they currently exist:

  1. The HTML for linking at each site is not properly verified. The target URL should be precisely the spot on the page where the link control to the next site is. With the current unverified schemes, you may land at somebody's home page, and have to dig through his entire site to find the next link. (There might not even be one, if the site's owner made an HTML error, or dropped out of the ring without informing the Ring Master.) You may land on a page with links to 20 other WebRings on it. You have to find the link for the WebRing you are interested in. We need to automatically verify (and periodically reverify) that the HTML target URL and the link control HTML buttons at that target are properly configured. You need a precise target URL of the form: where "SuperRing" is the ring software scheme, "20978" is the server id, "7752" is the ring id, and "13413" is the site id. IDs are small integers evenly divisible by 17 (which acts as a simple checksum, yet allows ids (after division by 17) to directly index tables).
  2. Like the old serial Christmas tree lights, if any link in the chain goes down, the entire ring is broken. The central site needs to keep tabs on all the sites and remove ones (temporarily, and eventually permanently) that do not respond. Some rings have a button to bypass the next site to help get around the problem. However, that won't help unless the user knows to try that. It also requires there not be two dead sites in a row.
Your job is to write a servlet that runs a collection of WebRings, sending browsers on to the next link. You need to find low-cost solutions to the above two problems. You want the user interface extremely simple so that almost anyone could set up a WebRing server and run it with little time investment.
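The divisible-by-17 id scheme in point 1 above can be sketched in a few lines. The class name RingId is hypothetical:

```java
public class RingId {
    /** Ids are valid only if evenly divisible by 17, which acts as a
     *  simple checksum; the quotient then directly indexes the ring
     *  tables. Returns -1 for an invalid (mistyped or forged) id. */
    public static int tableIndex(int id) {
        return id % 17 == 0 ? id / 17 : -1;
    }
}
```

The servlet would reject any request whose server, ring or site id fails this check before touching its tables.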

There are four levels of organisation:


You can get an updated copy of this page from http://mindprod.com/projects.html