Student Java Projects
Last updated by Roedy Green
©1998-1999 Canadian Mind Products.
Introduction
Here are 49 ideas for student projects most likely involving
Java. Some ideas might be done by teams or a whole class. Some
may develop into commercial projects. You are free to take these
ideas and use them as you wish. I would appreciate hearing from
you if you decide to implement any of these ideas. I would also
like to hear any comments you would like me to add to the
descriptions, or ideas you have for projects. If you have such
project ideas on your own website, I would be happy to build
links here.
I have discussed most of the projects many times at length over
the last decades on www.bix.com. BIX has retained all postings in
archives. For most projects I have additional materials I can
provide you to give you hints on how to proceed.
A Tractable
Artificial Intelligence Problem
This is a fairly simple artificial intelligence problem that
mimics, in miniature, the problem computer programmers have
extracting specifications from users and turning those
specifications into code.
Have a look at The Learn
To Count Applet, particularly the source code. It contains
code to count out in words in various languages.
I wrote the class for each language by getting a set of example
numbers and words from a native speaker or a phrasebook or by
sampling Internet posts, e.g. 21 == "vingt et un" in
French, and deducing the rules for converting numbers to words
and writing the code. I would then print out a set of sample
numbers using the rules (selecting the samples to ensure I cover
all the rule variants), and check them with a native speaker. He
would then give me some more examples to explain the exceptions
to the rules he had forgotten about previously. I would then
convert those exceptions into rules, code them, and go round
again until the native speaker was happy. Sometimes the native
speaker would tell me the rules directly. Usually, these were less
useful and more ambiguous than two or three examples.
I discovered there were various patterns in the rules. Different
languages had various combinations of the standard patterns.
The process was fairly easy, quite mechanical, and allowed for a
fair bit of trial and error rather than demanding brute reasoning.
Could you automate the task of writing code for additional
languages?
Could you write a program that, shown just some counting examples
from a foreign language, could compose a program to spell out any
long in words? The answer is yes, you could write such a
program, no problem, though it would be "cheating". It
would just require a bit of patience to manually write the code
for each human written language, then recognize it, and cough up
the previously optimised code when handed a set of examples in
that language.
What we want is something a little cleverer, that could work on a
language it had never seen before, but that was a human language.
Again you could cheat, in a lesser way this time, by simply
cataloging all features of counting in all written human languages,
writing custom code to recognize each feature, and cranking out
code by turning on or off pieces of preformed boilerplate.
One way to attack the problem is to ask yourself, "How
could I rapidly generate, generalise and optimise the cheating
solution?" Another is, how could I solve the simpler problem
of just handling the 20 or so languages the Learn To Count Applet
already knows, or some subset of them.
So think of it as a contest. You will be handed a Java class,
that will accept a long positive or negative and will return you
a String for the corresponding number in words in the target
language. Your program will then, by probing the class for
examples, and making inferences, construct a program that
reproduces the one inside the black box to convert numbers to
words for the unknown language inside the class.
You get points for:
- Elegance (mathematically simple and ingenious)
- Generality (ability to handle various curves thrown at
it).
- Compactness, speed and beauty of both the code itself and the
code generated.
- How accurate generated code is.
- How few calls you required to the black box class to generate
examples. Only unique samplings count; there is no penalty for
asking for the same number more than once.
- If you want a slightly simpler problem, for fewer points,
assume you are given a bit map of interesting examples to start,
that cover at least one example of each of the quirks of the
language. You may still need to probe other numbers to test your
assumptions about the patterns behind those examples.
- You get bonus points if you can also generate code to go the
other way: take a number accurately spelled out in words in the
foreign language and convert it to a long.
- You also get bonus points for being the first to submit a
long->words class with Java source for any actual human
language to the Learn To Count class library. There is no point
in great duplication of effort there.
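To make the contest concrete, here is a sketch of the sort of black-box long->words class you would be handed, restricted to English and to 0..999,999 for brevity. The class and method names are my own invention, not part of the Learn To Count library:

```java
// Minimal sketch of a black-box counter class the contest would probe.
// Handles English only, 0..999,999; a real target class handles any long.
class EnglishCounter {
    private static final String[] ONES = { "zero", "one", "two", "three",
        "four", "five", "six", "seven", "eight", "nine", "ten", "eleven",
        "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen",
        "eighteen", "nineteen" };
    private static final String[] TENS = { "", "", "twenty", "thirty",
        "forty", "fifty", "sixty", "seventy", "eighty", "ninety" };

    /** Convert 0..999,999 to English words. */
    public static String toWords( long n ) {
        if ( n < 0 || n > 999_999 ) throw new IllegalArgumentException();
        if ( n >= 1000 ) {
            String head = toWords( n / 1000 ) + " thousand";
            return n % 1000 == 0 ? head : head + " " + toWords( n % 1000 );
        }
        if ( n >= 100 ) {
            String head = toWords( n / 100 ) + " hundred";
            return n % 100 == 0 ? head : head + " " + toWords( n % 100 );
        }
        if ( n >= 20 ) {
            // tens place, with a hyphenated ones digit if non-zero
            String head = TENS[ (int) ( n / 10 ) ];
            return n % 10 == 0 ? head : head + "-" + ONES[ (int) ( n % 10 ) ];
        }
        return ONES[ (int) n ];
    }
}
```

Your inference program would call such a class only through `toWords`, probing numbers like 21, 100 and 1000 to discover the tens/hundreds/thousands patterns.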
Book Store
Referral Applet
The intent of this Applet is to allow you to automatically
specify and keep up to date several sources for a book you
mention in a web page. Here is the problem that I and many web
authors have. We want to make a reference to a book e.g. Frank
van Gilluwe's excellent
The Undocumented PC (ISBN 0-201-47950-8). If you
click on that, you will be taken to www.Amazon.com. If you buy
the book I will get a 15% commission.
However, that is not necessarily the best place for you to
buy the book. If you look in the Java glossary under Book Stores you will find a
list of quite a few online book stores. One may have the book
more cheaply. One may be located closer to you. You may have a
special discount arrangement with one. Another may be able to
deliver it more quickly. You may have had a particularly bad or
good experience with some store. Some stores may not accept the
credit cards you use.
On my side, I may make even more lucrative arrangements with
other bookstores, that carry a smaller selection of books. I
don't have the time to keep track of which bookstores carry which
books and which stores have the best prices on each book. I don't
know from day to day which is the best place to buy each book.
Further, there is no best place. It depends on too many
factors.
I hope it is coming clear what this Applet does. You click on the
book, and it shows you just the bookstores that carry the book,
their price (including shipping, duty and taxes), delivery time,
sorted in ascending order by either price or estimated delivery
time. It also tells you how fresh/reliable the information is
(perhaps using icons) for each bookstore. It may further
automatically filter out stores that don't handle one of your
available cards, or use icons to show which ones are accepted.
Instead of embedding a reference to Amazon in my HTML, I embed a
reference to the comparison shopper Applet, along with the book's
ISBN as a parameter.
Applets are not allowed to go shopping, because the sandbox
security mechanism stops them from talking to any server except
the one they were loaded from. However they could ask the server
for this information. A servlet or application on the server is
allowed to go comparison shopping. It can keep this information
reasonably up to date. It need not comparison shop more than once
per day per book. It need not comparison shop about a book no one
has asked about recently.
The scheme would let you go visit any of the book stores and then
do your shopping and browsing there directly. The Applet's job
is simply to direct the people to the most promising on-line
bookstores for each book.
You can make money with this if you make commission arrangements
with most or all the bookstores you refer to. Further you can
rent out your Applet and possibly database server service as well
to owners of other websites containing referrals to books. They
would then get the referral commissions. They may simply use your
Applet free, and you get the commissions or split them 50/50. The
benefit to them is pleasing people browsing their website.
Your little engine can work given any ISBN. It need not be
one you have on file. The comparison shopper on the server goes
out looking while the customer waits. It might even provide
estimated prices and availability when it can't get through right
away to all the bookstores.
The ISBN consists of 9 digits plus one check digit (0..9 or X).
To calculate the check digit you must multiply the last digit of
the true number by 2, the last but 1 by 3 etc. and add these
results. The number needed to fill this sum to the next multiple
of 11 is the check digit. If the required digit is 10, the check
digit is written as the letter 'X' (the Roman numeral ten). In
case the user mistypes an ISBN, (e.g. by transposing two digits)
you want to explain that to him, rather than misleading him by
telling him no one has the book in stock. You can read up on the
rules for calculating the expected check digit.
Here is an example valid ISBN 0201479508. Here is how you ensure
that it is valid.
- add up weighted digits, (including the check digit with
weight 1).
0*10 + 2*9 + 0*8 + 1*7 + 4*6 + 7*5 + 9*4 + 5*3 + 0*2 + 8*1
= 0 + 18 + 0 + 7 + 24 + 35 + 36 + 15 + 0 + 8
= 143
- divide by 11. The remainder should be 0.
143 ÷ 11 = 13 remainder 0. Yes! this is a valid ISBN.
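The validation rule above can be coded directly. Here is a sketch (the class name is my own); the weights run 10 down to 1 across all ten characters, and the weighted sum must divide evenly by 11:

```java
// Sketch of ISBN-10 validation following the weighting rule above.
// Dashes are stripped first, as suggested below.
class ISBNCheck {
    /** @return true if the 10-character ISBN has a valid check digit. */
    public static boolean isValid( String isbn ) {
        String s = isbn.replace( "-", "" );
        if ( s.length() != 10 ) return false;
        int sum = 0;
        for ( int i = 0; i < 10; i++ ) {
            char c = s.charAt( i );
            int digit;
            if ( i == 9 && ( c == 'X' || c == 'x' ) ) digit = 10; // check digit only
            else if ( Character.isDigit( c ) ) digit = c - '0';
            else return false;
            sum += digit * ( 10 - i );   // weights 10, 9, ... down to 1
        }
        return sum % 11 == 0;
    }
}
```

Note that a single mistyped digit or a transposition of two adjacent digits always changes the remainder, which is why the check catches those common errors.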
ISBNs are sometimes written with dashes, breaking them up into
variable length fields, much like class A, B, C IP addresses. For
all practical purposes, you can ignore dashes and just remove
them. However, if you are a perfectionist and want to display
them properly you can read up on the
rules.
For code you can cannibalise to verify checkdigits and insert
dashes, see the ISBN Amanuensis.
If you want to get even more clever, you can hook into the search
engines of the various stores to determine the ISBN when the user
does not know that, then use the ISBN to go comparison shopping.
There are some comparison shoppers out there already, such as
mySimon, but this would be quicker since normally it would have
recent prices already on file on a given narrow class of books
e.g. Java.
How does this work under the covers? You need a very simple
database of book information, possibly a Hashtable, BTree or SQL
lookup by ISBN. This database is kept on the server. You can
access it via JDBC, CGI or a servlet talking HTTP or raw sockets.
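The Hashtable variant of that server-side database might be sketched like this. The class, field and record names are invented for illustration:

```java
// Sketch of the server-side book cache: a Hashtable from ISBN to the
// price records gathered overnight. Names here are illustrative only.
import java.util.Hashtable;
import java.util.List;

class BookCache {
    /** One store's current offer for a book. */
    static class Offer {
        final String store;
        final double price;   // in the customer's currency
        Offer( String store, double price ) {
            this.store = store;
            this.price = price;
        }
    }

    private final Hashtable<String, List<Offer>> byIsbn = new Hashtable<>();

    /** Record the offers gathered for a book. */
    void put( String isbn, List<Offer> offers ) { byIsbn.put( isbn, offers ); }

    /** @return offers on file, or null if the book must be fetched live. */
    List<Offer> lookup( String isbn ) { return byIsbn.get( isbn ); }
}
```

A null result is the signal to go comparison shopping while the customer waits, as described below.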
At night the server goes shopping, getting fresh prices on the
most popular items first, making sure to hop from store to store
so as not to put an undue load on any one store. Eventually there
may be a protocol for stores to notify you of price or
availability changes so you don't need to pester them with
nightly polls.
You want to avoid polling any more frequently than absolutely
necessary to remain on good terms with the book stores. Reduce
polling frequency if the store does not carry the book. Reduce
polling frequency if the price has not changed in a long time.
Reduce the polling frequency if no other books at the store have
recently changed prices. Reduce polling frequency if the book is
not popular.
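The back-off rules above might be condensed into a single function. The conditions and multipliers here are illustrative assumptions, not tuned values:

```java
// Sketch of the polling back-off rules above. Each condition that
// suggests less urgency stretches the interval; multipliers are guesses.
class PollScheduler {
    /** @return suggested polling interval in days for one store/book pair. */
    public static int intervalDays( boolean carriesBook,
                                    int daysSincePriceChange,
                                    boolean storeRecentlyActive,
                                    boolean bookPopular ) {
        int days = 1;                                 // default: nightly
        if ( !carriesBook ) days *= 7;                // store does not stock it
        if ( daysSincePriceChange > 30 ) days *= 2;   // price has been stable
        if ( !storeRecentlyActive ) days *= 2;        // no recent changes at store
        if ( !bookPopular ) days *= 3;                // nobody is asking
        return days;
    }
}
```

A popular, freshly repriced book at a busy store stays on the nightly poll; a stale, unstocked title drifts out to a poll every few months.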
If you want to get fancy, you can use cookies to track your
customers, mainly so you know what country/state/province they
live in to compute shipping charges and taxes accurately. See the
list of country codes. For
a simple version, you might leave taxes and shipping out.
However, you need that information to properly comparison shop.
It can be hard to worm it out of a book store owner before
committing to purchase.
The hardest part of this project is gathering the price
information in many formats from many stores. The interfaces you
are dealing with were designed for humans, not computers. They
may change at any time. Before you get heavily into that,
prototype the idea with fake or hand calculated data. Smooth the
rough edges off your user interface, simulate the delays and see
if the idea is acceptably quick, and whether end users love or
hate it. It may be a licence to print money or a black hole for
it.
After you have all this working, you might do a thin client
version as well, that does not require an Applet. It works by
generating pure HTML at the server.
BusTel, exchange
electronic business cards during a voice call
BusTel is a way of rapidly exchanging name/address/credit
information full duplex during a voice call. This fits in
with the "Common Phone" design contest (see BIX
telecomm.tech/phone.co) which will implement the BusTel protocol
in desk phones. You tap the heart/trust key to exchange
names/addresses. You tap twice to exchange credit card info. You
tap three times for direct debit exchange. This project has
potential to generate big bucks. Basically it works by exchanging
vCard packets full duplex in the middle of a voice call. There is
an API for applications to extract the vCard information. The
default version just displays it with a standard vCard viewer.
I have a device for tying two phones back to back, simulating a
telephone exchange, that would be useful in debugging such a
project.
The other form of BusTel is to exchange encrypted vCards over the
Internet for shopping cart use. You and the vendor exchange
information about each other.
See vCard, JavaPhone in the Java Glossary.
Child
Abuse Database
Tracks suspicious incidents, in themselves not conclusive, but
when viewed in totality show a picture of imminent danger. This
is the brainchild of Alexander Investigations, run by Glen Morrison,
an ex-cop and private investigator. By tying together reports
based on numerous small details which indicate they may refer to
the same person, you can build a more complete picture of a
suspect.
Imagine computing a correlation matrix of every report against
every other report by how many points of similarity they had. You
might then be able to group reports into related families. It
would greatly help an investigator if he or she had to look only
at one family of reports at a time. The computer might then
report the largest families as ones most worthy of deeper
investigation.
If a very serious incident were reported, such as an attempted
abduction, an investigator could ask to see all reports closely
related to that crucial one. One of them might contain something
very valuable like a license number or very detailed physical
description. Perhaps one of the suspect's frequent haunts would
come clear. From those aggregate reports, the investigator might
be able to put together a picture of the suspect's movements over
time, and use that to help gather more information.
The main intent is to deal with anonymous sexual predators, but
the system might also work to report incidents of physically
abusive parents. By seeing a large number of reports of a pattern
of abuse from many different witnesses, authorities might be more
willing to act in a case where the parents are good liars and
pass off the results of abuse as accident proneness.
Confidentiality is very important. The whole project smacks of
"Big Brother is watching you". It treads on moral thin
ice. Most of the reports in the database will be completely
innocent. A father may be reported as lurking at a playground
when he was just waiting for his daughter to finish her swimming
lesson. Ideally some group independent of the police would run
the database. The database would never be stored on a machine
with an Internet connection and it would need to be stored in an
encrypted format. Since sexual child abusers may move from city
to city, ideally there should be ways of consolidating databases
or exchanging information between databases for different cities.
You might permit reporting suspicious events via the Internet,
but you would not want the main database online.
What sort of data might you potentially record about a suspicious
incident? Here is a very rough guide of the sorts of fields that
would be needed. To refine the fields and multiple choice value
encodings, you need to try encoding a few hundred sample
suspicious event reports.
Incident Report
- date/time called in.
- date/time of suspicious event.
- name of reporter (optional).
- gender of reporter.
- phone, address, contact information of reporter (optional),
- type of location of suspicious occurrence: playground,
school, park, residence, other.
- street address of suspicious occurrence.
- grid location (e.g. latitude/longitude). Ideally this could be
deduced automatically from a street address.
- what happened that made you suspicious (encodings
needed).
Person Report
- name of suspect. Will not normally be known, but a first name
or nickname might be.
- Contact information (phone, address, hangout) for the
suspect. Will not generally be known.
- height: entered in either English or metric units; stored
internally in metric.
- weight: skinny, slim, average, heavyset, obese, athletic.
- race: Caucasian, black, hispanic, Asian, Indo.
- country of origin: Australia, New Zealand, USA, Germany,
France, England, ... See the list of country codes.
- glasses: yes, no.
- facial hair: clean shaven, mustache, beard.
- head hair: skinhead, bald, short, long, shoulder length,
shoulder plus.
- hair type: curly, straight, wavy.
- hair colour: brown, black, blond, red, dyed.
- eyes: brown, blue, green, hazel.
- tattoo: yes, no, description: (keyword encodings needed, make
up new ones on the fly).
- distinguishing features: (keyword encodings needed, make up
new ones on the fly).
Clothing Report
- jacket: blue, black, yellow, red, orange, green, indigo,
light, dark.
- pants: long, short, blue, black, yellow, red, orange, green,
indigo, light, dark.
- shirt: short sleeves, long sleeves, muscle shirt, blue,
black, yellow, red, orange, green, indigo, light, dark.
- watch: on left or right hand.
- rings: count.
- gloves: leather, knit.
- hat: baseball cap, colours, toque, cowboy hat, military,
police.
Vehicle Report
- license plate number, with province/state/country. See the list of country codes.
- type of car: van, car, big, small, American, European,
Japanese.
- colour: light, dark, red, blue, green, brown, black, white,
yellow, purple, two-tone.
- damage: any combination of: front, back, left, right,
windows.
Matching logic would be very similar to that used in an Abundance
program I wrote for the Humane Society to match lost cats with
their owners. For example a match of red hair against blond is
considered a partial match -- better than red against black, but
not as good as red against red. Some sorts of match are more
important than others. You can build up a points/weighting
scheme. In the Abundance Cats program I did most of the
cascaded-weighting calculations at compile time, normalizing the
weights then, and moved as much as possible of the matrix
calculation out of the inner loops. This made the matching
computation very fast. The databases could potentially get very
large. You will need to rely on SQL and various background batch
processing. You use SQL for coarse match filtering, then weighted
point scoring for fine matching.
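The weighted partial-match idea might be sketched like this. The weights and the red/blond similarity value are assumptions for illustration; the Abundance original would differ:

```java
// Sketch of weighted partial-match scoring between two reports.
// Similarity values and attribute weights are illustrative guesses.
class ReportMatcher {
    /**
     * Similarity of hair colours on a 0..1 scale: red vs blond is a
     * partial match, better than red vs black, worse than red vs red.
     */
    static double hairSimilarity( String a, String b ) {
        if ( a.equals( b ) ) return 1.0;
        boolean nearMiss = ( a.equals( "red" ) && b.equals( "blond" ) )
                        || ( a.equals( "blond" ) && b.equals( "red" ) );
        return nearMiss ? 0.5 : 0.0;
    }

    /** Combine per-attribute similarities with normalised weights. */
    static double score( double hairSim, double heightSim, double plateSim ) {
        // a licence plate match matters more than hair colour
        double wHair = 2, wHeight = 3, wPlate = 5;
        double total = wHair + wHeight + wPlate;
        return ( wHair * hairSim + wHeight * heightSim + wPlate * plateSim )
               / total;
    }
}
```

The SQL coarse filter would discard pairs with no attribute overlap at all; only survivors get this fine scoring.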
Colour Chart
Have a look at the Netscape Colour
Chart. Create something similar as a Java Applet, so that
you could test your browser to see how it distorts Java colours.
There is no problem with all the Netscape/X11 colour names being
supported -- you can easily add them all as static final Color
constants or as ints, so you could dispense with the "using
name" columns of the table. Allow the following sorts:
alphabetical by colour name, HSB, HBS, SHB, SBH, BHS, BSH, RGB,
RBG, GRB, GBR, BRG, BGR. The Applet is a pretty toy that may also
help sharpen intuition about how the various colour
representations work. See RGB and HSB
in the Java glossary. The sorting could be done most rapidly with
RadixSort available with source
via the Java glossary.
For an added bonus, add columns that display the hue as
the spectral frequency and wavelength.
Then do a colour chooser bean. You can choose the colour by
keying: red/green/blue, hue/saturation/brightness, hex number,
decimal number or selecting Netscape X11 colour name. As you
change each value (or use the spinner controls), all the other
ways of expressing the colour change to stay in step.
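Keeping the representations in step relies on converting among them; the standard java.awt.Color class already supplies the RGB/HSB conversions, as this small sketch (with invented wrapper names) shows:

```java
// Sketch of the conversion core of a colour chooser bean, using the
// standard java.awt.Color conversions to keep representations in step.
import java.awt.Color;

class ColourSync {
    /** @return { hue, saturation, brightness }, each in 0..1, for an RGB triple. */
    public static float[] rgbToHsb( int r, int g, int b ) {
        return Color.RGBtoHSB( r, g, b, null );
    }

    /** @return packed 0xRRGGBB int for an HSB triple. */
    public static int hsbToRgb( float h, float s, float b ) {
        return Color.HSBtoRGB( h, s, b ) & 0xffffff;   // drop the alpha byte
    }
}
```

When the user edits any one field, the bean converts through one canonical representation and rewrites all the others.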
Constructor Amanuensis
The Constructor Amanuensis is a Java Program that helps you
rapidly generate boring repetitive code to provide a variety of
constructors to a class.
Why use it?
- The code generated is more likely to be accurate. How many
times have you written a constructor then neglected to do
something with one of the parameters? or done receipt =
this.receipt instead of this.receipt = receipt? The
compiler will never warn you.
- Generating this sort of code by hand is tedious, boring and
hence error prone.
- Even if the amanuensis could not produce code 100% ready to
run, it would be mostly accurate, and safer to customise with
global search and replace.
One kind of code I find exceedingly boring to write and maintain
goes like this:
/**
* Entry constructor.
*
* @param alloc allocation code.
*
* @param gl general ledger account number.
*
* @param shortName up to five character code for the account.
*
* @param longName up to 30 character description for the account.
*
* @param receipt what kind of receipt you get,
* T=income tax, B=business, N=none.
*/
Entry( int alloc, int gl, String shortName, String longName, char receipt )
{
    this.alloc = alloc;
    this.gl = gl;
    this.shortName = shortName;
    this.longName = longName;
    this.receipt = receipt;
} // end Entry constructor
Then you have to write several variants where you leave out some
or all of the fields.
How does this amanuensis work?
- You fill in the names of your class and the member variables,
possibly by cut/paste, or by just pasting the entire code of your
class onto the Applet.
- You fill in the attributes of the fields by selecting from
listboxes or filling in the blanks.
- You fill in 0, 1 or 2 lines of JavaDoc for each field.
- You click the CLASS button, then it generates code
like this for you:
class Entry
{
/**
* allocation code.
*/
protected int alloc;
/**
* general ledger account number.
*/
protected int gl;
/**
* up to five character code for the account.
*/
protected String shortName;
/**
* up to 30 character description for the account.
*/
protected String longName;
/**
* what kind of receipt do you get, T=income tax B=business N=none
*/
protected char receipt;
}
// end class Entry
- Beside each member variable is a checkbox. You can turn it on
or off.
- You then click the CONSTRUCTOR button and it generates
the code for a constructor using those variables as input:
- You can then change the checkboxes and punch out the variant
constructors like cookie cutters. e.g.
/**
* Entry constructor.
*
* @param alloc allocation code.
*
* @param gl general ledger account number.
*
* @param shortName up to five character code for the account.
*
* @param longName up to 30 character description for the account.
*
* @param receipt what kind of receipt you get,
* T=income tax, B=business, N=none.
*/
Entry( int alloc, int gl, String shortName, String longName, char receipt )
{
    this.alloc = alloc;
    this.gl = gl;
    this.shortName = shortName;
    this.longName = longName;
    this.receipt = receipt;
} // end Entry constructor
- You then click the GET/SET button and it generates the
code for all the dummy get set JavaBean property methods for each
of the variables. You may throw some of these away later, or
polish them up by changing the public scope, or adding safety
checks.
- The amanuensis might save this template away, so that you
could come back to it, modify the comments, variable names, add
new variables, etc, and crank out everything again, without
having to edit each piece of it individually.
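At heart the amanuensis is simple string templating. Here is a sketch of just the constructor-generation step, with the class name and formatting conventions assumed rather than taken from any real tool:

```java
// Sketch of the core code-generation step of the amanuensis: given
// parallel arrays of field types and names, emit constructor source.
class ConstructorGen {
    /** @return Java source for a constructor assigning each checked field. */
    public static String constructorFor( String className,
                                         String[] types,
                                         String[] names ) {
        StringBuilder sb = new StringBuilder( className ).append( "(" );
        for ( int i = 0; i < names.length; i++ ) {
            if ( i > 0 ) sb.append( ", " );
            sb.append( types[ i ] ).append( ' ' ).append( names[ i ] );
        }
        sb.append( ")\n{\n" );
        for ( String n : names ) {
            // this.field = field; -- the assignment people so often botch
            sb.append( "    this." ).append( n ).append( " = " )
              .append( n ).append( ";\n" );
        }
        return sb.append( "}\n" ).toString();
    }
}
```

The checkbox variants fall out for free: each combination of checked fields is just a different pair of arrays passed in.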
Defragger
A defragger is a program that rearranges files on hard disk to
speed up access. See defragger in the Java glossary.
You would want the defragger to put files in order by last access
date so that the rarely used files are near the center (end) of
the disk and the most used ones are near the outer edge
(beginning). You want each file in one contiguous lump. Ideally,
you want to move frequently used system files like the
directories, free space tables, and system swap file as well.
Within the i-node/MFT table you want to internally defrag as
well, putting the directories together, and ordering by last
access date, within that by directory. You can seek to optimise
either file open time or find-file if you don't have indexes to
rapidly search.
A defragger may also do other tidying such as consolidating free
space fragments, removing deleted dummy directory entries,
sorting directories, and collecting files from the same directory
together when their access timestamps are on the same day.
There may also be quick variants of the defragger that just seek
to quickly clean up the files with the most fragmentation.
The defragger comes in four pieces:
- The algorithm for how you shuffle the pieces of the files
around. It is a bit like a game of sliding squares. You can't
move a chunk into its final resting place if there is some other
file in that slot at present.
- The display that entertains and hypnotises the user as you
work. You display a coloured map of the disk showing its state
and what you are doing. It would display groups of 256 pixels
each representing a cluster, arranged in 16x16 squares. For small
disks each cluster may be represented by more than one pixel. You
might experiment with other groupings of pixels so that a
contiguous file showed as a long thin line, or a blob.
- The platform-dependent Java code for accessing and changing
the low-level disk structure.
- The platform-dependent native code for accessing and changing
the low-level disk structure.
Most operating systems have some files you may not move. You have
to simply work around them.
Because undebugged defraggers trash hard disks, you will want to
write a defragger simulator that is platform independent. You may
want to write platform dependent data-gatherers, and ask your
friends to send you copies of their messy disk layouts - not the
data, just the current file positions and fragmentation so you
can test your program on real-world situations.
In later debugging stages, you will want a large hard disk with a
free partition you can trash and recreate with a partition
copier.
The simulator lets various people try coding the algorithm safely
without trashing any disks or getting involved in the low level
details of defragging. You can then plug in the most efficient
defragger.
Your simulator's scorekeeper should reward doing i/o in large
contiguous lumps. It should reward keeping disk arm motions to a
minimum both in number and length of leap. It should, as best as
possible, simulate true elapsed time.
How do you defrag when there are other tasks using the files as
you work?
- You can use the operating system defrag interface such as the
inept one provided in NT. The problem with NT is that it lets you
move only one file at a time, though you can move a run of
clusters from that file at a time so long as they are all being
moved into a contiguous chunk of free space. You can read about
how the NT official defrag interface works at Systems
Internals. Moving small chunks at a time this way, shuttling the
heads back and forth from source to destination, is very, very
slow. It is quite incompatible with the algorithms I have
described for fast defragging. Further, you can't afford to do a
thorough defrag with such a limitation.
Ideally, you should be able to specify a long list of moves to
make, and NT would do them in one or more sweeps across the disk,
handling perhaps dozens of files per sweep, depending on how much
RAM it has available. The only restriction is that all the target
clusters need to be free before you submit the list. You would
not be allowed to do a list of moves that requires an internal
ordering of the moves to free up space before the rest of the
list of moves can complete. When you need to do that, you must
break the list in two and submit them separately.
If you watch Raxco PerfectDisk defragger's display as it uses the
NT interface, it can drive you nuts. Nothing appears to be
happening for 5 minutes at a stretch. It behaves like some bored
barber endlessly picking away at your hair, cutting individual
strands, shortening by never more than a millimeter per snip.
Then, all of a sudden, there is a noticeable little burst of
activity as it moves a large file, which can be done in
reasonably efficient chunks. Then it goes back to its nitpicking.
- You can run under some other OS on some other partition so
that the partition undergoing defragging is not active. e.g. you
might run your defragger under a tiny JavaOS partition to defrag
your Win9x FAT, OS/2 HPFS, NT NTFS and Linux ext-2 partitions in
sequence.
- You do your defragging by copying from one partition to
another empty one, perhaps on a spare hard disk.
- You run your program very early in the boot process before
any other disk-using tasks have started. This is how Partition
Magic deals with a similar problem.
What sort of algorithm should you use? Here is a suggestion. Sort
all your files so that you have a table of every cluster on disk
and where its final destination should be. This table can be
compressed where there are contiguous clusters within a file. You
need plenty of RAM and virtual RAM.
Imagine the disk arms are like a bus picking up passengers and
dropping them off as it sweeps back and forth across the disk.
The clusters being moved are like passengers. Here are some rules
of thumb.
- Don't pick up any more passengers if the bus is full.
- Try to keep the bus full.
- Don't pick up a passenger unless the passenger's destination
slot is currently free.
- Don't pick up a passenger unless the destination slot is in
the direction you are currently headed.
- Sweep the disk smoothly like an elevator. Don't hop all over
picking up and dropping off passengers the way most defraggers do
in their passion to defrag the disk methodically starting at
track 0.
- Prefer to pick up or drop off a line of passengers contiguous
on disk.
- Prefer to pick up clusters belonging to badly fragmented
files.
- Prefer to pick up clusters that belong to recently used
files.
- Prefer to pick up passengers that are far from their
destinations.
- If there are no passengers you can move directly to their
destinations, free up some space by moving a busload of
passengers to any free slots just past the end of where all the
files will eventually be moved. Don't move them all the way to
the end of the disk the way Norton's defragger does.
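The first few pick-up rules above can be condensed into a single predicate. This is a minimal sketch; the Move class and the bus-state parameters are invented for illustration, and the preference rules (fragmentation, recency, distance) would become a scoring function rather than a boolean:

```java
// Sketch of the hard pick-up rules for the "bus" defrag algorithm above.
// Soft preferences (fragmentation, recency) would be a separate score.
class DefragBus {
    /** A cluster run waiting to be carried to its final position. */
    static class Move {
        final int from, to;   // current and destination cluster numbers
        Move( int from, int to ) { this.from = from; this.to = to; }
    }

    /** Decide whether the sweeping bus may pick this passenger up now. */
    static boolean shouldPickUp( Move m, boolean destinationFree,
                                 int busLoad, int busCapacity,
                                 boolean sweepingForward ) {
        if ( busLoad >= busCapacity ) return false;   // bus is full
        if ( !destinationFree ) return false;         // slot still occupied
        // only carry passengers whose destination lies in the sweep direction
        return sweepingForward ? m.to > m.from : m.to < m.from;
    }
}
```

Everything else in the list is about choosing *which* eligible passengers to load first when the bus has more candidates than seats.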
The conventional wisdom is that defraggers must be crashproof. If
the power fails at any time during the defrag process, you must
lose no data. Perfection is impossible, since you can always fail
in the middle of a write to the directory or the free space
table. However, you can aim for recovery if the failure occurs
between writes. You must move files before you marked them moved
in the directory and free space tables. For a moment they exist
in two places. Only after the move is recorded in the directory
may you reuse the old space. The problem is crashproofing makes
the defrag take longer. NT takes this to ridiculous extremes. It
would take days to properly defrag a disk using the official
defrag one cluster at a time interface. Ironically this extra
time exposes you to more potential for power failure.
You might try simulating the algorithms used in the commercial
defraggers such as Norton
SpeedDisk for Windows9X, and NT, Executive Software
Diskeeper, Raxco PerfectDisk, or
PowerQuest Partition Magic. Then
if your algorithms are substantially faster, you might be able to
persuade them to reissue their products using your algorithms.
Because of the way it is written, your defragger might have more
general application, e.g. defragging the memory heap in a JVM to
optimise virtual RAM swapping. You might use it inside
proprietary firmware inside a hard disk to handle
defragging/marthaing that was transparent to the operating
system, thus creating a seemingly super-fast drive. It would
appear to have almost 0 seek time. It might be implemented
internally as two separate drives that spend their lives
mirroring each other while simultaneously defragging to the other
drive. By having two sets of heads, either on the same set of
surfaces or on separate sets, you would get away with much less head
motion. The software can then be as messy as it likes. The
hardware will clean up after it. In fact, the irony will be,
defragging with software will cause the system to temporarily
slow down!
Deleter, Deletes
Files with Wildcards
Programs like 4DOS and 4NT are painfully slow if you want to
sweep your hard disk to remove junk files, e.g. temp?.*, temp.*,
*.tmp, C:\temp\*.*. We need a program that will take a set of
wildcards (positive and negative), and in one pass over the
directories delete the described files. With the /S
switch you request recursion over subdirectories. You should also
be able to specify multiple directories with wildcards (positive
and negative).
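A one-pass sweep of this kind might look like the following sketch, using
java.nio.file glob matchers. The '!' prefix for negative wildcards and the
Deleter/sweep names are my own invention, not any existing tool's syntax.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.ArrayList;
import java.util.List;

/** One-pass wildcard sweeper sketch. Patterns starting with '!' are
 *  negative: the file is kept even if a positive pattern matches it. */
public class Deleter {
    public static List<Path> sweep(Path root, List<String> patterns,
                                   boolean recurse, boolean dryRun) throws IOException {
        FileSystem fs = root.getFileSystem();
        List<PathMatcher> keep = new ArrayList<>(), kill = new ArrayList<>();
        for (String p : patterns) {
            if (p.startsWith("!")) keep.add(fs.getPathMatcher("glob:" + p.substring(1)));
            else kill.add(fs.getPathMatcher("glob:" + p));
        }
        List<Path> deleted = new ArrayList<>();
        int depth = recurse ? Integer.MAX_VALUE : 1;  // /S switch = full recursion
        try (var stream = Files.walk(root, depth)) {
            for (Path f : (Iterable<Path>) stream::iterator) {
                if (!Files.isRegularFile(f)) continue;
                Path name = f.getFileName();
                boolean doomed = kill.stream().anyMatch(m -> m.matches(name))
                              && keep.stream().noneMatch(m -> m.matches(name));
                if (doomed) {
                    if (!dryRun) Files.delete(f);
                    deleted.add(f);
                }
            }
        }
        return deleted;
    }
}
```

A dry-run mode like the one above is worth keeping in any real version: a
wildcard deleter is exactly the kind of tool you want to rehearse with.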
Delta
Creator
Let us say you send somebody a file, then you change the original
file. Instead of sending the whole new file, you can send just
the changes. How do you detect the changes?
I suggest inventing a generic way to do this that works with
files other than just text files. Current schemes are not very
bright and end up creating delta files almost as big as the new
file. They are not good at noticing that text has simply been
reordered.
Here's how you might tackle it. For each file type you write a
chunker. It divides the file into chunks. For a text file,
it may divide at each newline character. For an MSWord file it
may divide at each paragraph. For a FORTH BLK file, it might
divide at each 1024 bytes. For a Btree file it chunks out each
record...
Then you compute a hash on each chunk of the original file and
index it in a Hashtable along with a reference offset and length
into the old file. You don't put the text itself in the
Hashtable.
To compose the delta file, you look at each chunk in the new
file. You compute its hash, and look it up in your Hashtable. If
you find a match, you make a note that the chunk can be had from
offset/length of the old file. If you don't find a match, you
have to insert the chunk from the new file into your delta file.
Very often you will get a sequential run of matching chunks from
the old file. In your notes, you can collapse these entries into
one big offset/length reference.
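The chunk-hash scheme above might be sketched like this, using lines as
chunks. The DeltaCreator name and the COPY/INS delta format are invented
for illustration; a real tool would hash raw bytes and plug in a chunker
per file type.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Chunk-hash delta sketch. Here a chunk is a line, and HashMap's own
 *  hashing stands in for computing an explicit chunk hash. */
public class DeltaCreator {
    /** Delta entries: "COPY offset,count" (run of chunks from the old
     *  file) or "INS text" (chunk that must travel in the delta). */
    public static List<String> delta(List<String> oldChunks, List<String> newChunks) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < oldChunks.size(); i++)
            index.putIfAbsent(oldChunks.get(i), i);   // chunk -> offset in old file
        List<String> out = new ArrayList<>();
        int runStart = -1, runLen = 0;                // pending run of matches
        for (String chunk : newChunks) {
            Integer at = index.get(chunk);
            if (at != null && runLen > 0 && at == runStart + runLen) {
                runLen++;                             // extends the current run
            } else {
                if (runLen > 0) out.add("COPY " + runStart + "," + runLen);
                if (at != null) { runStart = at; runLen = 1; }
                else { runLen = 0; out.add("INS " + chunk); }
            }
        }
        if (runLen > 0) out.add("COPY " + runStart + "," + runLen);
        return out;
    }
}
```

Because matching is by chunk rather than by position, reordered text still
collapses into cheap COPY references, which is exactly the weakness of the
dumber schemes described above.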
Creating delta files of *.exe and *.com files requires some
special handling. Experiment by adding a single instruction to a
program and notice how that tiny change ripples and causes changes
throughout the rest of the file. The idea is to send only the
fact that one extra instruction was added, and somehow
generate all the rippling side effects instead of
transmitting them.
Once you have these low level tools in place, set up a scheme so
that customers are automatically kept up to date from a master
copy, using the Internet. The scheme should recover even if
customer should accidentally delete files, corrupt files, or
restore out of date files.
Dissolve
Designer
I saw a special effect for the first time on TV in a bank
commercial. It was a dissolve from one scene to another. The
image first dissolved into coloured tiny tiles (each the color of
the average of the pixels in the image in that tile.). The tiles
independently scooted about to a new position where they formed a
new image. During this time the tiles smoothly change colour, and
fatness of the diamond shape to morph into the new image. The
transition was a sort of two-bladed cyclone pattern.
I thought a dissolve designer might be a fun toy for a newbie to write.
The program may render more slowly than real time; the finished
dissolve can still be recorded and played back at full speed on TV.
The end user could control various parameters of the dissolve
(e.g. tile size, intermediate colours, trajectories) and see the
results on pairs of images he supplies. A fancier version would
let the end user write simple functions to be included. You might
parse these or compile them and load them via Class.forName, the
way the LearnToCount
program does.
Domain
Registry
Domain registry exchange where all domain names for sale are
listed. You can search for suitable names by many different
criteria. The project supports itself by banner ads, broker
subscriptions (ad-free), and possibly a small listing fee for a
prime real-estate location. To be successful you will need to be
perceived as the place to go where all names are listed, whether
the owners pay a listing fee or not.
You need a way for owners to register their names online. You can
get the owner details from the InterNIC database. You should send
the owner an automated email asking permission to list. That way
you can foil pranksters putting up other people's domain names
for sale.
Part of this database would be who is ultimately in charge of
assigning names for each TLD (Top Level Domain) and
what the yearly fees are, and whether the server must be located in the
country of registry. Check with IANA to begin your
researches.
Another part of the database would be brokers. You track which
TLDs they handle, and what their fees are, and where on the
planet they have their offices. After you get a fair sized
database, you could start charging for a listing, or for a prime
real estate listing. Check with RIPE and
CoreNic to start
your researches collecting brokers.
See the essay on domain names for
further background.
Domain
Search Engine
This is a database designed solely to help you find a website. By
forcing all hits to point only to a home page, you prune out a
huge amount of material, making it faster to find a website.
A variant of this is a product database. All hits in it point to
a single definitive page on the vendor's website.
Dynamic Version Control
Current version control systems stink big time. PVCS drives you
mad because the modules you want are always locked by someone
else. CVS drives you mad because other people's changes
constantly stop you from getting a clean compile. You have the
problem of manual or automated merges of changes done based on
superficial similarities in bits of text. The process is
hideously error prone. Talking about the last stage of the
project has branded me a lunatic. Here is my vision.
We will have programming teams scattered over the planet. They
will simultaneously be working on the same code with a 24-hour
Internet high speed ADSL or better connection. As you make each
change to the source code, that change is broadcast and reflected
to anyone who is simultaneously looking at that part of the
program.
Changes are made atomically by syntactic element, not character
by character. They are transactions to your database. The
database always reflects a syntactically correct program,
including correctly spelled variable names. It might not work,
but it is syntactically correct at all times.
I now have to wave my hands a bit. Each programmer works
logically on a transparent sheet of film, overlaid over the
program called a job sheet (or job for short). He puts his
changes on that film. Other programmers work on their own layers
of film. Each layer of film represents a job -- a unit of work to
make some complete change. A programmer may have several layers
on the go at once, and several programmers may work on the same
layer.
You can choose to hide or apply other programmers' layers to your
work. You reveal them to see how upcoming changes will interfere
with your work. You can hide them if the new layers don't quite
work yet and you need to test your code.
There is a process where a layer is proven to work, and gets
incorporated into the base layer. It can be backed out, but
basically once it is incorporated, you have to deal with any
changes it has created.
Highlighting helps you see areas where your changes are
conflicting with somebody else's both before and after that
person's changes are committed. Since both of you can see the
conflict arising, there is a good chance the two of you will have
a conversation about how to create easily mergeable code.
Because the code is in a database, and every syntactic element in
the program has a unique ID, there is less guesswork about what
will happen when changes are merged. Further, both of you
see the merged changes on your screen as you work. You can
flip back and forth removing and adding other layers from your
view without changing any code, just as you can remove or add
layers for your compiles.
Compilation takes little time since the database is already
pre-parsed. Even so, you may have some background speculative
compilation process on the server, working on whatever piece
seems most stable, so that when you are ready to run to debug,
most of the compilation will already be complete. When the server
has nothing better to do, it can do various optimisations and
native code generation, just so it will be ready when you ask for
them.
How could such a thing be implemented? I imagine it might work
like this: You have a central repository of code that lives
pretty much entirely in RAM. It may have some ability to store
some of the lesser used information on disk, and to lazily back
it up to disk. However, persistent store classes let the designer
of the SCID pretty well pretend the entire database is RAM
resident.
You have a parse tree. You submit transactions to a common
change server queue to request changes to the parse tree. The
parse tree is never in an inconsistent state. It is impossible
for it to represent an invalid program. Transactions are checked
before they are applied atomically. So you don't change the parse
tree keystroke by keystroke, but node by node. You can
reconstruct the parse tree by replaying various streams of
transactions. Every transaction is timestamped, and stamped with
author, job layer and reason for the change. A global rename, for example,
could be done with a single transaction. It would be trivial
since there would actually be only one copy of the name stored in
the database. All other references would be numeric or by
address.
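The one-copy-per-name idea can be sketched in a few lines. The SymbolStore
class and its methods are hypothetical; a real SCID would store a full
parse tree, but the principle is the same: code references identifiers by
numeric ID, so a global rename is a single write.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the one-copy-per-name idea. The code stream holds only
 *  numeric IDs; spellings live in a single table. */
public class SymbolStore {
    private final Map<Integer, String> names = new HashMap<>(); // id -> spelling
    private final List<Integer> codeStream = new ArrayList<>(); // refs by id only
    private int nextId = 0;

    public int intern(String name) {
        int id = nextId++;
        names.put(id, name);
        return id;
    }
    public void ref(int id) { codeStream.add(id); }

    /** The "global rename" transaction: one write, every reference follows. */
    public void rename(int id, String newName) { names.put(id, newName); }

    public String render() {
        StringBuilder sb = new StringBuilder();
        for (int id : codeStream) sb.append(names.get(id)).append(' ');
        return sb.toString().trim();
    }
}
```

There is no text to merge and nothing to get out of sync: every rendered
occurrence of the name is reconstructed from the one stored copy.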
When you have several remote sites, you can replicate the entire
database at each site. You might have a centralised scheme or a
peer-to-peer scheme. Centralised is better if the communication
lines are very high speed and the server is very powerful
compared with the workstations. In a centralised scheme, for
speed, the server broadcasts copies of both the source code
database and the compiled results to everyone so they have local
copies for high speed access. Basically transactions are
broadcast and all copies of the database update themselves.
If a database is corrupted, it is automatically regenerated from
one of the remaining copies.
Transactions are serialised by
timestamp, shared, and processed in standard order between all
sites to keep all sites up to date. There has to be some
mechanism for provisionally changing the local copy of the parse
tree while waiting for the earlier-dated transactions to come
from the other sites.
Fast Snail
Mail
If you want to send something to Indonesia fast, you are out of
luck. What you could do is send it electronically, have it
printed there or copied to floppy, then delivered by courier to a
non-connected Indonesian company. They could send stuff back by
having it scanned, (possibly OCRed), then sent back. With a
decent scanner you could handle large documents or high
resolution. All transmissions are compressed, digitally signed
and encrypted.
You build an easy to use suite of software to handle electronic
submission, routing, billing. You then encourage people
throughout the world to set up interconnecting stations. It would
be low overhead, and an extra source of income. For the most part
the owner of the site would have to do absolutely nothing except
put printouts in envelopes and apply the pre-printed labels.
Submissions are handled completely electronically.
If the law will allow it, you could also send money by the same
scheme. You might be able to bypass banking laws by buying and
selling some token, such as shares in the company, or computer
services. Banks charge about $50 to wire money to Indonesia, and
they keep it for a couple of weeks before delivering it. They
make you fill in pages of forms, and wait in lines for hours.
This service could handle it for peanuts with no wait time.
Unfortunately with a buy/sell kludge you could end up with huge
fees in sales taxes.
Foreign
Language Trainer
The purpose of this Applet/application is to teach someone to
pronounce words in a foreign language correctly. I will assume
that language is Esperanto, though it would also apply to other
languages. You can start simply and work up.
- Point at a word. The computer pronounces it in Esperanto. All
you need here is a web page with downloadable AU
files.
- Use the JMF classes to record the
student's voice. Play his voice back to him. Play the
instructor's voice back pronouncing the word or phrase
correctly.
- As above, but play the two voices simultaneously, one in each
ear.
- Analyse the voiceprints in various ways, and display the
student and instructor visually side by side.
- Animate motions of the tongue.
- Allow the student to input arbitrary Esperanto sentences. The
computer pronounces them as best it can by stitching
together recordings of individual words.
- Use generic text to speech algorithms. This allows the
student to hear arbitrary texts written in Esperanto with roughly
the fluency of Stephen Hawking. It would also be useful for the
blind to access materials in Esperanto.
- Other language skill drills not directly linked to
pronunciation, such as computer asks a question in English and
the student must answer in Esperanto or vice versa, or
Esperanto-Esperanto. See the Learn To Count Applet for
classes to give Esperanto number drill.
- Internet tie in to live human instructor who monitors perhaps
40 students simultaneously. His interactions are not
necessarily in real time the way they are in a conventional
language lab. Students can submit verbal and written
questions.
- IBM's Alphaworks
has a speech recognition package that works with Java. Perhaps
it could be hooked up to recognize how well a student pronounces
a given word. If not, perhaps its authors might consider exposing
a bit more of the package's innards to allow it to be used this
way.
- Think about how you might use this technology that you have
developed to help people with sensory or motor disabilities,
including employment opportunities.
Form
Filler
This is designed as a plug-in to a browser such as Netscape. The
plug-in looks at a form, usually one containing name, address
etc. and fills it in reasonably intelligently from a set of
canned fields configured into the browser. It would have
heuristic logic to decide which data goes in which fields. Test
it out at various shopping cart sites and various sites that
don't take email, where they want you to fill in a form to
register a complaint. Even if you get it 80% right, you still
save the user time typing.
Design a standard set of field names. Give your standard a fancy
acronym. Bug websites who don't follow it, and send them a copy,
explaining the advantages of standard field naming, and giving
plenty of examples. Extend for other languages besides English.
Make sure you can handle telephone numbers and addresses outside
the USA.
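The heuristic matching might start as simply as a keyword table. The
keywords and canned field names below are invented examples; a real filler
would also look at surrounding label text and learn from corrections.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Heuristic field matcher sketch: map a site's arbitrary field names
 *  onto canned user data by keyword substring. */
public class FormFiller {
    // Order matters: earlier keywords win when several match.
    private static final Map<String, String> KEYWORDS = new LinkedHashMap<>();
    static {
        KEYWORDS.put("mail",  "email");      // matches "e-mail", "EMailAddr"...
        KEYWORDS.put("first", "firstName");
        KEYWORDS.put("last",  "lastName");
        KEYWORDS.put("phone", "phone");
        KEYWORDS.put("tel",   "phone");
        KEYWORDS.put("zip",   "postalCode");
        KEYWORDS.put("postal","postalCode");
    }

    /** Returns the canned-data key to use for a form field, or null if
     *  the heuristic has no idea. */
    public static String guess(String formFieldName) {
        String n = formFieldName.toLowerCase();
        for (Map.Entry<String, String> e : KEYWORDS.entrySet())
            if (n.contains(e.getKey())) return e.getValue();
        return null;
    }
}
```

Even this crude substring approach gets a surprising fraction of shopping
cart forms right, which is all the project needs to save typing.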
HTML Glossary Presenter
It takes a glossary and imports it into an SQL database with
crosslinks between entries. You might maintain it as a database
and export HTML periodically for the Java challenged, or you
might maintain it as HTML and periodically import it into the SQL
database.
You use a Java program to browse it. When you are reading an
entry, in the background it tries to guess what you will read
next and reads links, and also the alphabetically previous and
next entries. The Applet just hands the HTML record off to the
browser to render.
If all goes well you should be able to browse almost as fast as
if the entire database were already downloaded in RAM. A simpler
version would simply take "See xxx,yyy" entries in the
glossary and verify/build/rebuild HTML links for them, generating
plausible, unique NAME targets.
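The background prefetch idea can be sketched with a small thread pool. The
GlossaryPrefetcher class and its loader function are assumptions standing
in for a real HTTP fetch of glossary entries.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

/** Prefetch sketch: while the user reads one entry, fetch its links and
 *  its alphabetic neighbours in the background. */
public class GlossaryPrefetcher {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final ExecutorService pool = Executors.newFixedThreadPool(2);
    private final Function<String, String> loader;  // entry name -> HTML

    public GlossaryPrefetcher(Function<String, String> loader) { this.loader = loader; }

    /** Fetch the entry now; queue the likely next reads in the background. */
    public String read(String entry, List<String> likelyNext) {
        String html = cache.computeIfAbsent(entry, loader);
        for (String next : likelyNext)
            pool.submit(() -> cache.computeIfAbsent(next, loader));
        return html;
    }
    public boolean isCached(String entry) { return cache.containsKey(entry); }
    public void shutdown() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

The likely-next list would be the entry's own links plus its alphabetic
neighbours, so most reads hit the cache as though the whole glossary were
already in RAM.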
GridBag Amanuensis
I have used the Symantec Visual Café 2.5a visual layout
editor to generate source code for GridBag layouts. As you work,
the display adjusts immediately to show you the effective results
of the new constraints. It is quite a bit better than coding by
hand and doing compile test cycles, but it still leaves a fair
bit to be desired. Here are some additional features I would
like, besides the ability to fill in GridBag constraints via
fill-in-the-blanks.
- Optionally show a pale green grid to delimit the cells, with
labels around the perimeter.
- Allow you to select a group of elements and delete, or shift
them as a group in steps.
- Allow you to insert an empty row or column.
- Allow you to tidy the table with a single command to remove
all empty rows and columns.
- Allow you to select a group of elements, and simultaneously
set some attribute on all of them. The top left element acts as a
proxy for the others. So, for example, moving the top left
element to a new position would move all the other selected cells
to the corresponding position, not to the same cell.
- It should never let you accidentally put two elements in the
same grid cell. However, if you decided to have it allow you to
temporarily have collisions, both of the cell contents should be
visible and the conflict should be highlighted. It should be easy
to select either of the two cell's inhabitants.
- It should let you swap two selections.
- Allow you to change the type of a cell without losing all the
other attributes, e.g. convert a label to a TextField or vice
versa.
- Allow you to rename the fields.
- Allow you to use named Color static final constants or
complex expressions. It must not insist on selecting an absolute
Color number.
- Though it may work by saving pickled objects or data tables
rather than generating source, it should also be able to import
GridBag Java source code that it did not produce. It should leave
undisturbed any embedded code it does not understand.
- It should attempt to optimise the resulting code, avoiding
duplicate Font, Insets and Color objects.
Hermit Crab
Variable Length Record Files
Hermit crab files allow you to efficiently manage variable length
records that can grow and shrink. You can access records by an
integer relative record number. Hermit crab files do not provide
access by key, though you might build an in-RAM index or
Hashtable by key fields that gives you the relative record
number. They are called hermit crab files because as records
grow, they abandon space "shells", which are
then recycled for use by smaller records.
For details on how they work see this essay.
HTML Table Reorganiser
This lets you extract an HTML table and reorganise it, e.g.
remove a column, transpose two column, add an empty column,
permute the columns. Basically you hand it a tuple:
1,9,3,8,0,0,2
Which means construct me a new table composed of column 1 of the
original, then 9, then 3 then a dummy, then another dummy, then
column 2. Ignore the rest of the table.
The other thing it can do is transpose, convert the rows to
columns and the columns to rows.
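Applying the tuple to parsed rows is straightforward once the HTML is
broken into cells. This sketch assumes that parsing is done elsewhere; the
class and method names are invented.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the column-permutation tuple. Indices are 1-based into the
 *  original row; 0 means insert an empty dummy column. */
public class TableReorganiser {
    public static List<List<String>> permute(List<List<String>> rows, int[] tuple) {
        List<List<String>> out = new ArrayList<>();
        for (List<String> row : rows) {
            List<String> newRow = new ArrayList<>();
            for (int col : tuple)
                newRow.add(col == 0 ? "" : row.get(col - 1));
            out.add(newRow);
        }
        return out;
    }

    /** Transpose: rows become columns and vice versa (rectangular input). */
    public static List<List<String>> transpose(List<List<String>> rows) {
        List<List<String>> out = new ArrayList<>();
        for (int c = 0; c < rows.get(0).size(); c++) {
            List<String> newRow = new ArrayList<>();
            for (List<String> row : rows) newRow.add(row.get(c));
            out.add(newRow);
        }
        return out;
    }
}
```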
HTML
Table Sorter
This lets you extract an HTML table and sort the rows based on
the specified columns. e.g.
1,4
means sort all the rows using column 1 as the primary key and
column 4 as the secondary tie-breaker key.
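The multi-key sort falls out naturally from chained Comparators. This
sketch assumes the table is already parsed into rows of cell strings, with
key columns 1-based as in the example above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Sketch of the multi-column row sort: the first key column is the
 *  primary key, later columns break ties, as in the "1,4" example. */
public class TableSorter {
    public static List<List<String>> sort(List<List<String>> rows, int... keyCols) {
        Comparator<List<String>> cmp = null;
        for (int col : keyCols) {
            final int c = col - 1;                       // 1-based -> 0-based
            Comparator<List<String>> byCol = Comparator.comparing(row -> row.get(c));
            cmp = (cmp == null) ? byCol : cmp.thenComparing(byCol);
        }
        List<List<String>> out = new ArrayList<>(rows);  // leave input untouched
        out.sort(cmp);
        return out;
    }
}
```

For the real project the comparison key would be the cell text with tags
such as &lt;b&gt; stripped, while the stored cell keeps its tags for export.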
After you have that working, tackle the problem of converting a
long skinny single-row or single-column table into a
grid, either by rows or by columns. Then convert it back again to
either a single row or single column. You might even also have
logic to count the elements, and do a square root to determine
the grid size to make a square grid.
Then consider the problem of merging two tables, and selecting
columns or rows from the combined table. For example you might
want to take two single column tables and merge them to create a
single table with two columns. You might want to join the tables
by matching a common key using an inner or outer join. Sounds
like SQL, doesn't it? Consider implementing the whole reorg
utility by importing tables into SQL, doing all your fancy stuff
in SQL, then exporting tables. You need to have a way of
stripping tags such as <b> so that SQL can sort columns.
However you need to keep the tags around for when you export HTML
tables again.
HTML
Tidier
The tidier makes it easier to perform edits on the HTML source
with a text editor. It might do things such as:
- Put 0, 1 or 2 newline characters before and after each tag as
appropriate. If there are newline characters or whitespace
characters that should not appear before or after a tag, remove
them to tighten up. You will probably need to make this table
driven since not everyone will agree which tags deserve a new
line and blanks lines before or aft. Some tags such as
<font> must be left alone other than possibly
consolidating whitespace inside, before or aft, into a single
newline or space. If you examine the HTML source of this document
you are reading, you can see it has been tidied with a crude
search/replace script for consistent newline behaviour.
By default you might make sure every <P>, <BR>,
<Hx>, <TR>, <TD>, <DT>, <DD>,
<OL>, <UL> and <LI> tag starts on a new line
with a blank line ahead of it.
- Remove excess blanks lines. No more than one blank line in a
row.
- Use consistent all upper or all lower case tags, or perhaps
upper case for outside BODY and lower case for inside BODY.
- Remove leading and trailing blanks on <LI>...</LI>
<DT>...</DT>.
- Avoid lines longer than 60 characters. Break lines at
auspicious points -- but never in the middle of a quoted string.
Avoid reflowing <PRE>.
- Sort meta tags in alphabetical order
- Put parameters of IMG in some standard order, probably not
alphabetical, e.g. SRC then WIDTH, HEIGHT, BORDER then ALT. Ditto
for other tags with parameters such as FONT, DT, TABLE etc.
- Generate a new KEYWORDS metatag based on scooping the phrases
in H1, H2 and DT tags (avoiding FONT tags of course). Combine
it with the existing KEYWORDS metatag, removing duplicates and
putting the phrases in alphabetical order.
- Indent to show structure.
- insert missing </LI>, </DT>, </td>,
</P> etc. tags. Exactly which ones you want should be
configurable. So you may also remove unwanted clutter end
tags.
- replace " with &quot;, & with &amp;, and < with
&lt; where they occur in the context of ordinary text.
- Convert css tags to old style on request. That way you can
have the convenience of composing with css, but broadcast
compatible old style tags. Don't get too carried away, but you
might add some user macros too.
- consolidate tags, e.g. remove empty <b></b> pairs and merge
<b>both</b> <b>bold</b> => <b>both bold</b>.
Similarly for:
<font size="+1"><font
color="red">...</font></font>
=> <font size="+1"
color="red">...</font>.
- Convert colour names to hex or vice versa. See the chart of Netscape colour names.
e.g. convert <FONT
COLOR="papayawhip"> to <FONT
COLOR="#ffefd5">.
- A parser such as JavaCC or ANTLR might be useful. See parser
in the Java glossary.
- Provide an option to convert HTML to plain text by stripping
out all the tags. There are a few wrinkles to watch out for.
- > may occur improperly unpaired with <, in which case
treat it as an ordinary > character.
- < and > may appear inside quotes inside <..> in
which case they should be treated as ordinary characters.
- References should optionally be converted to English so they
are not lost. e.g.
You can read up on the <a
href="http://www.cwi.nl/~dik/english/codes/isbn.html">
rules</a> for calculating the expected check
digit.
could be rendered as:
You can read up on the rules [see
http://www.cwi.nl/~dik/english/codes/isbn.html] for calculating
the expected check digit.
If you wanted to get really clever you would avoid mentioning the
reference if it were already present nearby in plain text. e.g.
You can read up on the <a
href="http://www.cwi.nl/~dik/english/codes/isbn.html">
rules</a> for calculating the expected check digit at
www.cwi.nl/~dik/english/code/isbn.html.
should be rendered as:
You can read up on the rules for
calculating the expected check digit at
www.cwi.nl/~dik/english/code/isbn.html.
- You might also optionally render <b>Wow!</b> as *Wow!*.
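As a taste of the consolidation item above, here is a naive regex sketch
that removes empty <b></b> pairs and merges adjacent bold runs. Regexes
are only safe on simple input; as noted above, a real tidier should use a
proper parser.

```java
import java.util.regex.Pattern;

/** Naive tag-consolidation sketch using regexes. */
public class TagConsolidator {
    private static final Pattern EMPTY =
        Pattern.compile("<b>\\s*</b>", Pattern.CASE_INSENSITIVE);
    private static final Pattern ADJACENT =
        Pattern.compile("</b>(\\s+)<b>", Pattern.CASE_INSENSITIVE);

    public static String consolidate(String html) {
        String out = EMPTY.matcher(html).replaceAll("");
        String prev;
        do {                       // repeat until no adjacent pairs remain
            prev = out;
            out = ADJACENT.matcher(out).replaceAll("$1");
        } while (!out.equals(prev));
        return out;
    }
}
```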
Ideal lover
database search
Registry to help you find your ideal lover using principles from
Hendrix, Keyes and Vialogos. You fill out a very detailed
HTML questionnaire and get a list of the best matches. It is an
exercise in inner and outer honesty. Scanned photos too. I have
already been doing this manually for gay people.
Infinite
Disk, integrated file migration/backup
You write a program that backs up rarely used files over the
Internet, using an ADSL connection to a server. Whenever space
gets low, it frees disk space of files that have been backed up.
When the user goes to open those files, there is a pause while
they are retrieved from the server.
This scheme also acts as backup. All files can be backed up. To
speed backup, before they are sent to the server they can be super compressed, and only
changes sent via deltas.
The backup server can conserve disk space by noticing that,
for example, two customers both have identical copies of MS Word
For Windows DLLs installed. The server only needs to keep one
copy of each DLL. It has to be careful. Customers may have
identically named files that are not identical.
This is a very old idea. Univac 1106 mainframes with far less than
a megabyte of RAM used to integrate file migration with backup to
tape.
INI File
Tidier
Windows 9x/NT still use INI files in addition to the registry. A
*.INI file has [sections], parameters and comments:
[USER PREFS]
HOME URL=D:\mydir\index.html
; this is a comment on a line by itself
MAX WINDOW HISTORY LINES=100 ; this is a tail end comment
The task of your program is to sort all the sections in
alphabetical order, and within each section sort all the
parameters in order. Any sections without any entries are
deleted. There may be duplicate sections in the original. Merge
them. When there are duplicate parameters in a section, take the
first one and ignore the rest. Delete any parameters with null
values. For readability, put a blank line between sections.
The only thing tricky about this is handling comments. Does a
comment belong with the parameter/section above or below? If
there is a blank line separating it from one but not the other,
the answer is obvious. If the comment is the first or last line
of the file, again the answer is obvious. If you can't resolve
what to do with the comment you can handle the ambiguity in one
of four configurable ways:
- Get the user to resolve the ambiguity by inserting a blank
line, then rerunning your program.
- Assume it belongs to the line above.
- Assume it belongs to the line below.
- Assume it belongs to both, and duplicate the comment.
By configurable, I mean the user makes this choice ahead of time.
You don't ask the user on every comment.
You could do the sorting and merging of duplicates in at least
three ways:
- Create one giant array with cleverly constructed keys and
sort it. Then merge.
- Create an array of arrays with variable length rows. Sort the
rows then sort each column. Then merge. You may need two or more
passes through the data to find out how big to make the rows and
columns.
- Create a tree -- binary, BTree or n-ary. As you build the
tree, keep it in order and merge duplicates.
Since INI files are not that large, you can use totally
RAM-resident techniques, and you don't have to worry too much
about the efficiency of your algorithm in terms of order(n).
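The tree approach might be sketched with nested TreeMaps, which keep
sections and parameters sorted as they are inserted and make merging
duplicates trivial. Comment handling, the genuinely tricky part, is left
out of this sketch.

```java
import java.util.Map;
import java.util.TreeMap;

/** INI tidier sketch: sort sections and parameters, merge duplicate
 *  sections, keep the first of duplicate parameters, drop parameters
 *  with null values and sections with no entries. */
public class IniTidier {
    public static Map<String, Map<String, String>> tidy(String[] lines) {
        Map<String, Map<String, String>> sections = new TreeMap<>();
        Map<String, String> current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith(";")) continue;  // comments skipped
            if (line.startsWith("[") && line.endsWith("]")) {
                String name = line.substring(1, line.length() - 1).trim();
                // duplicate sections merge into the same map
                current = sections.computeIfAbsent(name, k -> new TreeMap<>());
            } else if (current != null && line.contains("=")) {
                String key = line.substring(0, line.indexOf('=')).trim();
                String value = line.substring(line.indexOf('=') + 1).trim();
                if (!value.isEmpty())
                    current.putIfAbsent(key, value);  // first duplicate wins
            }
        }
        sections.values().removeIf(Map::isEmpty);     // drop empty sections
        return sections;
    }
}
```

Writing the result back out with a blank line between sections is then a
simple loop over the two nested maps.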
Once you have that working, you can then create variants on a
theme. Merge a small ini file of changes into a big one. Setting
a parameter to nothing deletes the parameter. A section with no
entries is deleted.
Compare two messy ini files and display the essential differences
by creating a tidied, comment-less version of both and comparing
the parse trees in RAM.
Get your friends to donate a variety of real-life messy INI
files, the dirtier the better, for you to test your program on.
Once you have that working, create this variant. You create a
template example ini file that contains all the currently legal
sections and fields. You use that model to clean out a messy INI
file that is full of junk fields left over from previous versions
of the program.
See Sort in the Java glossary.
Java File
System
One problem with Java is it lacks a portable file system. Because
every file system is different on every platform, Java implements
the lowest common denominator.
What you do is allocate a giant file, say 2 gigabytes. It looks
like one file to the host OS, but to Java it will be a file
system containing many files.
If you do your work well, eventually it may be the only
file system of some future OS.
You invent methods to allocate space inside this file for
subfiles. You want to create all the ordinary file services:
- Lookup file by fully qualified name.
- Lookup file by high order part of name.
- Lookup file by low order part of name.
- Lookup files by last access or modify date. This allows a
program like AltaVista Discovery to rapidly discover the files
that have changed and need reindexing. Recently deleted files
need to be logged too so that indexers can drop them from the
indexes. Perhaps privileged application programs should be
allowed to hook themselves in and monitor all file status changes
system wide.
- Enumerate files that match some filter composed of a number
of positive and negative wildcards, or user-written.
- Lookup file by handle.
- read/write sequential and random.
- caching, and delayed write (lazy write) caching.
- associations -- by extension is a bit dangerous. A file
should be permitted to have several associations.
- Which app installed/created this file? Is it user data, or
part of an app package? URL where I can get an update of it if it
is corrupt.
- Security -- owners, read-only, hidden, system, file
signatures/keys/checksums.
- Dates -- last access, last update.
- locking.
- defragging. See the notes on the defragging project. The whole key to a fast
file system is to organise your disk layout to minimise arm
movement, i.e. to keep it defragged as you go.
- Marthaing -- keeping the most active files in the prime real
estate near the outer edge (near the beginning of the disk) near
the swap file and the directory, and the least active files, and
the bulk of the disk space in the inner tracks (near the end of
the disk). This could be done on per cylinder basis, or per file
or per cluster.
This file placement might presume a weekly reorg, or even better
an ongoing low priority background reorg.
When you start using a file, it gets moved by a background
process to prime real estate near the MFT, swapfile and
directories. Background processes move files that have been not
used in a while to less desirable real estate further away,
freeing up space, but not too much free space in the desirable
real estate territory. The bulk of your free space should be out
in the least desirable territory. Roughly speaking files should
be contiguous, and in order by last access date, and within that
by directory with the most recently accessed files in the more
prime locations.
The reorg process might be more integrated with ordinary I/O, so
that whenever you write, that cluster gets moved to prime real
estate as a side effect. You might do all your writes
sequentially, interleaving writes from various files. If you had
a two headed disk, you could dedicate one head to writes and one
to reads. The write head would not move much. You write in a
sequential circular squirrel cage pattern to a hot zone on disk.
A background process moves the oldest data out of the hot zone so
you don't wrap around and overwrite. You can think of it as a
second level of disk caching. You have a primary cache in RAM,
and a secondary level in prime disk real estate.
CPUs are getting faster relative to hard disk. RAM for caching
control information is getting cheaper. This means that, in
future, it will pay to use more and more elaborate techniques to
minimise disk head motions. You will be able to afford to do a
lot of thinking to avoid a tiny bit of needless movement.
This leaves the problem of a giant file that you update only
rarely. Perhaps the positioning algorithm needs to take into
account both frequency of use and last access date. Perhaps you
need a decaying hit counter. You multiply previous hits by 0.9 and
add today's hits to determine priority for prime real estate.
(The actual decay algorithm would be more sophisticated.) You
would also factor in size. A 1K file with 10 hits is much
more deserving of prime real estate than a 100 MB file with 10
hits.
- Temporary files -- clean up of orphaned temporary files.
Tracking who orphaned them.
- Getting more advanced, you might want an API for persistent
objects. Traditional file systems are hierarchical. You need
something so you can find stuff in a million different ways.
- Your filesystem would probably be designed with the idea of
massive amounts of RAM being available, and would not worry quite
so much about getting everything on rotating media immediately.
You would only have to deal with graceful shutdown, perhaps with
10 minutes notice from your UPS. This would give you a big speed
boost over traditional file and database systems that are overly
paranoid.
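The decaying hit counter and size weighting described above could be sketched like this. All class and method names here are my own invention for illustration, and the log damping on size is just one plausible choice:

```java
// Sketch of ranking files for prime disk real estate.
public class FilePlacement {

    /** Decay yesterday's score by 0.9 and add today's hits. */
    public static double decayedScore(double previousScore, int todaysHits) {
        return previousScore * 0.9 + todaysHits;
    }

    /**
     * Weight the score by size: a 1K file with 10 hits deserves prime
     * real estate more than a 100 MB file with 10 hits. The log keeps
     * the penalty from being absurdly large.
     */
    public static double priority(double score, long sizeInBytes) {
        return score / Math.log(sizeInBytes + 2);
    }

    public static void main(String[] args) {
        double score = 0;
        // a week of 10 hits a day; old hits fade, recent ones dominate
        for (int day = 0; day < 7; day++) {
            score = decayedScore(score, 10);
        }
        System.out.println("1K file:    " + priority(score, 1024));
        System.out.println("100MB file: " + priority(score, 100L * 1024 * 1024));
    }
}
```

The background reorg process would sort files by this priority when deciding what to move toward the outer tracks.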
You can start very simply, without a directory, requiring users
to allocate files by absolute block numbers, and just ensuring
you don't allocate the same block twice. You have been bitching
that Microsoft and IBM could not write a file system to save
their lives -- here's your chance to see if you can write a faster
one. Once you prove your structures, maybe the MS, IBM or Linux
folk will consider supporting your new partition type.
Java Powered TV Commercials
Does it drive you nuts when they play the exact same TV
commercial over and over? In future, Java-powered TV commercials
will slightly vary the commercial each time it is played. This
has several advantages:
- The commercial is less boring.
- Over time the commercial can present a wide range of
information.
- The commercial costs no more than a regular one. It does not
require any special effort on the part of the broadcaster to
modify it.
- The commercial can adapt itself to the audience demographics
or time of day.
The commercial contains MPEG video, audio snippets, and text
overlays that are combined into a full length commercial each
time it is broadcast.
Your job as students is to create a prototype commercial, perhaps
one with RealPlayer five-frames-a-second video, or just stills, that
demonstrates the concept, and can be used for more interesting
Web page ads. The differences introduced need not be great. They
may just be the top three takes of any one scene instead of the
top one. They may just be differences in order, background music,
voiceover, or titles with the same or almost the same video.
See random numbers in the Java glossary.
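A toy assembler shows the idea: for each scene, pick one of the top takes at random, so the ad comes out a little different each airing. The scene and take names are invented for illustration:

```java
import java.util.Random;

// Toy assembler for a "Java powered" commercial.
public class AdAssembler {

    private final Random random;

    public AdAssembler(Random random) {
        this.random = random;
    }

    /** takes[scene][i] is the i-th best take of that scene. */
    public String[] assemble(String[][] takes) {
        String[] cut = new String[takes.length];
        for (int scene = 0; scene < takes.length; scene++) {
            // any of the top takes will do; variety is the point
            cut[scene] = takes[scene][random.nextInt(takes[scene].length)];
        }
        return cut;
    }

    public static void main(String[] args) {
        String[][] takes = {
            { "open-takeA", "open-takeB", "open-takeC" },
            { "product-takeA", "product-takeB" },
            { "close-takeA", "close-takeB", "close-takeC" },
        };
        AdAssembler assembler = new AdAssembler(new Random());
        for (String clip : assembler.assemble(takes)) {
            System.out.println(clip);
        }
    }
}
```

The same trick extends to swapping background music, voiceovers, or title overlays.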
Java
Source Code Beautifier
This tidies Java source code to conform to the Sun specification.
It need not be a generic totally-configurable program. Just go
for the Sun standard. The one possible exception would be to allow
you to turn off K&R brace formatting and use the traditional
format where braces align. A parser such as JavaCC or ANTLR might
be useful. See parser in the Java glossary.
You might do some additional tidyings:
- Put attributes in a standard order, perhaps configurable.
e.g. always order them private static final int.
- Java has missing keywords. There is no unstatic when
you want to explicitly mark a method or variable as
instance. I suggest optionally inserting a comment of the
form /* instance */ to mark instance variables.
- Similarly there is no way to explicitly mark a method as not
private, not public, not protected. I suggest adding the
comment /* default */. We need a better name! Perhaps
package, but that is already a keyword. Perhaps
family or friend. Unfortunately friend
has inaccurate connotations from C++.
- There is no explicit unfinal keyword to mark each
variable. I suggest introducing a variable declaration with a
/* var */ comment. This would make it easy to scan for
variable declarations with a text editor. Java's syntax tends to
camouflage them.
- There is no easy way to scan in a text editor for the
beginning of each method. There is no distinctive keyword to mark
the beginning of each method the way there is in other languages.
Insert a special /* method */ comment at the start of
each method.
- Collect together declarations and methods in the same order
that JavaDOC would put them, with the publics at the top, etc,
and optionally alphabetically within categories or sorted by
type. It must be possible to turn this sorting feature off.
- Consider marking every variable declaration with a lead in
/* dcl */ comment, to make it easy to skip from
declaration to declaration with a text editor.
- Of course, you need a way to selectively remove all these
extra comments as well.
The problem is the added comments are bulky. A better solution is
a SCID that lets you hop to the next whatever you want to see,
and visually highlight important features of the language without
using a lot of precious screen real estate.
You might tackle just a small subset of the problem, e.g.
converting braces between same-line K&R style and
line-by-themselves style. After you succeed, you might add
another feature, then another.
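That starting subset might look something like this. It is a deliberately naive sketch: a real beautifier would use a parser such as JavaCC, since this regex approach ignores braces inside strings and comments:

```java
// Minimal sketch: move a "{" sitting on its own line up to the end
// of the previous line, converting to same-line K&R style.
public class BraceMover {

    public static String toKandR(String source) {
        // collapse "newline + optional whitespace + {" into " {"
        return source.replaceAll("\\s*\\r?\\n\\s*\\{", " {");
    }

    public static void main(String[] args) {
        String in = "void f()\n{\n    g();\n}\n";
        System.out.print(toKandR(in));
    }
}
```

Converting the other direction (braces on lines by themselves) would be the natural second feature.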
Java
Source Code Presenter
This converts Java source into HTML for public presentation, much
like Jonathan Revusky's JavaToHTML. It would colour various
syntactic elements. It would use bold on a variable in the place
it is defined. It would use variable sized () and {} depending on
how deeply nested they are. You can see some examples of the sort
of output I'm after by looking at the sample Java source code in
various parts of the Java glossary such as the HSB or properties
entry. It might use special markings to delimit methods and
classes. It would render the HTML in JavaDOC comments (e.g.
<b>length</b> < 100) in human form (e.g.
length < 100). For importing ordinary Java source code,
a parser such as JavaCC or ANTLR might be useful. See parser in
the Java glossary.
The tricky part is figuring out how to do decent variable
left indenting in HTML. I know of three techniques:
- Use a dummy gif and stretch its width. This is slow, verbose,
and does not look all that good in all browsers. Remember to use
border="0" to avoid a line around the dummy gif. The
advantage is you can also control vertical spacing by stretching
the gif height.
- Use <ul> to indent and </ul> to unindent. This has the
unwanted side effect of extra blank lines.
- Use <td colspan="n"></td> <td
colspan="max-n">real stuff</td>. You can't
control the indentation column widths with <TABLE
CELLSPACING="30">, because it affects vertical spacing
as well. You hide the tableness of the whole thing by using
<TABLE BORDER="0">. This technique has the added
advantage of properly indenting wrapped long lines that don't
quite fit on the screen in some browsers. See the HTML used to
create the waterfall entry in the Java glossary.
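A little Java helper could emit the third technique. The column count of 8 and the styling here are my own choices, not anything prescribed:

```java
// Sketch: emit one table row per source line, indenting with an
// empty colspan cell so wrapped long lines stay indented too.
public class HtmlIndenter {

    public static String row(String text, int indent, int maxCols) {
        StringBuilder sb = new StringBuilder("<tr>");
        if (indent > 0) {
            // the invisible spacer cell that does the indenting
            sb.append("<td colspan=\"").append(indent).append("\"></td>");
        }
        sb.append("<td colspan=\"").append(maxCols - indent).append("\">")
          .append(text).append("</td></tr>");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("<table border=\"0\">");
        System.out.println(row("void f()", 0, 8));
        System.out.println(row("{", 0, 8));
        System.out.println(row("g();", 1, 8));
        System.out.println(row("}", 0, 8));
        System.out.println("</table>");
    }
}
```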
Java Source Code
SCID-style browser/editor
SCID means Source Code In Database.
This is an enormous project, but you could start small. The basic
idea is you pre-parse your code and put it in a database. The
problem is programs are getting huger and huger. We need tools to
help you temporarily ignore most of them so you can concentrate
on your immediate needs. We need tools to rapidly navigate
programs. We need tools to help you get a mental forest picture
before delving into the tree detail.
I have been talking up the SCID idea since the early 70s. Mostly
people have just hooted with derisive laughter. However,
SCID-think is gradually catching on. The RADs, such as Visual
Café and Inprise JBuilder, let you write code to control
the properties of widgets on the screen by right clicking on
visual elements to view the associated properties. You can tick
off entries in pop-up listboxes and checkboxes or fill in the
blanks. This is an important step away from thinking of programs
strictly as linear streams of ASCII characters. Java Studio lets
you view and write Java code by playing plumber -- visually
connecting JavaBeans.
I think it is a case of the shoemaker's children having no shoes.
Programmers in creating source code in linear text files do the
equivalent of keeping their accounting books using a CP/M WordStar
text editor. We would never dream of handing a customer such
error prone tools for manipulating such complicated cross-linked
data as source code. If a customer had such data, we would offer
a GUI-based data entry system with all sorts of point and click
features, extreme data validation, and ability to reuse that
data, view it in many ways, and search it by any key.
Once you have your program pre-parsed, you can display the
program in a variety of ways. Here are just a few
examples:
- Make the beginnings of methods more visually obvious. The
plain Java syntax tends to camouflage them. Ditto for temporary
variable declarations.
- Hide comments.
- Show just loop structure.
- Show just code involved with class X.
- Hide all code that involves calls to the java.awt package.
- Highlight all code involving floating point.
- Highlight all code that does bit shifting or division.
- Highlight all uses of the sin method, but only in class
Moses, not Math.
- Show me a switch statement as if it had been handled
with a set of subclasses.
- Show me a set of subclasses as if the code had been handled
by a giant switch. This lets me compare the code in all
subclasses. Similarly let me compare just two subclasses.
- Show me the logic as if it had been written with a decision
table.
- Show me the definition of this variable or method just by
clicking it.
- Tell me which classes and methods call this method and how
many times. This XREF is always up to date because the source is
in a database. It does not need to be scanned periodically to
create a fresh XREF.
- Tell me which classes and methods look at this variable and
how many times. Similarly, tell me which ones change it.
- Collapse/expand level of detail. e.g. collapse detail of CASE
bodies, LOOP bodies, IF/ELSE bodies, parameter details leaving
just the names of the methods being called, collapse purely
arithmetic assignments.
- By using specially coded comments you can hide/reveal various
classes of them. You can hide code and just read comments, or
perhaps just overview comments, just the comments explaining what
the various classes are for etc.
- Show me the names of the parameters next to the parameters
themselves in each method invocation so I can proof-read it.
- Show me the JavaDOC comment for that parameter.
- Show me the program with the Spanish strings inserted. Show
it to me with the Spanish variable names where they are
available, but use English ones where not.
- Show me the program with all needless () levels removed that
our newbie programmer put in. The () are not actually stored in
the database. They are regenerated to suit the individual
programmer preference.
- Show me the program with extra () inserted because I can
never remember the precedence distinctions between && and
<.
- Show me the program with the Whazmotron custom glyph set so
that I can easily pick out if begin end, loop begin end, class
begin end, method begin end.
- What classes are available to me at this point in the
program? What local variables are in scope?
- You should be able to ask, what methods are available at this
point in the program that produce a Zomblat object? What methods
are available that take a Zomblat object as a parameter? What
methods are available that take both a Zomblat and a Color
object?
- What methods are available anywhere that produce a Zomblat
object? What methods are available that take a Zomblat object as
a parameter?
- Show/hide the Eiffelian pre/post assertions. You can fill in
dialog boxes about each parameter, variable or return results.
For ints you may specify the acceptable low-high bounds. For
strings you would specify whether they may be null, empty
"", and whether they may have lead/trail blanks. You
might specify that they must be all upper case, all numeric, all
lower case, no accented letters etc. For enumerations you would
specify the list of allowable values. For debugging, you can turn
this code on to ensure all the conditions are being met.
- Capture additional information about fields useful for data
entry, such as low-high bounds, blank if zero, left leading zero
fill, commas, lists of legal values, justification, natural
layout parameters, field name or display, prompts, field widths,
validation routines, ... Programs can access this data rather
than specifying it inline in the code. This keeps everything
about the variable in one place where it can be easily accessed
and changed. It also facilitates searching for fields that share
some property and bulk replacing it.
- That was an ambiguous name for that method. Change it
everywhere it is used to this clearer name, but don't change it
where that same name is used in another class. Computer, be
clever; don't pester me to figure out for you which ones
should be changed.
- Display the program using foreground/background color, font
family, font size, font style (bold, italic), lines, glyphs and
icons to pack as much additional information on the screen as
possible. For example you might be able to tell a stack/temporary
variable, from an instance variable from a static variable from a
constant just by looking at the font, or some slight shade of
foreground or background colour difference, e.g. dark brown,
orangey/brown, and light brown foreground. The clues may be
almost subliminal.
Variable pitch fonts are possible without giving up alignment.
They put more on the screen and are more readable than fixed pitch
fonts. Exactly how these abilities are used will change
constantly depending on your current task. The idea is to encode
information about symbols in their look.
- You could use the full colour abilities of the modern screen
to give subliminal clues, e.g. by automatically assigning a
portion of the spectrum to each package/class using pastel
shades as the backgrounds to any references to methods or
variables of that class. You could bold face the definition of
any identifier to make it stand out.
- Chris Uppal
suggested using colour coding by author. He noted that in every
shop where he had worked, there were programmers he could trust
and ones he did not. If he were attempting to understand or debug
a chunk of code it would help if he knew which stretches could be
trusted to do what they claimed to do, and which stretches
warranted more scrutiny.
- You could encode the age of code by colour. Generally the
newest code is most suspect if there is a problem. Sometimes old
code, that was done before some specification change occurred,
needs to be examined and ticked off as compatible with the new
spec. You can use colour to help keep track of which code has
been checked.
- You could ask to globally visit all references to a given
method or variable, and tick them off once each was dealt
with.
- You could do quite a bit of code writing by point and click.
There is no need to type a variable or method name, just select
it from a palette of likely variable or method names. You could
type personal abbreviations for them and have them expanded. You
could view code with your personal names or the official ones.
For example to write a FOR loop there might be boxes you fill in
for the various initializer, terminator, and incrementor
expression. They would default to int i=0,
i<n, and i++. You could give the loop a name
to be displayed when its body is collapsed. You could convert the
loop to one that ran backwards by a single click, or to one that
generated a WHILE or UNTIL loop similarly. If you ticked
enumeration all you would need to type is the name of
the Enumeration generator. The rest would be generated
for you, accurately.
- Alternate display with common functions displayed as if they
were operators using special glyphs (here simulated with
red). For example, instead of seeing:
You might see
Instead of
You might see
- The SCID would act as a Java lint, displaying suspicious or
unusual code in a special colour, and perhaps ask you for
confirmation when you inserted code of the form if (myString
== "abc") or if
(myBool = a & b).
- Show or hide explicit conversions.
- Show get/set method invocations as if they were direct access
to an associated property variable. This simplifies the syntax.
Instead of seeing:
you would see:
- Display using lines or slight shade variations in background
colour to mark the bounds of ifs and loops. Programs would look
more like flow charts, or more like text with highlights, as the
programmer preferred for the current purpose. Vertical striation
watermarks in the background would make it easier to see matching
alignments.
- What I am talking about here is not permanently
highlighting floating point operators and operands, for example,
but just for the next 10 minutes, because floating point is
the thing I am concentrating on at the moment.
The syntax colouring schemes I am familiar with are designed to
be done once and left alone once you have them tweaked the way
you like them.
For a SCID, you need not only ways to change the syntax
highlighting, but to rapidly flip between presets to
enhance the current interest and to suppress the
current irrelevancies. You also need ways to rapidly set
up new interest constellations.
I use the word highlight in a broader sense. With a GUI
and SCID you may use combinations of colour, font, size, glyphs,
background colour, hiding, folding, lines, geometric shapes, bold,
italic etc. etc. to make a certain constellation of currently
interesting features leap out at you. Different interesting
features would use different highlighting techniques to grab your
attention simultaneously.
- Similar to Bell Labs' SeeSoft
generate bird's eye graphical displays of the entire project that
use colour coded pixels to indicate such things as code age or
hot spots where a profiler determines code spends most of its
time executing. You can zoom in on interesting places to see the
actual code.
- The Xerox Parc people have been experimenting with a new way
of organising Java programs called Aspect
Oriented Programming as a way of specifying facts in only one
place declaratively rather than by sprinkling them redundantly
throughout the procedural code. Doing that makes code much easier
to maintain. Declaratively specifying a huge amount of
information that is traditionally handled procedurally is the key
to my own computer language, Abundance, whose primary design goal
was ease of maintenance. You can specify information
declaratively and automatically generate the corresponding
procedural Java bubblegum.
Also see Cristina Lopes' PhD
defense of Aspect oriented Programming.
- If you elected to view an IF as a flow chart, you could more
easily compare the true and false branches line by line side by
side. With the modern GUI's ability to rapidly pan in 2D or even
3D, we should break the mindset that programs have to be a
single linear column of text. We can pay more attention to what
actually works for the eye, not what is easiest to code. I am
pretty sure that long horizontal lines of text, stretching all the
way across the screen, so popular now, will prove to be
suboptimal. You might look for inspiration to website navigation
aids.
- Display complicated expressions in true mathematical form
much as Tek or the Microsoft equation editor would display them,
with variable sized parentheses and denominators under
numerators. To help understand expressions, you could ask bits of
them to collapse on screen. A simple version would adjust the
amount of space around each operator to indicate relative
precedence. Low precedence operators would be surrounded by more
space. Java has 13 levels of precedence, but you would not
normally find them all in one expression. The relative
distinctions in spacing could be obvious. You would not
have a fixed amount of space for each precedence level.
- Exploit new high res 1 meter square LCD panels so you have
room to see everything at once, visually navigating your way
around the entire code space, rather than peeking at it through a
toilet tube the way we do now.
- With everything preparsed, writing your own custom code
conversions would be a lot easier. For example, you might write a
translator from AWT to Swing code.
- For importing ordinary Java source code, a parser
such as JavaCC or ANTLR might be useful. See parser in the Java
glossary.
- Ability to add shortcuts to the syntax such as moods and
for-each loops. Instead of saying x.keyin(); y.keyin();
z.keyin(); you can declare keyin as a mood, and
say: keyin x, y, z; You could say things like: for
( MyArray) { MyArray.x += MyArray.y + 1; } to run through
all the elements of the array and provide an implied default
subscript inside the loop body.
You can invent your own language shortcuts. No other programmer
need view them. They would see perfectly standard Java. However,
when you viewed their code, your shortcuts would be applied so
you would not have to wade through their reams of dinosaurian
repetition. If your shortcut got in the way, you could drop it,
and instantly see standard Java again. After all, this is
software, right? It is supposed to be malleable and
comfortable. With traditional coding techniques software becomes
so rigid. It is harder to change than the supposedly hard
hardware.
- I suspect that SCIDs will create a revolution in terseness of
language design. It will come about gradually like this. A SCID
will give you the ability to temporarily hide
bookkeeping/busywork/plumbing/wiring (pick your analogy) details.
Next will come the call for the SCID to automatically generate
those details. Next will come the complete suppression of them
from the application programmer's awareness. They will be hidden
completely behind the walls, no longer part of the day-to-day
application programming language. Every time you can suppress 30%
of the busywork, a whole new set of patterns emerge that were
formerly obscured by all that fussy detail. Suddenly, you
discover new ways to more tersely specify your desires to the
computer. You see new levels of
bookkeeping/busywork/plumbing/wiring that can be similarly
hidden, revealing still more deep structure. This is not just
speculation. I have seen this process in the evolution of my own
Forth-based language, Abundance.
- SCIDs will also have another influence on language design.
Right now we are stuck in a mindset that a computer language is a
linear stream of vanilla 7-bit ASCII characters. SCIDs will
loosen that up. We are already seeing that with tools like the
Symantec layout editor. A program can be a diagram in 2D. Font
style can have semantic meaning. Noam Chomsky might put it
this way, Programs may have many temporary surface
representations of a single deep structure. We will see multiple
alias names for variables, multilingual variants of the Strings
that can be flipped with the click of a menu item.
- A SCID could potentially store a lot more information, (not
normally visible) than a text file representation would. For
example, you could fairly easily automatically record who changed
each individual element in each line of code, why and when and as
part of which job layer. See dynamic version control for
what I mean by job layer. Global renamings would be labelled as
such, not as a million separate little transactions. This
information could also be used by the boss to track precisely
what a telecommuting employee did during the week.
- For some notes on how a SCID might be implemented so that
many users could be updating the same code simultaneously from
several globally dispersed sites, again see dynamic version control.
- SCIDs are not a totally mythical beast. Smalltalk and Logo
programmers have been using them for a long time. IBM's Visual
Age Java compiler uses a SCID. SCID users are very enthusiastic
about them, even though I think the current crop of tools has
just begun to exploit the possibilities.
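One SCID payoff mentioned above, the always-up-to-date XREF, falls out almost for free once calls are recorded at edit time rather than by periodic source scans. A toy in-memory sketch, with all class and method names invented:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy SCID cross-reference: "which methods call this method?"
public class Scid {

    private final Map<String, List<String>> callersOf = new HashMap<>();

    /** Recorded as the programmer enters the call, not by a later scan. */
    public void recordCall(String caller, String callee) {
        callersOf.computeIfAbsent(callee, k -> new ArrayList<>()).add(caller);
    }

    /** Always current, because the source lives in the database. */
    public List<String> whoCalls(String method) {
        return callersOf.getOrDefault(method, new ArrayList<>());
    }

    public static void main(String[] args) {
        Scid scid = new Scid();
        scid.recordCall("Moses.sin", "Math.abs");
        scid.recordCall("Zomblat.render", "Math.abs");
        System.out.println(scid.whoCalls("Math.abs"));
    }
}
```

The same table, queried the other way, answers "which methods does this method call", and a second map over fields gives the variable XREF.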
JavaDOC
tools
iDoc does a reasonable job of proofreading JavaDOC, but
maintaining JavaDOC is very tedious. Here are some tools you
might write.
- A propagator. You write JavaDOC for an interface, then you
want to propagate it to the classes that implement that
interface. Ideally it could be interactive, making it possible to
replace, keep, or merge.
- Parameter Boilerplate. If parameters are uniquely named,
every place you see the parameter formType in any method,
there is canned boilerplate that goes with it, plus perhaps a
sentence of something custom for that particular method. You want
a way of being able to globally apply such boilerplate and to
update such boilerplate. Ditto for boilerplate explaining
exceptions.
- JavaDOC needs tidying. To be maximally readable, the
parameters must be in the same order as in the method. There
should be a blank line between each @param. Exceptions should be
listed in alphabetical order both on the method header and in the
JavaDOC. The parameters and text should line up in columns as if
the text had been placed in an HTML header.
- JavaDOC gets turned into HTML. The code should be verified as
valid HTML, perhaps just simple tag pair matching. I'm not sure
if characters like < are allowed or if you must spell them out
as &lt;.
- A parser such as JavaCC or ANTLR might be useful. See parser
in the Java glossary.
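The ordering check in the tidying list above could start as small as this. It is a regex sketch only; a real tool would parse the method header with JavaCC rather than trust regexes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: verify @param tags name the parameters in the same order
// as the method header declares them.
public class ParamOrderCheck {

    /** Pull the parameter names out of the @param tags, in order. */
    static List<String> paramTags(String javadoc) {
        List<String> names = new ArrayList<>();
        Matcher m = Pattern.compile("@param\\s+(\\w+)").matcher(javadoc);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    /** True when the JavaDOC order matches the declared order. */
    static boolean inOrder(String javadoc, List<String> declared) {
        return paramTags(javadoc).equals(declared);
    }
}
```

A tidier would reorder the offending @param blocks instead of merely complaining.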
Linkcop
Clone
Linkcop is a program that
wanders its way through your website checking internal links and
optionally checking external ones.
Your clone would be a little smarter than the original.
When a link automatically takes you somewhere else, your clone
should correct your original HTML to point directly to the new
location. It has to be a bit clever. You don't want to replace
valid URLs with 500 character CGI references that will change the
next minute.
When a link is broken, your clone should try to fix it for you.
For example if "http://oberon.ark.com/~Zeugma" has
disappeared, it should try "http://www.zeugma.com" and
"http://www.zeugma.org". Failing that it might be able
to make some guesses by combing the URL names in some search
engine results. It marks its corrections with a special GIF.
Someone can then manually check these corrections out.
You should be able to give it a list of pages to check (or
wildcards), a list of pages to avoid (or negative wildcards), and
whether you want the indirectly-linked pages also checked (/I)
and whether you want subdirectories checked (/S). So for example,
you might want it just to do a quick check of all your offsite
amazon.com links, just checking internal links enough to effect
that, i.e. not checking any gif links, internal # links, or other
offsite links. It takes a long time to manually research a dead
link. You don't want to be told again about dead links you
already know about.
You should be able to control it completely either from the
command line or from a GUI.
Before declaring a link dead it should probe it several ways:
- First with HEAD. This is the most efficient since the
website returns just the header, not the whole document.
- Next with GET. Some sites won't return just headers. You have
to ask for the whole page. You can abort once the header portion
arrives.
- Try polling again after all the other pages have been polled.
This may give the site time to clear a temporary backlog or
hardware problems.
- Optionally just put the non-responding sites in a list to be
automatically rechecked on several successive days. Only if URLs
stay dead do you bother notifying the user about them. Of
course, if a site is up, but there is no such document, there is
not much point in retrying or delaying notification of the
problem.
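The HEAD-then-GET probing order might be sketched like this. The timeouts are my own choice, and real code would also queue non-responders for rechecking on later days:

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch of probing a link before declaring it dead.
public class LinkProbe {

    /** Returns the HTTP response code, or -1 if the site is unreachable. */
    public static int probe(String url) {
        int code = request(url, "HEAD");
        // some sites refuse HEAD; ask for the whole page instead
        if (code == -1 || code == HttpURLConnection.HTTP_BAD_METHOD) {
            code = request(url, "GET");
        }
        return code;
    }

    private static int request(String url, String method) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod(method);
            conn.setConnectTimeout(10_000);
            conn.setReadTimeout(10_000);
            int code = conn.getResponseCode();
            conn.disconnect(); // we only needed the header
            return code;
        } catch (Exception e) {
            return -1; // unreachable, not merely a 404
        }
    }
}
```

A -1 result would go on the recheck list; a 404 from a live site gets reported right away.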
Lint for
Java
LINT is a program for C/C++ that points out potential errors in
source programs, things that are formally syntactically correct,
but which might do something different from what the programmer
intended. We need something similar for Java. Here are some
things it might warn you about:
- Missing break in case clause. Default fallthrough was a
dangerous language feature of C that Java copied.
- Variables declared but not used.
- Expressions like this: secs << 32 | frac &
0xffffffffL; where the meaning of the code heavily hinges on
the precise differences in operator precedence of operators other
than +-*/.
- Using an object name instead of the class name to qualify a
static method invocation.
- Using a method with the same name as the class. It was
probably intended as a constructor.
- Checking switch statements for consistency. All the
enumerated values of a set should be accounted for. Unfortunately
the Java language does not yet provide a mechanism to group
static final constants into enumeration sets.
- Check for setting a temporary variable to a value that is
never subsequently referenced.
- Check that get/set methods do indeed get/set some information
from the object.
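A related check worth adding is the classic string comparison blunder, which compiles cleanly but compares object identity rather than characters:

```java
// Why a lint should flag if (myString == "abc"): == compares object
// identity, so a string built at runtime fails even when the
// characters match.
public class StringCompare {

    public static boolean suspicious(String s) {
        return s == "abc";      // lint should warn here
    }

    public static boolean intended(String s) {
        return "abc".equals(s); // what the programmer meant
    }

    public static void main(String[] args) {
        String built = new String("abc"); // same characters, new object
        System.out.println(suspicious(built)); // false
        System.out.println(intended(built));   // true
    }
}
```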
Localisation Tool
There are two basic approaches to localisation (preparing a
program in different languages or for different countries).
- Hand the program over in its entirety to some speakers of the
native language, and let them adjust it -- Strings, layouts,
colour schemes etc. When you issue an update, you do the whole
process essentially from scratch.
- Code every string in the program as some call to a function
with a meaningless integer argument string id that selects the
localised string. The code is bulky and unreadable.
I propose a hybrid of the two approaches.
For your application, you require a database of strings and a
tool for both programmers and translators to update the database.
The database essentially consists of various translations of each
String, all tied together via a computer-assigned ID.
There are several special "languages":
- programmer: the incoherent English text with atrocious
spelling originally written by the programmer.
- notes: where the programmers can put notes to help
translators.
- questions: where the translators can leave questions
for the programmers about the precise meaning of the strings that
need translation.
There are additional fields:
- locale: which language this String belongs to.
- upToDate: a boolean associated with each String. If it is
false, it means the translation needs to be reviewed.
- plural: This is the variant of the string to use if the
subject of the string is plural.
- feminine: This is the variant of the string translation to
use if the subject of the string is a female.
- initials: who translated the String (or provided the original
wording).
- invariant: a boolean. If true this String, perhaps a filename,
must remain the same in all locales.
Programmers manipulate the database primarily by changing the
Java source code. A custom parser periodically runs through the
Java code finding new strings, and finding ones that could be
discontinued.
Translators primarily manipulate the database with a GUI that
lets them rapidly find non-up-to-date string translations in a
given language. Various translators can take copies of the
database, work separately and later merge their work.
When it comes time to ship a localised application, there are several possible
techniques, all of which work off the same database of Strings.
- All the Strings are placed end to end in UTF in a giant flat
resource file. The string literals are replaced by method calls
with an integer ID, which is translated internally into a byte
offset in the giant flat file, by an array lookup, one array per
language.
- As above, but some subset of the text of the various languages
also appears in the Java source code as comments.
- As above, but with the popular strings of the current
language cached in a Hashtable in RAM.
- Several variants of the program are created, each with the
string literals replaced with those of a specific target
language. This generates the fastest, most compact code, but
makes on the fly switching harder.
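The flat-file technique above can be sketched as follows. The class name is invented, and for simplicity the "giant flat file" is stood in for by an in-memory String; a real implementation would read the UTF file and hold one offset array per language.

```java
public class LocalizedStrings {
    // All strings stored end to end; offsets[id] .. offsets[id + 1]
    // bounds the string with the given integer ID. One offset array
    // would exist per language, all sharing the same IDs.
    private final String flatFile;
    private final int[] offsets;

    public LocalizedStrings(String flatFile, int[] offsets) {
        this.flatFile = flatFile;
        this.offsets = offsets;
    }

    // Look up a string literal by its computer-assigned ID.
    public String get(int id) {
        return flatFile.substring(offsets[id], offsets[id + 1]);
    }
}
```

The lookup is a single array index plus a substring, which is why this technique is both fast and compact.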
Special problems:
- Even in English, you may need to customise strings to
account for the gender or number of the various parties
described. In French, you must also account for the gender of the
implied speaker. In English, it is fairly straightforward to
write the strings using male gender, and then use custom touch-up
code to convert the gender. E.g. "Your son was late for school
ten times last month. If he does not shape up, we will have to
recommend him for home schooling." could be mechanically touched
up to "Your daughter was late for school ten times last month. If
she does not shape up, we will have to recommend her for home
schooling." In more complex cases, you will have to use the notes
section and translate the string several ways, and have
application code select the correct string. There is no need to
specially mark the gender-specific words. If the generic touch-up
code does not work, provide separate strings. Don't waste your
time specially encoding every gender-specific word.
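The mechanical touch-up might start as simply as this sketch. The class name and the tiny replacement table are hypothetical; real touch-up code would need a much larger table and handling for capitalised words.

```java
public class GenderTouchUp {
    // Mechanically convert a male-gender template to female wording
    // using whole-word replacements. \b prevents rewriting words
    // that merely contain "he" or "his", such as "the" or "this".
    public static String toFeminine(String s) {
        return s.replaceAll("\\bson\\b", "daughter")
                .replaceAll("\\bhe\\b", "she")
                .replaceAll("\\bhim\\b", "her")
                .replaceAll("\\bhis\\b", "her");
    }
}
```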
- A variable may have to be introduced into the wording. e.g.
"Insert the CD-ROM into drive R:". The R: may need to
change to some other letter. This gets even more complicated when
there are several variables. The order may be different in other
languages. I suggest handling these by embedding variables in the
strings as ad hoc pseudo-HTML entities of the form &drive;. The
application code provides a single substitution string of the
form: drive="R:" color="brown"
where the colour brown is itself a localised string.
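A minimal sketch of that substitution, assuming the &name; entity convention described above; the class and method names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VariableSubstituter {
    // Replace pseudo-HTML variables like &drive; in the template,
    // driven by a substitution string of the form: drive="R:" color="brown"
    public static String substitute(String template, String subs) {
        Matcher m = Pattern.compile("(\\w+)=\"([^\"]*)\"").matcher(subs);
        String result = template;
        while (m.find()) {
            // group(1) is the variable name, group(2) its value.
            result = result.replace("&" + m.group(1) + ";", m.group(2));
        }
        return result;
    }
}
```

Because the variables are named, the translated template can place them in any order the target language requires.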
- Embedded variable counts e.g. "Your son was tardy 4
times last month" need special handling for singular and
plural. I suspect you could write custom touch-up code to clean
up the specific sorts of language you use, e.g. "1
times" could be replaced by "once". Where the
touch-up code fails, polish the touch-up logic, or use separate
strings rather than driving yourself crazy trying to specially
mark every potentially plural word.
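The singular/plural touch-up could begin with something as small as this sketch; the names are hypothetical and a real version would cover many more patterns.

```java
public class PluralTouchUp {
    // Clean up embedded counts after substitution, e.g. turning
    // "1 times" into "once". A sketch of the touch-up idea only.
    public static String touchUp(String s) {
        return s.replaceAll("\\b1 times\\b", "once")
                .replaceAll("\\b1 time\\b", "once");
    }
}
```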
Mailreader/Newsreader
The problems with existing mailreader/newsreaders are:
- Spam.
- People posting anonymously, but pretending to be actual
people. When people post anonymously, even under an assumed name,
that should be clear to all who read them (or who decide to
ignore them.)
- Forging posts. I have even seen forged confessions of child
molesting.
- Lack of guaranteed delivery.
- Lack of privacy.
- Very inefficient use of bandwidth, with needless quoting of 8-bit
characters and no compression.
- Confusion in ascribing quotes.
In the new scheme, all mail is compressed, digitally signed and
encrypted. All newsposts are compressed and digitally signed.
Even completely anonymous posts are signed with a special
anonymous digital signature, and even noms-de-plume are digitally
signed. All transmission is 8-bit. The whole business of
exchanging and verifying keys, encryption and decryption,
compression and decompression is totally automatic and
transparent. The user is totally unaware of it.
You can track a piece of mail just the way you can track a
parcel, to discover if it has been received at the recipient's
ISP, at the recipient's computer and at the recipient. You don't
have to do anything active to track mail. You just see little
status icons changing. If a piece of mail is overdue, it will
show up in a way that attracts your attention.
You can enclose programs with your email or posts. They don't
actually necessarily get sent until the recipient asks for them.
With one click the recipient can install that program.
Messages can contain all manner of HTML features. All browsers of
this style of EMAIL can handle that, so there is no need to track
whether the recipient can accept HTML.
Like ICQ, someone cannot send you mail without your prior
permission. They can't send you mail because they don't have your
public key to encrypt the mail. You receive ONLY encrypted mail.
They can only send you a tiny request-permission-to-send message.
By periodically changing your public key, and sending the new key
only to those you want to continue receiving mail from, you
effectively cut off any pests you may have unwisely given your
key to. The basic idea is that no one can send you stuff you don't
want. It does not even get into your ISP's mail server. Spammers
and other pests cannot spend your datacommunications resources or
your time.
Your address book automatically updates. When you move, change
ISPs etc, everyone in your address book authorised to send you
mail gets informed, and their address books automatically update.
Eventually the scheme would be used to automatically inform all
the magazines you subscribe to of your new mailing address.
Whatever information you choose to broadcast, automatically stays
up to date in other people's address books, including possibly
personal details like height, weight, children's names, credit
card numbers, bank account numbers. You would disclose different
amounts of this information to different people. Once you mark it
disclosable, it would stay up to date automatically in everyone
else's address book.
Attribution (quoting and tracking who said what) would be handled
technically by not embedding quotes in messages. Instead a
reference to the original message and an offset/length would be
embedded in the message. When the viewer expands this, it would
look like an ordinary quotation, but with guaranteed accuracy and
a guaranteed accurate author. There is no way to put words in
another's mouth or ascribe them to the wrong person.
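The reference-based quotation might be represented roughly like this; the class and field names are invented for illustration, and the message store the viewer consults is assumed.

```java
public class QuoteRef {
    // A quote is never embedded. Instead the message carries a
    // reference to the original message plus the offset and length
    // of the quoted span, guaranteeing accurate attribution.
    public final String messageId;
    public final int offset;
    public final int length;

    public QuoteRef(String messageId, int offset, int length) {
        this.messageId = messageId;
        this.offset = offset;
        this.length = length;
    }

    // The viewer expands the reference against the original body,
    // fetched from the (hypothetical) message store by messageId.
    public String expand(String originalBody) {
        return originalBody.substring(offset, offset + length);
    }
}
```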
You can file your mail and newsgroups, tagging them, possibly
automatically, with many possible keys. You don't need a strictly
hierarchical system. There is an SQL database lookup so you can
retrieve messages without having to linearly search the entire
message base.
Native Class Amanuensis
Let us say that you wanted to write a native class to access
the volume serial number of a drive. This is just a couple of
lines of assembler. However, to glue that code into Java would
require a major intellectual effort. You would have to learn JNI
and Microsoft RNI, and all the steps. 95% of what you read would
be irrelevant to your tiny task. By the time you did another such
project, it would all be a haze, and you would have to start your
learning from scratch.
I suggest writing an amanuensis that generates all the glue
code, control files, headers etc., and a set of instructions on
just what needs to be done manually.
You tell the Amanuensis the name of your native class, the
methods and signatures, the native platforms you want to support,
and it generates all the necessary bubblegum, including some C
and assembler skeleton programs that just access the parameters,
create a dummy return value and return. They are heavily
commented, so that most C or assembler programmers would be able
to just look at those comments, flesh out the real code and go,
without any understanding of JNI/RNI.
To start, pick just one platform, say Win95 and just JNI, and
generate only the C skeleton. Design your classes so that you or
others can plug in delegates later to handle other variants.
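The amanuensis's input could be as small as this hypothetical native class declaration; the class, method and library names are invented, and the C/assembler body is exactly what the generated skeleton would leave for you to flesh out.

```java
// A hypothetical native class: all the programmer writes by hand.
// The amanuensis would generate the JNI header, the commented C
// skeleton, and the build/control files from this declaration.
public class VolumeInfo {
    // Implemented in C or assembler on each target platform.
    public static native int getVolumeSerialNumber(String drive);

    static {
        // Load the (hypothetical) native library holding the C code.
        System.loadLibrary("volumeinfo");
    }
}
```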
Password Protector
This is a sort of software wallet for passwords. I have dozens of
passwords and CD-keys. There is no way I can remember them all.
It is not wise to have them all recorded in a flat file anyone
could snoop at, or on a piece of paper in a desk drawer. This
program remembers your passwords, and stores them in a secure
encrypted way.
The program has two types of entry: regular, and secure. In
regular mode the passwords/CD-keys are displayed. You never have
to type blind. In secure mode, password/CD-keys are never
displayed on the screen, even when you are typing them. You have
to type passwords twice to be sure you typed correctly. Every
field has its own radio button to determine type.
It works like this. You make up a new entry: give it a label, say
"Sun Developer Site Access". Then you can create
additional fields and name them, e.g. login, password, and fill
in the blanks. You can select the most common types of field by
radio button: login-id, password, CD-KEY. Beside each field is a
radio button for selecting display/hide i.e. regular/secure.
You can click on an entry and have it sucked up into the
clipboard for pasting elsewhere. Ideally you would co-ordinate
with Clipmate, or use a hot key, or do something very clever so
that you could have a PowerPaste, pasting each field of the entry
in succession.
The program encrypts the entire password file, and makes sure no
unencrypted copy of it is ever left unwiped on disk. You gain
access to the file with a common pass phrase. If you forget the
passphrase there is an optional hint, hopefully composed cleverly
to help you remember, but not enough to give a snoop any
assistance.
Note that the passphrase or its HashCode is never recorded
anywhere in the file. To ensure snoops cannot scavenge, remember
to wipe (clear to 0) any array or temporary file containing
unencrypted data or the passphrase before exiting, or after a
time delay of inactivity.
How do you do the encryption?
I propose a few possible techniques. You might like to think up
others:
- Weak encryption: Take the HashCode of the passphrase and use
it to seed the random number generator. Use the random generator
to generate a stream of bytes. XOR these with the file to either
encode or decode it. See Random
Numbers in the gotchas essay.
- Stronger encryption: Bone up on the PGP (Pretty
Good Privacy) private/public key algorithm. Use
the PGP public key to encrypt the file and the private one to
decrypt it. There may be an API for the code in PGP itself. Do
some digging.
- Read up the "security" entry in the Java glossary.
Use one of the more advanced encryption techniques such as
Blowfish, whatever you can find code for.
- Check out JCE,
the Java Cryptographic Extension API spec
from SUN, with an implementation for the USA from SUN and one for the
rest of the world from ABA.
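The weak XOR technique can be sketched like this; the class name is invented, and note that java.util.Random is not cryptographically secure, which is precisely why this counts as weak encryption.

```java
import java.util.Random;

public class XorCipher {
    // Seed the random number generator with the passphrase's
    // hashCode and XOR the data with the resulting byte stream.
    // Applying it a second time with the same passphrase restores
    // the original, so one method both encodes and decodes.
    public static byte[] crypt(byte[] data, String passphrase) {
        Random r = new Random(passphrase.hashCode());
        byte[] out = new byte[data.length];
        for (int i = 0; i < data.length; i++) {
            out[i] = (byte) (data[i] ^ r.nextInt(256));
        }
        return out;
    }
}
```

Remember to wipe the plaintext arrays afterward, as described above.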
The program would have a simple integrated backup/restore that
does nothing more elaborate than copy the encrypted file to or
from floppy.
If you have access to a Java powered hand-held unit, you might
develop this as an application for it. Ideally it could send the
codes into apps via a special keyboard intercept attachment or an
IR beam. It might even form the basis of a general lock system to
handle all IR activated locks. You could invent a cheap IR lock
that sends a challenge phrase that needs to be encrypted and sent
back. The handheld unit contains all the public and private keys
necessary to deal with all the locks in your life. It might even
be useful to simulate something like a car remote to prevent
children playing with the remote from accidentally starting the
car in a closed space and gassing the family to death.
Prebranded Software rental with auto
updates
When you buy software online, you don't have to type in your
name, address and shipping information every time. The idea is to
create a scheme so attractive and so cheap that all vendors will
adopt it. See my essay on Internet Futures for more information
on how automatic updating
works.
Software arrives prebranded (exactly where the branding occurs is
immaterial. The important point is the user takes no manual
action to brand.) Normally software will arrive in two parts -- a
large publicly, freely accessible file of encrypted program and
data files, and a private custom-built small key file containing
branding information, a key to unlock the application, and
crucial parts of the application code. The key file would be
encrypted with the customer's public key.
If anything should go wrong with the download, the software,
without any manual assistance, can restart it. The software is
automatically installed once it arrives without asking any
questions. If any special questions are needed, you answer them
before you even start the download. Once you hit
"PURCHASE", you can go for coffee, and come back to a
fully installed package.
Updates automatically arrive and are automatically applied unless
you explicitly freeze the system. Presume a 24-hour ADSL Internet
connection.
There is a wallet scheme (gas tank), so you can rent programs by
the hour, month etc. There should be flexible payment schemes, to
also include support, purchased à la carte or in bundles. There
should also be provision to pay by usage -- e.g. documents
printed, created, scanned, etc.
The common wallet means there is no need for frequent financial
transactions. The wallet sends information to a central clearing
house periodically about your usage. You get one entry a month to
your credit card, even if you have rented/purchased dozens of
packages.
You can monitor the gas consumption of any program at any time.
Your wallet will warn you if you have inadvertently signed up
with a guzzler.
Assume vendors will want to trash their competitors'
installations. Assume people will try to rent programs without
paying, or pirate purchasable software. Figure out how to foil
them, in a way that has zero inconvenience for the legit user.
Consider how unique electronic serial numbers on CPUs might make
your job easier. Consider how an I/O board with a defective RAM
chip on it might serve as a unique id. Consider how a CPU with a
private/public encryption key could be used.
Eventually the scheme might be used for all manner of electronic
commerce such as paying bills, buying toys, wiring people money
overseas, paying for an infinite jukebox, getting stuff printed
or typeset on a fancy nearby printer in the basement of your
apartment building or at the nearby instaprint shop, music
channels, digital TV, digital movies...
Advantages to Renting Rather Than Buying Software
- Software sales means the vendor is primarily concerned
about sucking in first time buyers with flashy but useless
featuritis.
- Software rental means the vendor is primarily
concerned with keeping his existing customers happy by providing
tools that work smoothly and efficiently.
- Technical support is more reliable and less expensive since
every customer automatically has the latest version.
- With rental, I can afford to have thousands of programs on
tap that I only rarely use.
- Software rental with automatic updates makes piracy much
harder. The constantly updating programs are a moving target, and
there is always a danger the pirate will expose himself if
products automatically go out seeking updates. Every program can
be booby trapped in a million ways to stop working if it is not
frequently updated. The booby traps can change every month.
Public key encryption techniques can be used that will make
piracy almost impossible. It will get even more difficult for
pirates when everyone has a 24-hour connection to the Internet.
Less piracy means a bigger market. A bigger market means more
competition. More competition means lower prices for
non-pirates.
- With rental, I can thoroughly try out programs over a long
time without having to commit to them.
- I can cheaply drop a poor vendor at any time and switch to a
competitor. This gives me clout with the vendor.
- Delivering and installing updates is 100% the vendor's
problem. I don't have to waste any of my precious time with it.
If he mucks up, I get another vendor.
- Delivering software and updates electronically is much
cheaper than using retail distribution channels. Eventually this
will mean lower prices.
- With piracy all but eliminated, and try before you buy very
easy to administer, and automated updates reducing the support
burden, smaller software development companies will be able to
compete more effectively with the bigger companies that currently
rely on expensive print advertising and mass retail outlets. This
means more competition, higher quality products and lower
prices.
- Without any paper work or computer busywork you could install
40 extra copies of an application to help you through a rush
job.
- Payment can be very flexible, based on usage, time, flat
rates, bulk discounts etc. You don't have to plan far in advance
the way you do with purchases. You can make adjustments with
seconds' notice.
- The vendor has a steady source of rental income to pay for
upgrades and bug fixes. He does not need to provide them out of
the goodness of his heart. If he does not provide them, he stops
getting paid. Currently, vendors that take the money and run are
financially most rewarded.
- Consider the analogy of health club memberships. You get a
much better service if you pay by the month than if you pay all
up front for a lifetime membership.
Regex
Proofreader
If you don't know what regular expressions are, have a look in
the Java glossary. Basically they are search/replace
patterns.
If you have ever used Funduc
Search/Replace or SlickEdit or any program
using regular expressions, you know what a beast they can be to
proofread. You are often quite amazed by what they do to your
files when you turn them loose.
The main problem is quoting. Does < mean the literal
character < or is it a command? In Funduc
Search/Replace it is a literal in the search argument, but a
command in the replace argument. Arrgh! Every implementation of
regexes uses a slightly different set of commands and command
characters.
If a character is reserved as a command, when you want to use
it literally, you must precede it with a \. This turns
expressions into unreadable nightmares.
A very simple regex proofreader simply colour codes each letter
as a literal or as a command. For example in Funduc search
expressions the following characters need to be preceding with
\ when used as literals:
- + * ? ( ) [ ] \ | $ ^ !
and the following characters in replace expressions need to be
preceded with \:
You must not precede any other literals with \.
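The colour-coding pass reduces to classifying each character as command or literal. A sketch, assuming the Funduc search-side command set listed above; the class name is invented, and the output uses 'C' for a command and 'L' for a literal (quoted command characters are shown unquoted, as literals).

```java
public class RegexColourCoder {
    // The characters that act as commands in Funduc search
    // expressions, per the list above; everything else is a literal.
    private static final String COMMANDS = "+*?()[]\\|$^!-";

    // Classify each visible character of the regex. A \-quoted
    // command character is emitted once, as a literal, with the
    // backslash hidden -- the "unquoting" display described above.
    public static String classify(String regex) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\' && i + 1 < regex.length()) {
                sb.append('L');  // the quoted character is a literal
                i++;             // skip past it
            } else if (COMMANDS.indexOf(c) >= 0) {
                sb.append('C');
            } else {
                sb.append('L');
            }
        }
        return sb.toString();
    }
}
```

Rendering would then paint 'C' positions red and 'L' positions blue.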
You might display command characters in red and literal
characters in blue.
A slightly cleverer proofreader would let you hide the \
characters (except those preceded by another \) and just
display the raw colour-coded literals. You could think of this as
unquoting. The unquoted regex expressions would then look
much more like the actual strings they are intended to match.
What then would you do with tab=\t, cr=\r and
newline=\n? You might cook up some special glyphs to
represent them more literally, much the way MS Word can be
persuaded to make visible the spaces, line and paragraph ends
with special symbols. You might just use the letters t, r and
n but display them in green.
If you can view them that way, why not type them
that way? Instead of worrying about which letters need to be
quoted, just use ctrl-R to load your "pen" with
red ink, ctrl-B to load it with blue ink,
and ctrl-G to load it with green ink -- or tap one
of three coloured inkwells with your mouse, as if dipping an old
goose quill pen. Just type your literals naturally and behind the
scenes, your Applet will insert the \ characters as
needed. If you typed an invalid command in red-command mode, it
would just beep at you. You could even use the Enter key
to more directly represent \n or the \r\n pair.
Everything so far is pretty easy. You don't need to know anything
much about regex expressions to write such a simple proofreader.
The only problem is Java does not support rich text -- multiple
font colours in a TextArea. Unfortunately you can't handle the
colours by generating HTML <FONT COLOR= commands and
having something render it. You need to handle your colours at a
lower level -- like a Canvas.
For a more advanced proofreader, things get a little tougher. The
user should be able to submit some sample strings and see what
the regex would do to them. Do they match the pattern? What do
they get converted to? You have to simulate Funduc
Search/Replace, SlickEdit or whatever other Regex engine you are
proofing for. If you dig about the web, you may find source code
to do this for you that you could cannibalise. You have one
advantage over the authors of the commercial regex programs. Your
parsing and scanning code can be very slow and no one will mind.
The user should be able to maintain little libraries of test
strings that he can quickly test out. The before and after views
should use colour to highlight the changes.
Once you have that working, you can try something even
tougher: generate a sample set of strings to feed to the
regex that exercises each part of the regex expression. Some
should match, some should not. By examining the results on that
test set of strings, the author of the regex expressions should
have a pretty good idea of all the things that regex will do when
it is turned loose in the real world on files.
If your regex engine is sufficiently fast, you might consider
turning your proofreader into a full blown search/replace engine,
that can run scripts of commands or accept them one at a time
via a GUI. You might also consider writing your own text editor
with this as the crown jewel.
If you want to tackle a simplified version of this project, try
creating a proofreader for Java source code string literals, so
that instead of seeing "C:\\mydir\\Myfile.txt" you
would see "C:\myDir\Myfile.txt". Or
"abc¶def" instead of "abc\ndef".
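The string-literal display might begin with an unescaper like this sketch; the names are invented. It shows \n as a pilcrow (¶) and collapses doubled backslashes, leaving other escapes untouched.

```java
public class LiteralUnescaper {
    // Convert the source text of a Java string literal into the
    // display form described above: \\ becomes \, and \n becomes
    // the pilcrow glyph standing in for a newline.
    public static String display(String literal) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < literal.length(); i++) {
            char c = literal.charAt(i);
            if (c == '\\' && i + 1 < literal.length()) {
                char next = literal.charAt(++i);
                switch (next) {
                    case 'n':  sb.append('\u00b6'); break; // pilcrow for \n
                    case '\\': sb.append('\\');     break; // \\ -> \
                    default:   sb.append('\\').append(next); // leave others
                }
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```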
Reimplement Standard Classes
This idea came from Mark Thornton.
Reimplement java.lang.ThreadLocal so that synchronization only
occurs when initially creating a value instance and when cleaning
up old values (associated with dead threads).
Or more generally reimplement just about any standard class for
greater efficiency, while of course retaining the same public
interface. Try running a source code debugger tracing the actions
of any standard class. Often the needless overhead will become
clear. Profilers will also help.
Read Bill
Wilkinson's Deja News posts from comp.lang.java.* over the
last year on the subject of serialisation. Reimplement
serialisation in a way that would please Bill.
Devise a test suite to prove your classes behave the same as
Sun's, or that clarify any deliberate differences, and to
benchmark performance. Post on comp.lang.java.programmer and send
copies to Sun.
SAX File Transfer
Protocol
SAX is a more efficient file transfer protocol than HTTP or FTP,
which are based on TCP/IP. TCP/IP spends half its time acking
every packet, which wastes bandwidth and slows things down.
SAX is not an acronym. It was originally coined as just a mix of
FAX and SEX, since it was intended to replace inefficient FAXing
of documents with a sexy new protocol. Here are the key features of
the SAX protocol. I will leave it up to you to fill in the
details. The protocol itself is much simpler than other
protocols:
- You can think of SAX as a sliding window protocol, like
Kermit or Zmodem, where the window is the entire file being
transferred.
- SAX does not ack individual packets. It only acks every 60
seconds or so as an are-you-there? I'm-alive heartbeat and when
file transfer is complete. Either end can ask the other at any
time "Are you still alive?" or "What is
your view of the work still to be done?"
- Packets are fixed length. They carry with them the record
number where they fit in the file. You multiply the record number
by packet data length to get the offset where they fit.
- Every SAX packet is self-identifying - where it fits into
the file being transferred. Packets may arrive out of order, or
be repeated without any damage. Good packets are never discarded
the way they are in most other protocols.
- Sender and receiver maintain bitmaps of which parts of the
file they think have been successfully sent. When a sender sends
a packet, it marks that chunk done. When the receiver receives a
packet successfully it marks that chunk done. When the sender
receives an ack, it marks unreceived chunks as undone. The file
transfer is complete when the receiver acks with an empty to do
list.
- An ack is a list of all the parts of the file that
have not been received yet. The file transfer is over when the
receiver's ack says all has been received, that there is no more
work to do. The ack consists of a complete list of the parts of
the file that still need to be transferred according to that end,
given as a list of ranges and individual chunk numbers.
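The ack as a list of missing ranges might be sketched like this; the class and method names are invented, and the ranges are rendered as strings only for clarity.

```java
import java.util.ArrayList;
import java.util.List;

public class SaxAck {
    // Build an ack from the receiver's bitmap: the chunk ranges
    // still missing, e.g. "3-5" and "9". An empty list means the
    // transfer is complete.
    public static List<String> missingRanges(boolean[] received) {
        List<String> ranges = new ArrayList<String>();
        int i = 0;
        while (i < received.length) {
            if (!received[i]) {
                int start = i;
                while (i < received.length && !received[i]) i++;
                int end = i - 1;
                ranges.add(start == end ? String.valueOf(start)
                                        : start + "-" + end);
            } else {
                i++;
            }
        }
        return ranges;
    }
}
```

On receiving such an ack, the sender simply marks the listed chunks undone in its own bitmap and resumes sending.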
- The SAX protocol must be built atop IP or UDP, not TCP/IP
sockets, to avoid the automatic acks on every packet. Have a look
at the java.net.DatagramPacket and java.net.DatagramSocket
classes.
- Any packets received with bad checksums are just ignored. The
Datagram classes should handle creating and verifying checksums
and discarding bad packets for you.
- SAX is restartable in the event of disconnect. The receiver tells
the sender which bits of the file have not arrived yet. The file
transfer starts the same way. The receiver keeps its bit map in
the event of disconnect. The server rebuilds its bit map from the
list of unfinished work the receiver sends it to restart the
transfer.
- If the file is not already compressed, it is compressed prior
to starting the transfer. Compressed files have a jar, cab or zip
signature in the first few bytes. If the file used some unknown
compression, SAX would recompress/decompress it, which would do no
harm other than waste a few CPU cycles.
- SAX can also be used to transmit a stream of data, but it
requires that a log be kept back to the last checkpoint.
- Pacing, that is avoiding sending packets so fast that many
are lost, or too slowly, is not strictly part of the protocol.
The sender can be clever, if it wants, by requesting an ack from
the receiver. If it notices a pattern of lost intermediate
packets, it can lower its sending rate. If it sees a pattern of
perfect transmission, it can raise it. Perfect pacing speed is
not necessary. It is free to experiment on the fly with different
sending rates to see what works best. It may optimise for speed
or efficiency (ratio of packets received to sent) or some blend.
It may take into account its own load from other customers.
- Packets in the pipeline cause a complication. When the sender
requests an ack, the receiver sends back its state. However
several packets could be in the pipeline on their way to the
receiver. When the sender gets the ack, it will look as if those
packets had been lost. The protocol per se has nothing to say
about this problem. However, how can the sender deal with it?
- Ignore the problem. It will all come out in the wash. On the
next ack, these packets will be marked received. If you prefer to
send new packets rather than resend old ones, the problem will
most likely be resolved before you get around to resending the
questionable packets.
- Mark these unacked recently-sent packets as state
unknown. When sending packets, avoid sending state unknown
ones until you have sent any packets known unambiguously to be
unsent. Only send state unknown packets when you
have nothing better to do, waiting for an ack.
- If your acks don't get a response within X seconds you retry
a few times, then give up and start a new connection. For
heartbeat acks, where you are primarily just making sure the
other end is still alive, you keep sending data packets while you
await the ack. The sender might send heartbeats even when for
some reason it was too busy to send packets, just to keep the
channel alive. It might send heartbeats at a higher packet
priority status.
- You might consider piggybacking your ack requests on an
outgoing data packet.
- You will probably want to implement this as a connection
object with two associated threads, one to handle sending
packets, ack requests and acks, and another thread to handle
receiving them. You might as well make both ends symmetrical,
capable of sending a file in either direction.
- Have each connection object deal with transferring only one
file. If you need to send more than one file, use multiple
connection objects, that may or may not run in parallel.
- How big should the packets be?
- Make them the size of the biggest packet that can flow down
the link without being broken into subpackets on one of the hops.
See the TweakDUN utility for more information on how that
works.
- Make the packets huge, say 64K (or whatever datagrams allow)
to lower disk i/o overhead. The disadvantage of this is datagrams
need to be reassembled from subpackets before any processing. If
any subpacket is lost, the whole datagram is lost.
- Run a little test to find the optimal size by experiment. It
will depend on error rate, MTU, network congestion, etc.
- Make it a configurable option. The user does experiments to
tweak the number for each remote site and time of day if they are
fanatical.
- It is important to avoid per-packet overhead since the
packets may be relatively small. Unlike most layered protocols,
SAX trusts the lower layers to do their job and verify checksums
and figure out how long the packet is, and therefore how much
actual data it contains. I would think you could trim the
overhead down to a one-byte command followed by a 2 or 4 byte
chunk number. You might implement the command byte as a set of
bits that can be ORed so that you can piggyback a request for ack
on a normal data packet.
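The OR-able command byte might be sketched like this; the flag names and values are hypothetical.

```java
public class SaxCommand {
    // One-byte command implemented as OR-able bit flags, so that a
    // normal data packet can piggyback a request for an ack.
    public static final int DATA        = 0x01;
    public static final int ACK         = 0x02;
    public static final int REQUEST_ACK = 0x04;
    public static final int HEARTBEAT   = 0x08;

    // Test whether a given flag is set in a received command byte.
    public static boolean has(int command, int flag) {
        return (command & flag) != 0;
    }
}
```

A data packet that also requests an ack would then carry the command byte DATA | REQUEST_ACK.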
- How does the process end? You don't want to get into an
endless ack the ack sequence. In the very worst case, you can
recover from anything pathological when the receiver restarts the
transfer.
The sender will be the first to discover it has sent all the
packets, (simply by maintaining an unsent packet count). Then it
sends its status along with a request for status. If it does not
hear back, it repeats this and eventually gives up, then throws
away its bit map. Normally it will either get a request back for
some missing packets, or an all-done status in which case it can
shut down the channel.
The receiver will eventually notice that it has received all
packets. It sends its status then shuts down, and throws away its
bit map. It does not really matter if that final status packet
got through. The sender will give up soon enough.
If the receiver is expecting packets and does not hear anything,
it can prod the sender with a status packet, telling it of the
packets it is still waiting for. If it still hears nothing, after
several attempts, it shuts down the channel, but keeps its bit
map, and restarts the entire process, perhaps after some long
delay. If at any time the channel goes dead, the sender just
gives up and forgets the whole thing, and the receiver must
restart, sending its status describing the chunks it still wants.
Because of the way the protocol works, in theory the receiver
might be able to request just a sampling of the file. It
would fake a restart, specifying just the chunks it wanted. To
the sender it would look like a restart with a few missing
packets.
- SAX might implement a total file hash as a double check on
the individual packet checksums, but it would not recheck each
packet with a second checksum. The file level checksum would have
to be defined in such a way that it could be computed as the
chunks arrived in higgledy-piggledy order. Either end could
request a checksum of the other and any range of chunks. This
feature could be used by a clever strategy routine to narrow down
where the damage is and repair a damaged chunk that was corrupted
in transmission but fluked out and passed the packet checksum
test.
- You may find it simpler to organise the protocol as several
intercommunicating threads, each with a single purpose. E.g. one
thread may just send packets. Another might just receive them.
Another might keep track of whether the other end is still
alive. Another ensures the other end knows we are still alive.
You may find the logic of the protocol is thus simplified.
- Once you get the raw transfer going, add this wrinkle. With
each file send a serialised descriptor object. The object
contains fields such as:
- description of what the file is for.
- which application the file belongs to.
- version.
- date and time.
- creator.
- URL to go for an update.
- URN (future).
- platform for which the file is intended.
- mime type.
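The descriptor might be sketched as a serialisable class something like this. The field names are just my guesses at how you might spell them; nothing here is a nailed-down spec:

```java
// Sketch of the serialised descriptor object that travels with each
// file. Field names are illustrative, not part of any fixed SAX spec.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Date;

public class SaxFileDescriptor implements Serializable {
    public String description;   // what the file is for
    public String application;   // which application the file belongs to
    public String version;
    public Date   dateTime;
    public String creator;
    public String updateUrl;     // URL to go to for an update
    public String platform;      // platform the file is intended for
    public String mimeType;

    // Serialise to bytes so the descriptor can be sent ahead of the file.
    public byte[] toBytes() {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(bos);
            oos.writeObject(this);
            oos.flush();
            return bos.toByteArray();
        } catch (java.io.IOException e) {
            throw new RuntimeException(e.toString());
        }
    }

    public static SaxFileDescriptor fromBytes(byte[] b) {
        try {
            return (SaxFileDescriptor) new ObjectInputStream(
                    new ByteArrayInputStream(b)).readObject();
        } catch (Exception e) {
            throw new RuntimeException(e.toString());
        }
    }
}
```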
- SAX dates back to 1985 when I hired a guy to implement a
SAX-based EMAIL system based on dial up phone lines and the X.25
packet net. I was trying to figure out how you communicate over
long distance telephone satellite links or packet nets where you
have a very long ack turn-around delay. I was also concerned
about how you communicate with someone in Africa in the field
using primitive non-error correcting modems and noisy local phone
lines. SAX seems a good fit with the Internet and its long ack
times caused by packets pinging from hop to hop like balls in a
pin ball machine. Implementing SAX atop UDP should be even
simpler than implementing it directly via modem since you don't
have the problem of resynching after an error to find the
beginning of the next packet, and you don't have the problem of
reserving magic control characters that you must somehow quote
when they occur in data.
- FTP has a problem if you want to maintain three mirror sites.
You must upload a file three times over your narrow-band modem
connection to the Internet. You should need to upload the file
just once to one of the servers, and then orchestrate a direct
server to server transfer to copy to the other two servers. All
you should need is the necessary read/download write/upload
privilege at the servers. Ideally, this would happen
transparently. You would just tell SAX you wanted a file sent to
three places, and it would figure out the fastest way to do it.
FTP can do this now, but only if you have the rarely given shell
script Telnet privilege. FTP can, in theory, even handle such
a third party transfer within the protocol itself, but the
feature is rarely implemented. SAX would not require shell
privilege, so it could be used by ordinary folk. On a cold night,
years from now, you might even think about how you could use
multicast packets to send to more than one site all at once.
- Eventually you want to encapsulate SAX so that it is as
built-in as FTP. Instead of saying "ftp://" in your
HTML you should be able to say "sax://" to download a
file.
- FTP has a problem with conflict between uploaders and
downloaders. If somebody is downloading a file off my website, I
can't upload a new version of it. I have to wait until he is
finished downloading. If the file is popular, I may have to wait
until 3 AM when there is a tiny window of opportunity when no-one
is downloading it. Similarly, when I am in the process of
uploading a file, no one can download it. SAX should be clever
about this. When you upload a file that already exists, you
upload to a temporary file, so you don't block people from using
the old version. When the transfer is complete, you can do a
rename (if the OS is clever enough to allow renames of open
files), or go into a wait loop until the file is free to do the
rename. You need a much smaller time window in which to do the
rename than you need to upload the entire file.
- Writing a smart file uploader in Java (that used renames and
wait loops to avoid conflicts as described in the previous item)
could be a little project all on its own, using ordinary FTP
instead of SAX. You would use the ftpClient class methods. The
advantage is, you would not need to convince your servers to
install the SAX protocol. The disadvantage is it would not be as
fast.
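The rename trick might look something like this, sketched with local java.io.File operations rather than real FTP calls (the class and method names are invented for illustration):

```java
// Sketch of the temp-file-plus-rename upload trick. The new version
// is written to a temporary file first, so downloaders of the old
// copy are never blocked; then we retry the brief rename in a wait
// loop, since the rename window is tiny compared with the time
// needed to upload the whole file.
import java.io.File;

public class SmartUpload {
    // Replace target with the fully uploaded temp file.
    public static boolean install(File temp, File target, int maxTries) {
        for (int i = 0; i < maxTries; i++) {
            target.delete();               // harmless if absent or busy
            if (temp.renameTo(target)) {
                return true;               // the rename window is tiny
            }
            try {
                Thread.sleep(1000);        // wait for downloaders to finish
            } catch (InterruptedException e) {
                return false;
            }
        }
        return false;
    }
}
```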
- The SAX protocol is very simple, or will be,
once it is precisely nailed down. However, the strategy engine
behind SAX can be as simple or sophisticated as you want.
There is the challenge that allows for true competition
between SAX implementors. There are some pieces in the SAX
protocol that don't appear to be used, such as receiver being
able to request the sender's bitmap status. These are for
potential use by more sophisticated strategies and for
troubleshooting/debugging.
Chris <chrisl@hamptons.com>
has already got a prototype SAX protocol
working.
SCROOM Netscape
Navigator Plug-In
Have you ever been to a website that treated you badly? They may
have:
- Created 20 frames on your screen. As you killed each one,
several others sprang up Hydra-like to take its place.
- The site promised great wonders but did not deliver.
- The site contained some nasty Trojan that crashed your
machine.
You vow never to visit that site again, but you keep finding
yourself there because of hidden or misleading links in other
sites to it.
SCROOM comes to the rescue. You give it a list of sites, both by
name and by IP dotted quad that you never want to visit again.
Whenever you accidentally click to go there, or a site tries to
send you there, the request is blocked. The mechanism is very
similar to how a Net Nanny would protect children from adult
sites, but it has a quite different purpose. I'm told a
commercial product called NetGranny has some logic like this in
it. It might serve as a model.
Struct reader Amanuensis
Another boring task is writing code to read and write C
structures using LEDataInputStream and DataInputStream. See the
FileIO Amanuensis for the general idea. You would either:
- Feed it a C/C++ declaration and have it generate a set of
Java class declarations, and the read and write method to read
one struct.
- Check off the types and names of the various fields, creating
a binary record descriptor. Use this to generate a C++ struct and
the set of Java class declarations, and the read and write method
to read one struct.
- Use reflection to discover the structure of the Java class
which may have embedded references. You need to create a fixed
length record, so you must presume each reference field is not
null, and you also need caps on the lengths of String
fields.
Even if you can't get it perfect, and just throw up your hands at
the problem of references and variable length Strings, you can
still save a lot of the bull work.
A more efficient variant would generate code to read entire
records into a byte array, then generate get/put accessor
functions to read/write the various fields. The accessor would
have to construct the binary value by shift and logical or of the
bytes at the corresponding offsets in the byte array, using logic
similar to that in LEDataInputStream and LEDataOutputStream.
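For example, a generated accessor for a little-endian int field might look like this. The record layout is hypothetical:

```java
// Sketch of the generated get/put accessor style: the whole record
// lives in a byte array, and each field is read by shifting and
// ORing the bytes at a fixed offset -- little-endian here, as
// LEDataInputStream would read them.
public class PersonRecord {
    // Hypothetical fixed layout: int id at offset 0.
    private final byte[] rec;

    public PersonRecord(byte[] rec) { this.rec = rec; }

    // little-endian 32-bit int at the given offset
    public int getInt(int off) {
        return (rec[off] & 0xFF)
             | (rec[off + 1] & 0xFF) << 8
             | (rec[off + 2] & 0xFF) << 16
             | (rec[off + 3] & 0xFF) << 24;
    }

    public void putInt(int off, int v) {
        rec[off]     = (byte)  v;
        rec[off + 1] = (byte) (v >>> 8);
        rec[off + 2] = (byte) (v >>> 16);
        rec[off + 3] = (byte) (v >>> 24);
    }

    public int getId()        { return getInt(0); }
    public void setId(int id) { putInt(0, id); }
}
```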
Super
Compressor
Ordinary ZIP (Lempel-Ziv-Welch) compression depends on finding
repetition and, roughly speaking, creating abbreviations for
common words. This approach does not do well on small files. Here
are some ideas for super compression.
- Imagine you had a dictionary of 64,000 words. You numbered
them 1 to 64,000. Before you compress a text file, you look up
each word and replace the word and its following space by a
16-bit binary number -- the word's index in the dictionary. You
create specialized dictionaries for compressing/decompressing
different kinds of files.
- Imagine the state of a compression engine just after it had
finished compressing the entire JDK 1.2 Java source code. What if
you preloaded that state just before compressing any Java source
code file. You would then have a set of common Java
abbreviations all ready to go. There would be no penalty for
compressing a short file. You need to create loadable states for
various types of files and distribute them along with the
compression/decompression engine.
- Imagine you had fixed length records to compress. If you
transposed the fields and records so that all the names, then all
the addresses, then all the postal codes, then all the phone
numbers... came in order, the compression engine could do much
better because the repetition is clearer. Further you might
recognize certain field types such as YYYY/MM/DD and replace them
with 16-bit days since 1860 or some such.
- Write a compressor that uses information about compressing
other files in the archive. This gets tricky. If you delete an
item, you may have to adjust the remaining entries.
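The preloaded-state idea can be approximated today with the standard java.util.zip classes: Deflater.setDictionary lets you prime the engine with common phrases before compressing, so even a very short file pays no warm-up penalty. A sketch, with an illustrative dictionary:

```java
// Sketch of "preloading state" with java.util.zip. The dictionary is
// a byte array of common phrases; both ends must preload the same
// one. The dictionary text below is illustrative only.
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PrimedZip {
    public static byte[] compress(byte[] dict, byte[] data) {
        Deflater def = new Deflater();
        def.setDictionary(dict);          // preload common abbreviations
        def.setInput(data);
        def.finish();
        byte[] buf = new byte[data.length + 64];
        int n = def.deflate(buf);
        def.end();
        byte[] out = new byte[n];
        System.arraycopy(buf, 0, out, 0, n);
        return out;
    }

    public static byte[] decompress(byte[] dict, byte[] zipped, int rawLength)
            throws DataFormatException {
        Inflater inf = new Inflater();
        inf.setInput(zipped);
        byte[] out = new byte[rawLength];
        int n = inf.inflate(out);         // returns 0: dictionary needed
        if (n == 0 && inf.needsDictionary()) {
            inf.setDictionary(dict);      // receiver preloads the same state
            inf.inflate(out);
        }
        inf.end();
        return out;
    }
}
```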
Sort
Visualizer
This is a tool for teaching people how various sorts work. It
lets you watch them work. It also lets you see how well
they work on already sorted data, or data that is sorted in
reverse order. People who like watching Norton SpeedDisk will
enjoy the hypnotic effect of this program.
You start with an array 400 elements long. You initialise it like
this:
int[] array = new int[400];
for ( int i=0; i<400; i++ )
   array[i] = i;
Then you scramble the array. You can do this by generating random
pairs of integers and exchanging those two array elements. Do
this 10,000 times. See Random Numbers in the Java glossary to see
how to generate random numbers 0..399.
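The scramble step might be coded like this:

```java
// The scramble step: pick random pairs of indexes and swap those
// two elements, 10,000 times for the 400-element array.
import java.util.Random;

public class Scramble {
    public static void scramble(int[] array, int swaps) {
        Random wheel = new Random();
        for (int s = 0; s < swaps; s++) {
            int i = wheel.nextInt(array.length);  // random index 0..length-1
            int j = wheel.nextInt(array.length);
            int temp = array[i];
            array[i] = array[j];
            array[j] = temp;
        }
    }
}
```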
Then you display the array in a 400x400 square grid of pixels. If
array[4] == 10 you plot a dot at x=4 y=10.
There is source for RadixSort, HeapSort, QuickSort and ShellSort available to
download.
You insert a call to your drawing routine inside the sort. The
sort must call it every time it moves an element of the array.
It might be easier to call it on every compare instead. Some sorts
may make a copy of the array or in some other way challenge you
with a way to extract the needed data to plot. You might even
have to understand how the sort works!
Then you start your sort. Now you can sit back and watch the
points skitter about like tiny soldiers racing into order in a
straight diagonal line as the sort progresses.
If it turns out you need speed, instead of repainting the entire
grid each time, compare the old and new array and notice the
differences. Then adjust just the pixels that need it.
Make it an Applet so that people can play with it on the web.
They should be able to choose which sort algorithm to use. You
may need to insert a variable delay into your plot routine if the
sort is too fast to see.
You might write some el-cheapo sorts that don't work well on a
large number of elements to demonstrate why they should not be
used. Write an insertion sort (one at a time it inserts elements,
sliding all those above to make room) or a bubble sort that
percolates the biggest element to the end via a series of
exchanges. On each pass it ignores the previous biggest elements,
and finds the next biggest. Write a scanner sort that finds the
smallest element then copies it to another list, and either marks
the original void, or slides elements above it down. Then it
repeats.
If that was not enough challenge, add this wrinkle. For people
with big screens allow the size of the grid to be as big as the
screen, or maybe even bigger and allow them to pan around on it.
You might even display your scrambling process to give the users
something to watch while you scramble the data.
Play with colour, encoding extra information into the dots using
hue, saturation and brightness. See HSB in the Java glossary. You
might use warmness of the colour to encode how close the dot is
to its home position, or how frequently or how recently it has
been moved.
There is a similar Applet in the Sun
Tutorials. It uses lines instead of dots, which may be even
easier to see. It also uses threads, so that you can run several
sorts simultaneously, in a horse race, using threads. It also
contains code for a bubblesort and the slightly improved
bi-directional bubble sort. The bi-directional sort bubbles the
biggest element to the top, then the smallest element to the
bottom, then the next biggest element to the top etc., alternating
directions.
Another thing your harness could do is experimentally test if
sorts are stable. You can't prove a sort is stable that
way, but you can prove it is unstable. By stable I mean
does not disturb existing order in the event of a tie in keys.
You can turn an unstable sort into a stable one by using position
in the array as a minor key.
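The position-as-minor-key trick might be sketched like this, wrapping each element with its original index and breaking key ties on that index:

```java
// Sketch of stabilising a sort by using original array position as
// the minor key. keyed[i] = { key, originalPosition }; ties on key
// fall back to position, so equal keys keep their original order.
import java.util.Arrays;
import java.util.Comparator;

public class Stabilise {
    public static void stableSortByKey(int[][] keyed) {
        Arrays.sort(keyed, new Comparator<int[]>() {
            public int compare(int[] a, int[] b) {
                if (a[0] != b[0]) return a[0] < b[0] ? -1 : 1;
                // minor key: original position breaks the tie
                return a[1] < b[1] ? -1 : a[1] > b[1] ? 1 : 0;
            }
        });
    }
}
```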
Stay In
Touch Database
You might have one of these for your grad class so you can all
stay in touch over the years. It is simply an online database
where you can store such things as name, address, phone, email,
website, photo then and now, and public digital keys. You might
also store free form text about family, spouse, job, what's
happening with you.
It might be public or only open to the participants. You can use
the list for contacting for reunions or seeing who you know who
might be involved in a business service you need.
It could have many uses, church and club registries, printing
form snail letters, sending bulk email. The key is it can be
updated or searched online by those registered. It would have
very strong editing to ensure no invalid phone numbers, postal
codes, state/provinces etc were entered.
Its main function is rounding everyone up for 10 year reunions.
It has some status fields to help in that regard. Has the person
been contacted? Are they coming? Who knows them who might be able
to contact them? Who are they bringing? What food have they
agreed to bring? Can they provide transportation? Do they need
transportation?
People could provide information, but not make it generally
public. You could send someone a short message even if they were
unwilling to give out email or physical address.
I provide a service like this, all handled manually, for the
Living Lovers of Planet Earth, people who were involved with the
teachings of the late Ken Keyes.
It would most likely be implemented either with Java Applets
running JDBC, or Servlets talking to a Java-less browser.
Submit-It
Clone
There are various services such as Submit-It, Register-It and WebPromote that will submit
your home page to a number of search engines for incorporation.
They may charge $40 or more for the service. You might create a
free version that runs as a Java application so people can do the
work themselves. Basically it simulates a browser, and goes from
search engine to search engine filling in forms. If anyone is
interested, I have some sample code that fills in a form and
submits it. The main work is finding all the search engines,
researching the forms, and making sure the program does not
submit an obsolete form. You need to maintain the program any
time a form changes. A clever version may analyse generic forms
and figure out how to fill them in based on its knowledge of
individual fields. That way all you have to do is add code for
new field types. The program automatically responds to changes in
field order, or removal or addition of fields.
Systray Deleecher
This project is not particularly well suited to Java, but
more to C++. However, you might tackle it with a combination of
Java and C++ or Java and MASM to get some practice writing native
methods via JNI.
In the beginning, when Microsoft created Windows, and saw that it
was good, Microsoft made a rule. If you want a program to
automatically start up, just put an icon for it in the STARTUP
folder. Every vendor and his dog crammed the STARTUP folder with
their useless little doodads. People simply deleted them to stop
the apps from autostarting. Every once in a while, users would
accidentally delete something they should not have. There was no
easy way to recreate it.
Microsoft itself and later other vendors then started getting
clever, finding obscure ways to make their programs automatically
start -- ways that users could not defeat. Users had then to put
up with all kinds of useless programs running all the time
cluttering the systray or task bar, chewing up RAM, eating CPU
cycles, and doing precious little useful to earn their keep.
I call these annoying autostarting programs leeches. Steve
Crook coined the term as an improvement on my original
worms which already had an established meaning. Your job
is to create a deleecher that defeats all known strategies for
autostarting. Your program should be able to remove any
individual leech on request. When your program removes them, it
should record how the leech used to be hooked into autostart, so
it can be restored later at any time with just a mouse click.
I have recently learned there is a shareware deleecher already
written that defangs nine different techniques for autostarting.
It also allows you to temporarily disable autostarts without
removing them entirely. It is called Startup
Manager.
TweakDUN
Clone
TweakDUN is a utility that looks at your WINDOWS\HOSTS.
file in Windows9x or your
winnt\system32\drivers\etc\hosts. file in NT and
refreshes the IPs in it by looking them up via DNS. Your clone
ensures that the entries for the IPs in your quick DNS look-up
hosts. file are accurate. A hosts. file might
look something like this:
IP               domain             comment
127.0.0.1        localhost          #SKIP
# search engines
205.180.85.126   www.altvista.com
206.132.244.95   www.dogpile.com
198.3.98.99      www.excite.com
209.185.151.128  www.hotbot.com
209.224.233.192  www.InterNIC.com
209.67.228.51    www.lycos.com
# bix
199.93.4.67      x25.bix.com        # alternate
199.93.4.127     bix.com            # usual
To be compatible with TWEAKDUN, you should skip any entries
marked with #SKIP like this:
127.0.0.1 localhost #SKIP
Unlike the real TweakDUN, your clone should not behave like a
bull in a china shop and remove any comments or comment lines --
text to the right of a #. This utility may be written in Java but
only has application on Windows9x/NT. There may be equivalent
files on other operating systems. You might tidy the file so that
all the domain names line up nicely. One simple way to do that is
to put lead zeros on each field of the IP address to pad them out
to 3, though most people would probably prefer you not do that.
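The fiddly part of such a clone is parsing each line without trampling the comments. Here is a sketch of the line handling; a real clone would then refresh the IP with InetAddress.getByName(domain), which is omitted here so the sketch runs without a network:

```java
// Sketch of hosts. file line handling for a TweakDUN clone: split
// each entry into IP, domain and trailing comment, honouring #SKIP
// and preserving comments rather than discarding them.
import java.util.StringTokenizer;

public class HostsLine {
    public String ip;
    public String domain;
    public String comment = "";
    public boolean skip;   // comment-only lines and #SKIP entries

    public HostsLine(String line) {
        String text = line;
        int hash = line.indexOf('#');
        if (hash >= 0) {
            comment = line.substring(hash);   // keep, don't trample
            text = line.substring(0, hash);
        }
        StringTokenizer st = new StringTokenizer(text);
        if (st.hasMoreTokens()) ip = st.nextToken();
        if (st.hasMoreTokens()) domain = st.nextToken();
        skip = domain == null || comment.startsWith("#SKIP");
    }
}
```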
Uncrackable Encryption
Uncrackable encryption for the masses. It works transparently
with the email package. It uses CDs or DVDs distributed ahead of
time to those communicating. CDs are recorded with truly random
numbers generated by monitoring radioactive decay. It is a OTP
One Time Pad system using a simple XOR of
message and key, using a key equal in length to the message, and
never reusing a key. See the encipher utility documentation
for more detail on how the scheme works.
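The XOR at the heart of the scheme is tiny. In this sketch the pad is an ordinary byte array standing in for the truly random bytes read off the CD; the real system must never reuse any stretch of pad:

```java
// The XOR one-time-pad in miniature. Since (m ^ k) ^ k == m, the
// same call both encrypts and decrypts. The pad bytes stand in for
// the truly random key material distributed on CD.
public class Otp {
    public static byte[] crypt(byte[] message, byte[] pad, int padOffset) {
        byte[] out = new byte[message.length];
        for (int i = 0; i < message.length; i++) {
            out[i] = (byte) (message[i] ^ pad[padOffset + i]);
        }
        return out;
    }
}
```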
A more advanced version of the project requires you to build a
peripheral that is a true random number source. This would
be useful for creating one-time pads, and also in various
simulations. Here are several ways it could work:
- Monitoring the timing between clicks of radioactive decay (such
as from the tiny piece of Americium in a smoke detector)
registered by a crude Geiger counter.
- Noisy transistor, or transistor bank.
- FM receiver that listens to the hiss between stations.
- Monitoring the precise timing between the user's
keyclicks
- Measuring some irregular biometric function.
In all cases, your problem is generating randomness fast enough,
and making sure your clients don't sample so frequently they lose
randomness.
Once you have this mechanism in place, you can build an
uncrackably secure webphone. You use the standard Java microphone
API to convert sound into numbers, compress them a little,
encrypt the numbers and send them over the web in real time. On
the other end you reverse the process, and play the sound. You
logically have two channels, one in each direction. Keep in mind
that snoops can still place hidden microphones at either end, or
snoop on EMF radiation from your computer. It is best to use a
laptop which emits fewer clues. However, snoops can't do
anything simply by packet sniffing or intercepting and
replacing packets. There is a slight delay as packets wend their
way through the Internet, but not that different from a long
distance call to Indonesia from Canada.
WebRing
controller
WebRings are groups of like-minded websites. You can browse from
site to site in a circle by clicking a button. You can go forward
or back. You can jump to a random site. You can see a list of all
the sites, and jump directly to any one of them. www.WebRing.org organises such
rings for free.
There is a central controller CGI app running on a server that
figures out where to jump to next when a given button is pressed.
When the user hits next, prev, random
or list, (usually on a custom CGI graphic map) the
server forwards the user to the appropriate site in the circle.
There are two major problems with WebRings as they currently
exist:
- The HTML for linking at each site is not properly verified.
The target URL should be precisely the spot on the page
where the link control to the next site is. With the current
unverified schemes, you may land at somebody's home page, and
have to dig through his entire site to find the next link. (There
might not even be one, if the site's owner made an HTML error, or
dropped out of the ring without informing the Ring Master.) You
may land on a page with links to 20 other WebRings on it. You
have to find the link for the WebRing you are interested in. We
need to automatically verify (and periodically reverify)
that the HTML target URL and the link control HTML buttons at
that target are properly configured. You need a precise target
URL of the form:
http://www.MySite.com/ringpage.html#SuperRing$20978$7752$13413
where "SuperRing" is the ring software scheme,
"20978" is the server id, "7752" is the ring
id, and "13413" is the site id. IDs are small integers
evenly divisible by 17 (which acts as a simple checksum, yet
allows ids (after division by 17) to directly index tables).
- Like the old serial Christmas tree lights, if any link in the
chain goes down, the entire ring is broken. The central site
needs to keep tabs on all the sites and remove ones (temporarily,
and eventually permanently) that do not respond. Some rings have
a button to bypass the next site to help get around the problem.
However, that won't help unless the user knows to try that. It
also requires there not be two dead sites in a row.
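The divisible-by-17 id scheme mentioned above takes only a couple of lines; the class name is mine:

```java
// Sketch of the divisible-by-17 id scheme: the factor of 17 acts as
// a cheap validity check, and dividing it out yields a dense index
// straight into the ring/site tables.
public class RingId {
    public static boolean isValid(int id) {
        return id > 0 && id % 17 == 0;
    }

    public static int toIndex(int id) {
        return id / 17;   // directly indexes the tables
    }
}
```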
Your job is to write a servlet that runs a collection of
WebRings, sending browsers onto the next link. You need to find
low-cost solutions to the above two problems. You want the user
interface extremely simple so that almost anyone could set up a
WebRing server and run it with little time investment.
There are four levels of organisation:
- Individual Site Master. One
for each site on a ring. A site may be on several rings. They may
be on rings using different software schemes.
- Ring Master. Decides the
ring's charter, and who should be allowed on the ring. Monitors
that members are conforming to the charter. Handles technical
support for the sites on the ring. Maintains the master list of
sites on the ring, and a home page for the ring. Composes a logo
for the ring.
- Server Master. Runs a
server that services many rings. Handles technical support from
the Ring Masters, but not from individuals or individual Site
Masters. Maintains a master list of rings handled by the server
and a home page for the server's ring service.
- Global Directory Master.
Much the way DNS works, the server sites automatically share
their directories of rings to form a master directory of rings.
This way each site has a copy of the global directory of all
rings on the Internet (ideally even including ones not using your
software). Alternatively, server sites may not keep a local copy
of the complete list, but instead maintain a link to where a
global copy can be obtained. End users of the system are, for the
most part, unaware of these organisation levels. They can find a
list of sites on a ring, a list of other WebRings on the server
or a master list of all the WebRings (categorised and keyword
searchable) just by clicking buttons.