So far, my best project ideas have all been solutions searching for problems. Luckily for me, the problem space of Search is large enough that this might work. Here’s my best idea for a term project: A program reads in a block of text to use as a reference corpus. There may be several smoothing options:

Corpus Smoothing

No smoothing performed
the reference corpus is expected to contain the entire vocabulary.
Laplacian smoothing with a given vocabulary size
This may not be useful without providing “filler words” for the unknown-word probabilities.
Dictionary-based smoothing
This could use /usr/share/dict/web2 or something of the like. Reference term frequency distribution is based entirely on letter frequency distribution in the dictionary. The term “xyzzy” will have low probability simply because its constituent letters are uncommon, and the term “eats” will have high probability because its constituent letters are very common.

Query Smoothing

This reference corpus is used to perform smoothing on queries. When provided with a query, it will present a smoothed term distribution function using some sort of recursive similarity metric. The potential metrics are (in order of increasing cost):

Basic (bigram) Markov chains
If term X follows term Y, then the presence of Y probabalistically implies the presence of X. For example, in “a nice hat”, “a” implies “nice” and “nice” implies “hat”.
Bidirectional Markov chains
Identical to Markov chains, but calculated using forward bigrams (basic Markov) and reverse bigrams. For example, in “a nice hat”, “nice” implies both “hat” and “a”. This is equivalent to taking the text, reversing it, and summing the term distributions of standard Markov on each.
Distance-based Markov chains
If a term X is within N terms of term Y, then Y implies X. This is a symmetric relationship. For example, in “a nice hat”, “a” implies “nice” and “hat”, “nice” implies “a” and “hat”, and “hat” implies “a” and “nice”.
Structure-based markov Chains
This requires that the reference corpus be segmented into structures (potentially hierarchical structures), which could be done by POS Tagging. In this example, if a term X is within the same structure as Y, then Y imples X. Notice that this is not symmetric. In the hypertext <p>I like <strong>black beans</strong> with soup</p>, “soup” implies “beans” but “beans” does not imply “soup” because the <strong> structure is nested inside the <p> structure. Hypertext documents generated with semantics in mind will likely produce the best results with this method.

In all cases, the similarity metric is applied recursively until some threshold is met (maximum depth or minimum probability contribution).


What is this useful for? Well, I’m not really sure, but here are a few application ideas that could be tried:

Query modification
The highest-ranking terms in the smoothed distribution could be included as feedback for a query. Using a small corpus, this would function to bias a query towards a particular subject. As a on-line process, this could function similar to search engines’ existing “find related documents” operations. Risks involved:

  1. Query modification has been attempted before and in general is not regarded as a significant improvement.
  2. It may be difficult for a system that only performs query modification to become popular, particularly if it requires extra labor seeding the reference corpus.
Query generation via document clustering
Combining multiple documents creates an implicit cluster. A user could produce these clusters by categorizing a set of bookmarks. As an example, I could create a cluster “Planted Tanks” and include several web pages from The Krib and other aquarium-related websites, and use that to drive automatic query generation (“find similar pages”) or query modification (“search within this topic”). The best part is that this system could sit in front of existing search engines and provide enhanced search as a side-effect of an online bookmarking system. There are two primary risks to the success of this system:

  1. A user’s classification of bookmarks may not produce effective clusters for many broad topics, or for topics that require extrinsic knowledge.
  2. Bookmarks to large resources may only access the front page, which contains little useful content. In this case, there is knowledge that is hidden from the system unless it recursively adds linked pages.


The more I think about it, the more I like the bookmark-clustering idea. But what about actually executing it?

A C++ programmer, a Python programmer, a Ruby programmer, and a PHP programmer walk into a bar. The bartender says, “Hey guys, free drinks if you build me a bookmark application with cluster-based searching.”

The programmers had to pay for their drinks because they couldn’t agree on a language to use.

So how could a project like this be split up to make the programmers happy? It could be layered somehow:

  • PostgreSQL or MySQL is used to store data.
  • PHP generates the HTML web interface. It has read-only access to the database, and it can issue SOAP or XML-RPC commands to the Python.
  • Ruby does … I don’t know what, because I don’t know what Ruby is particularly good at. Someone else will have to clue me in on this.
  • Python is used to crawl the bookmarked websites and generate content suitable for processing (i.e. take HTML output and delimit it, either as a list or tree of fragments, depending on the query smoothing method desired). It has read/write access to the database and can invoke C++ code either through shared-object modules or through os.system().
  • C++ performs number crunching for generating probabilities and whatever else. It has read/write access to the database.