Bayesian blocking software

There’s a lot of controversy over the idea of web proxies that block sites, for use in the office, on kids’ computers, or in schools. Much of the controversy centres on the censorware companies blocking sites according to their own political ideals (the Peacefire people conducted an interesting experiment showing up the hypocrisy of some of these companies). Most of the companies don’t publish their block lists, but those lists are normally at least partially generated by keyword searches for “objectionable” words.

I wonder whether a Bayesian filtering sort of system might be more reasonable? It’s been pretty successful at detecting spam, and identifying sites you might not want your kids or employees to browse (assuming you buy into the concept of censorware at all) is the same sort of textual analysis problem. It would be easy to store the list of “sites I hit that were blocked” somewhere for review, so that incorrectly blocked sites could be reclassified as acceptable, and that reclassification would improve the accuracy of the blocking algorithm. (This doesn’t solve the problem of false negatives: not blocking sites that should be blocked. But OK.) Wouldn’t be too difficult to do, I wouldn’t think.
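Something along these lines, say. This is only a minimal sketch in Python of the idea, not any real product: the class, the tokeniser, the threshold, and the blocked.log review file are all made up for illustration.

```python
# Sketch: naive Bayes page classification with a review loop.
# All names here (BayesianBlocker, blocked.log, the threshold) are
# illustrative assumptions, not an existing tool.

import json
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z']+")


def tokenise(text):
    """Lower-case the page text and pull out word-like tokens."""
    return TOKEN_RE.findall(text.lower())


class BayesianBlocker:
    """Two-class naive Bayes over page text: 'blocked' vs 'acceptable'."""

    def __init__(self):
        self.word_counts = {"blocked": Counter(), "acceptable": Counter()}
        self.page_counts = {"blocked": 0, "acceptable": 0}

    def train(self, text, label):
        self.page_counts[label] += 1
        self.word_counts[label].update(tokenise(text))

    def score(self, text):
        """Return an estimate of P(blocked | text), with Laplace smoothing."""
        log_odds = math.log((self.page_counts["blocked"] + 1) /
                            (self.page_counts["acceptable"] + 1))
        vocab = set(self.word_counts["blocked"]) | set(self.word_counts["acceptable"])
        total_b = sum(self.word_counts["blocked"].values())
        total_a = sum(self.word_counts["acceptable"].values())
        for word in tokenise(text):
            p_b = (self.word_counts["blocked"][word] + 1) / (total_b + len(vocab) + 1)
            p_a = (self.word_counts["acceptable"][word] + 1) / (total_a + len(vocab) + 1)
            log_odds += math.log(p_b / p_a)
        return 1 / (1 + math.exp(-log_odds))

    def should_block(self, url, text, threshold=0.9, log_path="blocked.log"):
        """Block pages above the threshold and record the URL for later review."""
        blocked = self.score(text) > threshold
        if blocked:
            with open(log_path, "a") as log:
                log.write(json.dumps({"url": url}) + "\n")
        return blocked

    def reclassify_as_acceptable(self, text):
        """Review step: retrain an incorrectly blocked page as acceptable."""
        self.train(text, "acceptable")
```

The threshold is the dial between over- and under-blocking, and the review log is what closes the loop: pages that were blocked wrongly get fed back in as “acceptable” training data, which is the reclassification step described above.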