In reading Scott's post about weblog comment spam, I was reminded of a thought I've had for some time now. But rather than just tell you, I'll tell you how I came upon the idea and see how quickly you come to the same conclusion.

When I'm asked to interview job candidates at work, it's usually in one of a few capacities. Most often it's "the database interview" in which I get to figure out how much the interviewee knows about relational databases--specifically MySQL. (Gee, I wonder why they pick me for that.)

Other times I'm interviewing folks for what I call "cultural fit." That basically means I'm trying to figure out if he or she will "fit in" at Yahoo while also conveying an idea of what it's like to work at Yahoo, both the good and the bad.

A few times I've also been asked to interview folks in a general technical capacity and to see how well they think about thorny issues, solve problems, etc. When I do that, one of my favorite lines of questioning involves search engine technology and the challenges of indexing the whole web.

At some point we end up discussing PageRank and similar techniques for figuring out site popularity and the various ways that one can abuse those techniques. So I eventually ask something like this:

Assuming that you have a map of the entire web (a link map or "graph" if you want to get all computer science about it), can you think of ways that you might try to detect and ultimately combat link spammers who are clearly trying to game the system?

The ensuing discussion is usually interesting, mostly because the candidate has rarely ever thought about the issue. But when prompted to do so a light bulb usually goes off. Sometimes it takes a few seconds but it usually happens.

Think about it a bit. I bet you'll come up with a few approaches. They may not be perfect, but that's hardly the point.

Now back to Feedster and Technorati. If you're still reading this, I'll pose the obvious question:

Assuming that you have a sizable list of all the weblogs around (meaning that you're Feedster or maybe Technorati), you crawl them regularly (or at least fetch their RSS feeds), know how often they update, and even know which ones frequently cross-link, can you think of one or more techniques for detecting weblog comment spam almost as it happens?

Yeah? Me too.

Furthermore, you could probably even offer a service that works with TypeKey, MT-Blacklist, and other solutions to proactively warn about the spammers.

Posted by jzawodn at May 12, 2004 09:08 PM

Reader Comments
# Bernhard Seefeld said:

I wrote about that a while ago. I wonder whether Google and Yahoo have implemented something like this already (in which case they should actually talk about it!).

on May 13, 2004 12:39 AM
# jr said:

Wait, I'm confused.

Most RSS feeds don't contain the comments, only the posts. For that matter, a significant portion of blogs display comments in a separate pop-out window. The only way that such services would be able to detect comment spam directly would be for blogs to publish comments as part of the principal RSS feed, as a separate RSS feed, or as a well-defined part of the page (thus solving the "Comments" vs. "Yak Back Bro, Woot." issue), or for us to create a "comment ping" service, similar to the way that posts can register themselves with Yahoo!, blo.gs, Feedster, and their ilk.

I'm not opposed to that approach, although it would involve work on both the blog author's side as well as the comment clearing warehouse that would analyze incoming comments for trends, but it's probably "do-able".

Still, I can't think that something like that would be better than sending the links through Google's PageRank stripper and letting that defeat the whole reason that folks spam comments to begin with. As an added bonus, other search engines don't have to do anything other than ignore URLs with the common stripping header (particularly since it's an indication from the page owner that they do not wish that link to be followed).

Will you still get some idiot spam monkey who's too stupid to realize that it's not going to help? Sure, you bet. You're going to get them anyway.

on May 13, 2004 07:59 AM
# clever said:

Google doesn't use links in blogs to calculate PageRank.

on May 14, 2004 01:38 AM
# Bob Wyman said:

Grumble: I think you should have included PubSub.com in the list of sites that *should* be able to detect LinkSpamming...

One simple method would be to watch for sites that get very rapid boosts in their PubSub LinkRanks. Such burst detection, possibly combined with analysis of the linked-to site's content, could give you a very good predictor of LinkSpamming.
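Here is a minimal sketch of that burst-detection idea, under stated assumptions. It does not use PubSub's actual LinkRank data or algorithm; the `history` structure (per-site inbound-link counts per interval, oldest first), the `detect_bursts` function, and the `factor` and `min_links` thresholds are all hypothetical and chosen only for illustration.

```python
from statistics import mean


def detect_bursts(history, factor=5.0, min_links=20):
    """Return sites whose newest inbound-link count dwarfs their own baseline.

    history maps a site to a list of inbound-link counts per interval,
    oldest first. This structure is an assumption made for the sketch.
    """
    flagged = []
    for site, counts in history.items():
        if len(counts) < 2:
            continue  # no baseline to compare against
        *baseline, latest = counts
        # Floor the baseline at 1 so a site with no prior links
        # doesn't cause a division by zero.
        typical = max(mean(baseline), 1.0)
        # Require both an absolute minimum and a relative jump, so
        # tiny sites and steadily popular sites are not flagged.
        if latest >= min_links and latest / typical >= factor:
            flagged.append(site)
    return sorted(flagged)


# Made-up sample data: one quiet blog, one sudden burst, one popular site.
history = {
    "example-blog.test": [3, 4, 2, 5, 4],
    "cheap-pills.test": [0, 1, 0, 0, 60],
    "popular-news.test": [200, 210, 190, 205, 230],
}
print(detect_bursts(history))  # ['cheap-pills.test']
```

A relative threshold alone would flag a site that goes from one link to five, and an absolute threshold alone would flag every popular site, which is why the sketch requires both. Content analysis of the linked-to site would be a separate second pass.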

The problem, of course, is that neither comments nor trackbacks are commonly included in the RSS files that PubSub and other sites typically read. Thus, in order to enable such detection, monitoring sites would have to scan the HTML of the blogs themselves (which would be *very* inefficient). A more infrastructure-heavy alternative would be for comments and trackbacks to be routed through a central and broadly used facility. (Or, blog hosters could agree to share a common facility while providing "branded" services.)

Even if the problem of detecting LinkSpamming can be solved, there remains the question: "What do you do about it once you've found it?" Do you send someone an email? If so who? Do you put the site on a blacklist? If so, does everyone's blog software now have to read that blacklist? Merely knowing that it is happening is only one of many steps that need to be taken.

bob wyman

on May 14, 2004 11:59 AM
# jon said:

I get no spammers on my blog at all because I use Haloscan, which pops up the comment box in a non-crawlable JavaScript form.

I figure that since I run a gambling and sports blog, I'd get bundles of spammers if I had crawlable comments.

Works for me boss.

on May 15, 2004 07:51 AM