Would you be surprised to know that some people who work in the search engine "industry" know who is responsible for a lot of the comment spam out there? I met some of them recently. And some of them even have blogs of their own. Seriously.
I haven't written much about this yet, but the recent problems exposed in MovableType have brought it to a head (see: Comment Spam Load Issue, More on Comment Spamming, and Spam and the Tragedy of the Commons).
One of the comment spammers asked me: "You know why we spam blogs, don't you?" And I knew the answer. They do it because blogs are easy targets and because, just like e-mail spam, it works.
Jay Allen said:
If I chase the spammers out of my yard and onto the neighbors, it's only a matter of time until they come back. No, we all need to disincentivize these fuckers now.
He's right. And there's an 80/20 solution that ought to go a long way toward solving this problem. We know that spam works because of web page ranking algorithms based on link counting (PageRank, WebRank, whatever). But as humans, we can clearly distinguish between content posted by a blog's owner and content posted by random, anonymous, and possibly malicious users (or spambots). Search engines today seem not to, but there's a reasonable argument that it's worth putting some effort into doing so.
If you assume the following:
- 80% of blogs are hosted by or produced on one of the more popular blogging platforms
- 80% of people don't significantly tweak the default templates available in their blogging software
- those people are the least likely to be actively fighting spam and, as a result, have more spam than the 20% of blogs where the owner is more defensive
Then a partial solution is fairly clear. I've heard and seen others discuss it over the past few months. The search engines need to be smarter about reading and indexing content.
When folks like Tim build software that classifies pages, the software needs to be able to recognize the difference between links produced by the blog owner(s) and those contributed by readers and spambots.
Once you can identify the difference between those two types of links, you simply stop using the second type when calculating rank. Sure, you can still count them for the purpose of providing link counts--just don't factor them into the ranking.
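For illustration, here's a minimal sketch of how an indexer might separate owner links from reader-contributed ones. The "comments" div id is an assumption for the example; real templates vary widely, so a real crawler would need per-platform heuristics or an agreed-upon markup convention.

```python
# Sketch: split a blog page's links into "owner" and "reader" links by
# watching for a comments container (the id "comments" is assumed).
from html.parser import HTMLParser

class LinkClassifier(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0            # nesting depth inside the comments div
        self.owner_links = []
        self.reader_links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # start counting at the comments div; track nested divs too
            if self.depth or attrs.get("id") == "comments":
                self.depth += 1
        elif tag == "a" and "href" in attrs:
            bucket = self.reader_links if self.depth else self.owner_links
            bucket.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

page = """
<div class="entry"><a href="http://example.com/good">an editorial link</a></div>
<div id="comments">
  <p>Great post! <a href="http://spam.example/pills">buy pills</a></p>
</div>
"""
c = LinkClassifier()
c.feed(page)
print(c.owner_links)   # links eligible for ranking
print(c.reader_links)  # links to exclude from rank calculations
```

Only the first list would feed into the ranking calculation; the second would still be countable, just weightless.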
How's that for removing the incentive?
I bet you'd like to know what software the blog spammers use to run their own weblogs, wouldn't you?
Posted by jzawodn at December 18, 2004 06:53 PM
Ever have that feeling where you really want to join in a conversation, but know it's a bad idea?
I will say this: transparency is a cool thing, except in adversarial domains. And now I'm going to shut up.
Very interesting point. I think the search engines have to accept responsibility for this kind of spam, and so they should do something about it as soon as possible. In this post I suggest either a new markup tag or a specific DIV class name identifying content not authored by the site owner. (I tried to track back but received an error: "Ping 'http://jeremy.zawodny.com/mt/mt-tb.cgi?tb_id=3236' failed: Need a TrackBack ID (tb_id).".)
Heh.
Tim's right. That's exactly why I'm doing neither of the following (before anyone asks):
- providing URLs to the blog spammers' blogs
- telling you anything about how Y! Search deals with spam
Does that mean this isn't a discussion worth having? I think not.
Jeremy: Must have been an interesting chat that you had. I can't imagine meeting one of those people; your long and invisible adversary just coming right up to you and saying, "Hey, I'm a comment spammer. Nice to meet you."
Or something of the sort.
But even links in (legitimate) comments should contribute to PageRank: when I discuss some issue on my blog and people write comments with relevant links, then Google should rank those sites higher.
Your solution assumes that all links in comments are to spammers' sites.
Then again, solving the comment spam problem would probably be worth breaking PageRank just a little.
You mean breaking PageRank even more than it already is? ;-)
Sorry, couldn't resist.
But yeah, I'm saying that it's worth being a bit hard nosed about this. I'm sure others will disagree.
Just be glad they don't let me near the Y! Search code!
But sites can employ that distinction today. Simply change all externally provided links so that they're run through a redirection service.
Once *everybody* starts doing that, maybe some of this silliness will stop.
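A rough sketch of what Brad's rewriting might look like. The "/redirect?to=" endpoint here is made up for the example; each blog would host its own.

```python
# Sketch: rewrite every externally provided URL so it points at a local
# redirector instead of the target site. The crawler then sees a link
# to the blog's own redirector, not a direct "vote" for the target.
from urllib.parse import quote

def redirect_url(target, redirector="/redirect?to="):
    # percent-encode everything so the target survives as a query value
    return redirector + quote(target, safe="")

print(redirect_url("http://spam.example/pills?x=1"))
# -> /redirect?to=http%3A%2F%2Fspam.example%2Fpills%3Fx%3D1
```

The redirector script would decode the parameter and issue an HTTP redirect, which is exactly what costs the click-through its referrer header, as discussed below.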
Well, one of the reasons I HATE that, Brad, is because you lose the referrer information. If someone clicks through to my site from a comment on Simon's site, I have no idea where it came from. And referrer information is valuable.
Also, there is the issue of PageRank. Arguably, shouldn't people who make "good" comments be rewarded by increased PageRank? I would think so. And this is why people should be allowed to keep their regular URLs.
There are other ways to fight this problem.
Stu, I think it's a step in the right direction. Google could add another parameter which would identify the originating site. Something like this:
http://www.google.com/url?sa=D&q=URL&r=referring_url
Google could even figure out which direction the link should redirect. If the referring site matches the referring URL (or there is no referrer, as the case may be), it takes you to the URL given by the "q" parameter. If the referring site matches the "q" URL, it redirects to the "r" URL.
That would allow you to determine the source of the link. Of course, that could lead to more referrer spam, but that's already a problem, so it's not like it would make matters any worse.
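A sketch of that two-way decision, assuming the hypothetical "r" parameter described above (the real Google redirector takes no such parameter):

```python
# Sketch of a bidirectional redirect: send the visitor away from
# whichever side of the link they arrived from. "q" is the destination
# URL, "r" the referring page, per the hypothetical URL scheme above.
from urllib.parse import urlsplit

def choose_destination(q, r, referrer):
    host = urlsplit(referrer).netloc if referrer else None
    if host == urlsplit(q).netloc:
        return r   # arrived from the destination site: go back to the source
    return q       # arrived from the source (or no referrer): go to the destination

print(choose_destination("http://xyz.example/page",
                         "http://blog.example/post",
                         "http://blog.example/post"))
# -> http://xyz.example/page
```

Either way the click lands where the user expects, and the "r" parameter preserves the source of the link for anyone who cares to look.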
And then there is the issue of PageRank. Yes it is arguable. I would argue that any 3rd parties supplying links on my web site should not be counted by Google, since that counting is based on my PageRank. It's like saying, "Brad should be recommending site XYZ.com because I told him about it. So Google, please make a note of that." No. I don't know XYZ.com from ABC.com, and I'll link to it in my own post or sideblog if I think it should be recommended. The commenter is welcome to make plain links on their own weblog if they care so much about XYZ.com.
And yes, there are a bazillion ways to fight this, but links from other sites are one of the main incentives for comment spam. And through all this, I'm using PageRank in a generic way to describe the linking benefit, since we're all aware of that term.
And you know what I hate more than lost referrer data? Comment spam.
One obvious solution is to simply mark your weblog with a noindex so Google, Yahoo! and co don't index it. I had spam problems on a message board I run, which magically disappeared when I did that.
Of course, then you miss out on all the lovely search engine traffic...
A while back Lachlan Hunt proposed a metadata profile that'd do what you ask, if only the search engines would respect it. See his entry on it (scroll down and pay particular attention to the "unendorsed" relation):
http://lachy.id.au/blogs/log/2004/08/link-relationships.html
I agree that search engines will need to become smarter about the semantics of links; the better they get at recognizing good and relevant content, the fewer opportunities there will be for abuse. But ignoring any link in blog comments sounds more like a temporary hack than a partial solution; it would ignore a lot of relevant links.
The only reason redirection works right now is that the search engines don't handle it correctly. That needs to be fixed, because it is already being abused in other ways. A link going through one or more redirection services works just like a regular link for the user, so it should work just like a regular link in any link calculations.
There is an easy comment spam solution, but PageRank-obsessed bloggers don't seem to agree...
Funny that you mention it: in my German blog, I enabled Google redirection for all URLs in my comments just the other day.
Some people also noted that referrers break and that even "good" URLs in my comments won't get any PageRank.
I think this is pretty much the price you'll have to pay if you want to fight comment spam effectively. And it's worth it.
If there were redirectors for all comments' URLs in all of the major blog publishing systems, comment spam would simply be useless.
I take a more hardline approach, actually, and specifically exclude comment/trackback content from Google's index. This is a personal preference, however; not everyone would be so inclined. I wish there were a "proper" way to go about doing this, by adding markup to more granularly instruct search engines on what to index (as opposed to including/excluding whole pages and/or directories). You can read about it here.
Take a look at the Google cache for one of my individual archive pages to see what Google sees. I don't let Google index my sidebar and things like that either, since I don't want "photos" matching for every page of my site.
What I'd be interested in: If you met some of the folks that are responsible for a lot of comment spam, how come you just didn't beat them up? Is it because those guys happen to be good customers of Overture?
So what software do they use? Not like it'd be hard to figure out how to create your own.
Doesn't surprise me that comment spammers have blogs of their own. Are they filled with comment spam too? I really think you should list their URLs (why not?) and name them.
I am surprised that they would openly admit what they do, who they are, and what their URLs are at a search engine conference and to search engine employees.
If PageRank is "broken", then enlighten us all on how it can be "fixed".
What about using an image verification box like they do for whois lookups?
I personally think blog spamming won't stop when Y! or G stop counting these links. There will always be the chance that a newer SE (like MSN or Mirago) just doesn't have it in its algos yet. I think the biggest problem is that it's too easy and there's no way of reaching the blog spammers. Maybe bloggers should consider starting something like a "Do not comment-spam" list and offering it to the spammers. Maybe some of them would respect it. Just a thought...
Another possibility for marking up comments sections of pages would be to put a div around it (as many CMS templates already do) and then tell robots the div's class/id name in robots.txt. This, I think, would require the least change to various templates (if any change at all), and would enable spiders to identify the user-generated content ridiculously easily.
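To make that proposal concrete, here is a purely hypothetical example of what such a robots.txt extension might look like. No crawler supports anything like these directives today; the directive names and class names are invented for illustration.

```
User-agent: *
Disallow: /cgi-bin/
# Hypothetical directives -- not part of the robots.txt standard:
Noindex-class: comments
Noindex-id: trackback-list
```

A spider honoring this could index the page but skip (or de-weight) any content and links inside the named containers.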
I already redirect all external links through my own redirection handler. This helps me track from which page people leave my site and where they go, but it doesn't stop spammers blindly adding my mt-comments.cgi to their list of URLs to hit. I'm sure none of them take the time to notice whether their links will go direct or not. (And yes, this means HTTP_REFERER gets messed up, but in my opinion exit tracking is so useful that this will eventually become the common paradigm, and some other mechanism for passing the "original" referrer should be devised.)
Meanwhile I've implemented a CAPTCHA system for comments left on my site. Seems to be the simplest way to prevent at least automated spammers, meaning I only have to delete the occasional hand-placed spam rather than hundreds of the things.
As a blog software developer, I have been trying to make my software spam resistant. I have found CAPTCHA to be the best solution, as it prevents automated spamming. Imagine how little spam we would get if every email server had a CAPTCHA.
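For what it's worth, the server-side flow of any CAPTCHA (image-based or not) boils down to something like this sketch, which uses a trivial arithmetic question in place of a distorted image:

```python
# Sketch: issue a challenge, keep the expected answer server-side
# (e.g. in the session), and verify it when the comment is submitted.
# A real CAPTCHA renders a distorted image; the flow is the same.
import random

def make_challenge():
    a, b = random.randint(1, 9), random.randint(1, 9)
    return f"What is {a} + {b}?", str(a + b)

def verify(answer, expected):
    return answer.strip() == expected

question, expected = make_challenge()
# "question" goes into the comment form; "expected" is checked on submit
print(verify(" 12 ", "12"))  # -> True
```

The point isn't that the puzzle is hard; it's that a dumb script POSTing to mt-comments.cgi never sees the question at all.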
Joe, Ross, Eaden:
I think it's important that we make it absolutely as easy as possible to use the software, but I am beginning to think that maybe instituting a CAPTCHA might not be such a bad idea.
HOWEVER, that doesn't help with the problem of trackback/pingback spam... I guess that the best thing would be to have a CAPTCHA for comments and just approve all trackbacks and pingbacks.
"Would you be surprised to know that some people who work in the search engine "industry" know who is responsible for a lot of the comment spam out there? I met some of them recently. And some of them even have blogs of their own. Seriously."
Not only would that not surprise me in the least, but it also would not surprise me if they were useful contributors to my blog or yours. In fact, I already know they are.
As much as it is human nature to paint these people as either complete morons (dangerous) or demons with horns and a pointy tail (victimizing), it's not difficult for a relatively intelligent person to put themselves into their shoes and figure out how they themselves would game the system.
In any battle, that's precisely what you have to do to win. This one is no different.
"Simply change all externally provided links so that they're run through a redirection service.
[...]
"Well, one of the reasons I HATE that, Brad, is because because you lose the referrer information."
It doesn't have to be this way. Simon Willison and I came up with a system last year that keeps bi-directional state so that whichever way you click on the referrer, you go to the correct destination. We didn't get around to coding it, but I'm sure it will make its appearance at some point.
"Arguably, shouldn't people who make "good" comments be rewarded by increased PageRank? I would think so. And this is why people should be allowed to keep their regular URLs."
So where do you draw the line between a "good" comment and a bad one? Would someone saying "Me too" be a bad comment unworthy of PageRank? Sounds like a tenuous value judgement.
"I am surprised that they would openly admit what they do, who they are, and what their URLs are at a search engine conference and to search engine employees."
See, I'm not surprised at all. To these people, it's a business. This is not the emotional issue that it is for us. They make their money doing what they do and that's why they do it. Just because they are smart enough to squeeze profit out of a broken system doesn't make them evil.
Of course, my personal feeling is that it's unethical, but prospectors and con-men never believe themselves to be unethical, only opportunistic.
I am surprised that none of the major blogging tools or hosts (please correct me) have integrated Bayesian filters into their comment systems. While not perfect, they are extremely effective.
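For the curious, the core of a Bayesian comment filter is only a few lines. The token probabilities below are invented toy numbers; a real filter would learn them from corpora of ham and spam comments, à la Paul Graham's "A Plan for Spam."

```python
# Minimal sketch of Bayesian comment scoring. The per-token
# probabilities P(spam | token) are toy values for illustration;
# a trained filter estimates them from real comment corpora.
import math

token_prob = {"viagra": 0.99, "casino": 0.98, "great": 0.6,
              "pagerank": 0.4, "thanks": 0.55, "movabletype": 0.1}

def spam_score(comment):
    """Combine per-token probabilities into an overall spam probability."""
    probs = [token_prob[t] for t in comment.lower().split() if t in token_prob]
    if not probs:
        return 0.5  # no evidence either way
    # the usual naive-Bayes combination formula
    p = math.prod(probs)
    q = math.prod(1 - x for x in probs)
    return p / (p + q)

print(spam_score("great casino viagra deals"))  # well above 0.5
print(spam_score("movabletype pagerank"))       # well below 0.5
```

As the next comments point out, though, this only works when the spam's *text* differs from a normal comment's, which weblog spam often doesn't.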
My trackback to you still seems to be having problems, so here is the url to my thoughts:
http://joseph.randomnetworks.com/archives/2004/12/20/fixing-search-engines-wont-stop-comment-spam/
Basically, search engines ignoring comment spam isn't going to be enough.
"I am surprised that none of the major blogging tools or hosts (please correct me) have integrated Bayesian filters into their comment systems. While not perfect, they are extremely effective."
For email spam, yes. But how would Bayesian filtering help when the only difference between a good comment and a spam comment is a single URL?
I've got a plugin that I've created that uses a whitelist of URLs that I'll allow in comments. If you're not blessed, you get shot through Google.
This, piled on top of a few honeypots and flaggers has dropped my spam count to zero. I've made the return small enough for smart spammers to waste time trying to figure out how to beat it, and steep enough for stupid spammers to just be blocked outright. Spamming is a crime of opportunity.
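A rough sketch of that whitelist approach. The hosts and the redirector URL format here are illustrative, not the plugin's actual implementation:

```python
# Sketch: links to whitelisted hosts pass through untouched; everything
# else is rewritten through a redirector so it carries no ranking weight.
from urllib.parse import quote, urlsplit

WHITELIST = {"example.org", "zawodny.com"}   # hosts "blessed" by the blog owner

def filter_link(url):
    host = urlsplit(url).netloc
    if host in WHITELIST or any(host.endswith("." + h) for h in WHITELIST):
        return url  # blessed: keep the direct link
    # not blessed: shoot it through a redirector
    return "http://www.google.com/url?q=" + quote(url, safe="")

print(filter_link("http://www.example.org/post"))   # passes through
print(filter_link("http://pills.example.net/buy"))  # rewritten
```

The economics are the point: a spammer gains nothing unless a human blesses their URL first, which makes the effort not worth it.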
Do tell! What software do the comment spammers use to run their own blogs?
"For email spam, yes. But how would Bayesian filtering help when the only difference between a good comment and a spam comment is a single URL?"
How many good comments are a single URL? What are the qualities of comment spam that make it different? Surely it has to have some attention-grabbing text?
I'm sure you can find (with a bit of trial and error) a reasonable tokenisation that would create sufficient statistical bandwidth.
Something for my to-do list anyway!
"How many good comments are a single URL?"
First of all, I didn't say that the spam would contain just a single URL. What I said was that the difference between a good comment and a spam comment was the inclusion of a single URL. Bayesian filtering works well when you can identify parts of the text that are spammish. When a spammer's comment is precisely the same as a normal comment (even on topic and contextual), with the exception of a single link posted for the purpose of increasing PageRank, Bayesian is impossible to train.
"What are the qualities of comment spam that make it different? Surely it has to have some attention-grabbing text?"
You've never seen a piece of weblog spam, have you? Weblog spam is not the same as email spam. It is not trying to get the user to click on anything or get your attention. In fact, it wants the least possible attention it can get from everyone except a single visitor called the GoogleBot.
On CAPTCHA, Bongo feels like taking the SAT test. Gimpy has just about made me go blind.
I wonder if things like CAPTCHA will start an AI race and produce winners for the Turing Test.
"You've never seen a piece of weblog spam, have you?"
Not many no :) Hmm, yup that's a hard problem alright. Thanks for the extra info, I kinda suspected I was missing something.
I just closed comments on all my posts other than those appearing on the front page. Before I did this I was receiving hundreds of spam comments a day. Since I took this step, I have been getting about one spam comment a month.
I manage a software development company; one of our services is mass e-mailing (NOT spam, mind you), and another is (free of charge) blogging via place-centered sites like chattablogs.com.
Being part owner, it's a choice I make, but the time and resources we put into fighting comment spam are becoming financially ridiculous. It's a non-stop war: we have so many blog sites and so many bloggers that we get one step ahead of the spammers and then they come back. They've even started what appear to be DoS attacks, somewhat out of frustration. They've been leaving little "Hey, F you terrablogs" messages in there. It's getting ridiculous.
Jez wrote "I just closed comments on all my posts other than those appearing on the front page. Before I did this I was receiving hundreds of spam comments a day. Since I took this step, I have been getting about one spam comment a month."
Oh, there's no doubt that it works. You are essentially reducing your exposure to the spammers as well as eliminating commenting on the posts with the highest Page Rank.
Of course, it's not a 100% solution. You could reduce the size of the window to, say, only the last post and only for 30 seconds after posting. That ought to do it, with the added benefit of creating a fun race for your visitors.
Eventually though, spammers will be waiting in the wings too, reloading your page night and day waiting for you to post. At that point, you should use HTTP authentication to keep spammers away from the comment form entirely. Of course, it will also keep just about everyone away from your content, but it's worth it to get that spam to zero.
Finally, if by some curse of the Gods they are STILL able to break in, you should ask your web host to uninstall the ethernet cable from the back of the server. While this may make it slightly less convenient for you in accessing your server, you will undoubtedly cut out 100% of your spam problem.
See? Easy peasy...
One phenomenon that's new to me: weblogs whose content consists entirely of spam links.
Owner! Thanks for your site!
diet food
http://diet-food-x.tripod.com/
Very useful site! Thanks!
diet fitness
http://jabcomics.slutsfiles.com/
Good job! Good design.
The best wishes to your team!
bloc party
http://bloc-party.slutsfiles.com
Good design and interesting information on your site!
sams club
http://sams-club.slutsfiles.com/
It seems that the posts on this blog that mention comment spam are the most likely targets here for comment spam. That's rather odd, don't ya think?