July 13, 2005
Defeating Captchas, Scanning Images for Relevancy, etc.
I came across a few sites such as this one. Pretty interesting, but no big surprises here…we came up with captchas over at Citysearch when we created a personalization portion a while back. As a side note on captchas, check out this article.
I was also investigating, in my spare time and just out of curiosity, adding OCR functionality to a web crawler to harvest additional keyword text from pages for search indexing (in addition to reading the images' EXIF data and/or image comments, if present). I wasn't doing this to index images and create an image search (where I could have used additional metadata around the image)…I was just doing this to squeeze more context out of the page the image was on (or the page it linked to…sometimes there were buttons with text, and that single word could have described the entire contents of the linked page better than anything else). I'm unaware of any engine doing this. I didn't complete my tests of how much it improved relevance or helped figure out context, but decoding the text within the images was about 60% accurate (things like neon or drop-shadow effects on the text threw it off).
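To make the idea concrete, here is a rough sketch of the keyword-harvesting step. It is not the crawler I was experimenting with, just an illustration under some assumptions: the names `ImageCollector` and `harvest_image_keywords` are made up for this example, and `ocr` is a placeholder for whatever OCR engine you plug in (any callable that takes an image URL and returns the decoded text). Decoded words are credited both to the page the image sits on and to the page the image links to, if any.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collects every <img> on a page, plus the link target it sits inside (if any)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []        # list of (image_url, linked_page_url or None)
        self._href_stack = []   # hrefs of the <a> tags we are currently inside

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            href = attrs.get("href")
            self._href_stack.append(urljoin(self.base_url, href) if href else None)
        elif tag == "img" and attrs.get("src"):
            target = self._href_stack[-1] if self._href_stack else None
            self.images.append((urljoin(self.base_url, attrs["src"]), target))

    def handle_endtag(self, tag):
        if tag == "a" and self._href_stack:
            self._href_stack.pop()


def harvest_image_keywords(html, base_url, ocr):
    """Return {page_url: set_of_keywords} built from text decoded out of images.

    `ocr` is a hypothetical hook: any callable mapping an image URL to a string.
    """
    collector = ImageCollector(base_url)
    collector.feed(html)
    keywords = {}
    for image_url, linked_page in collector.images:
        # Keep lowercase alphanumeric runs of 3+ characters as candidate keywords.
        words = set(re.findall(r"[a-z0-9]{3,}", ocr(image_url).lower()))
        if not words:
            continue
        # Credit the words to the hosting page and, if present, the linked page.
        for page in (base_url, linked_page):
            if page:
                keywords.setdefault(page, set()).update(words)
    return keywords
```

In a real crawler the `ocr` callable would download the image and run it through an OCR engine; given the roughly 60% decoding accuracy mentioned above, the harvested words are probably best treated as a weak signal rather than trusted page text.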