====== FuzzyOCR for SpamAssassin on Debian ======

Image spam has become more common lately, and the best way to get SpamAssassin to recognize keywords in attached images is OCR (Optical Character Recognition).

====== Installation ======

OCRAD is the easiest OCR engine to use on Debian 4.0 because the packaged version is reasonably current.

  aptitude install ocrad

Two years later, a FuzzyOcr release supporting SpamAssassin 3.2 has still not been tagged, so it is probably easiest to use the Debian unstable package. Because it is written in Perl, it does not seem to have any unreasonable version dependencies. Check the Debian pool for the latest version before downloading:

  wget -c http://ftp.us.debian.org/debian/pool/main/f/fuzzyocr/fuzzyocr_3.5.1+svn135-1_all.deb
  dpkg -i fuzzyocr_3.5.1+svn135-1_all.deb
  apt-get -f install

This version has a bug discussed here:

You will want to make the modifications after installation.
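Because this change touches a file owned by the package, it may be worth keeping a pristine copy first so the edit can be reviewed or reverted later. The following is only a sketch: the `backup_once` helper and the `.orig` suffix are illustrative choices, not part of FuzzyOcr or the Debian tooling.

```shell
#!/bin/sh
# backup_once FILE: copy FILE to FILE.orig unless a backup already exists.
# The helper name and the .orig suffix are assumptions made for illustration.
backup_once() {
    src=$1
    # Skip quietly if the file is missing (for example, package not installed).
    if [ ! -f "$src" ]; then
        echo "skip: $src not found" >&2
        return 0
    fi
    # Never overwrite an earlier backup, so the first copy stays pristine.
    if [ -e "$src.orig" ]; then
        return 0
    fi
    cp -p "$src" "$src.orig"
}

backup_once /usr/share/perl5/FuzzyOcr/Preprocessor.pm
```

After editing, `diff -u Preprocessor.pm.orig Preprocessor.pm` shows exactly what was changed.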

  vi /usr/share/perl5/FuzzyOcr/Preprocessor.pm

====== Create a FuzzyOCR home ======

I wanted to keep the FuzzyOCR log files and image hash databases in one place, so I created a directory for them.

  mkdir /var/lib/spamassassin/fuzzyocr
  touch /var/lib/spamassassin/fuzzyocr/FuzzyOcr.log
  chown -R spamd: /var/lib/spamassassin/fuzzyocr

Then make a few configuration changes:

  @@ -34,7 +34,7 @@
   # Level 2 - Errors, Warnings and Info Messages
   # Level 3 - Full debug output
   # Default value: 1
  -#focr_verbose 3
  +focr_verbose 2
   
   # Log Message-Id, From, To
   # Default: 1
  @@ -42,6 +42,6 @@
   # Send logging output to stderr.
   # Default value: 1
  -#focr_log_stderr 0
  +focr_log_stderr 1
   
   # Logfile (make sure it is writable by the plugin)
   # Default value: none
  @@ -179,7 +179,7 @@
   
   # Timeout for the plugin, in seconds. (Maximum runtime of the plugin)
   # Default value: 10
  -#focr_timeout 15
  +focr_timeout 15
   
   # Use a global timeout value instead of per helper application.
   # Default value: 0
  @@ -299,7 +299,7 @@
   # skip the scans when the image is found in the database, using the score
   # from the previous scans.
   #--
  -#focr_enable_image_hashing 3
  +focr_enable_image_hashing 2
   
   # Set this to skip updating the hashing database at startup
   # Default value: 0 (update at startup)
  @@ -323,16 +323,16 @@
   # If the image hash db feature is enabled (Type 2 Hashing),
   # specify the file to use as the SPAM database
   # Default value: /etc/spamassassin/FuzzyOcr.db
  -#focr_db_hash /etc/spamassassin/FuzzyOcr.db
  +focr_db_hash /var/lib/spamassassin/fuzzyocr/FuzzyOcr.db
   
   # If the image hash db feature is enabled (Type 2 Hashing),
   # specify the file to use as the HAM database
   # Default value: /etc/spamassassin/FuzzyOcr.safe.db
  -#focr_db_safe /etc/spamassassin/FuzzyOcr.safe.db
  +focr_db_safe /var/lib/spamassassin/fuzzyocr/FuzzyOcr.safe.db
   
   # Auto-prune: Expire records from hasing databases after these many days
   # Default value: 35
  -#focr_db_max_days 15
  +focr_db_max_days 15
   
   ###
   ### MySQL options (Type 3 Hashing)

Restart SpamAssassin and test:

  /etc/init.d/spamassassin restart
  tail -f /var/lib/spamassassin/fuzzyocr/FuzzyOcr.log

====== Maintenance ======

Create a logrotate file /etc/logrotate.d/fuzzyocr:

  /var/lib/spamassassin/fuzzyocr/FuzzyOcr.log {
      daily
      missingok
      rotate 10
      compress
      delaycompress
      notifempty
      create 640 spamd spamd
  }

Schedule a daily cleanup in cron to remove temporary images. Open the crontab for the spamd user:

  crontab -e -u spamd

and add this line:

  @daily perl /usr/share/doc/fuzzyocr/Utils/fuzzy-clean
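
====== Checking the active settings ======

To confirm which options from the configuration changes earlier on this page are actually in effect, it can help to list the `focr_` lines that are not commented out. This is a minimal sketch rather than an official tool: the `focr_active` helper name is made up for illustration, and the path `/etc/spamassassin/FuzzyOcr.cf` is an assumption, so adjust it to wherever your package installed the configuration file.

```shell
#!/bin/sh
# focr_active FILE: print the focr_* options that are not commented out.
# The function name is illustrative, not part of FuzzyOcr.
focr_active() {
    grep -E '^[[:space:]]*focr_[a-z0-9_]+[[:space:]]' "$1"
}

# Assumed location of the plugin configuration; adjust to your install.
conf=/etc/spamassassin/FuzzyOcr.cf
if [ -r "$conf" ]; then
    focr_active "$conf"
fi
```

With the changes from this page applied, the output should include `focr_verbose 2`, `focr_timeout 15` and the two database paths under /var/lib/spamassassin/fuzzyocr.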