FuzzyOCR for SpamAssassin on Debian

Image spam has become more common lately, and the best way to get SpamAssassin to recognize keywords in attached images is OCR (Optical Character Recognition).

Installation

OCRAD is the easiest OCR engine to use on Debian 4.0 because the packaged version is reasonably current.

aptitude install ocrad
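
After installing, it can be worth confirming that the OCR engine is actually on the PATH. This is only a minimal sketch; have_tool is a hypothetical helper name, not something shipped with OCRAD or FuzzyOCR.

```shell
# have_tool (hypothetical helper): succeeds if the named command is on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool ocrad; then
    echo "ocrad: found at $(command -v ocrad)"
else
    echo "ocrad: missing, re-run the install step"
fi
```

Keep in mind that spamd may run with a different PATH than your login shell.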

Two years on, a FuzzyOCR release supporting SpamAssassin 3.2 still has not been tagged, so it is probably easiest to just use the Debian unstable package. Because it is written in Perl, it does not seem to have any unreasonable version dependencies.

You will want to check for the latest version here: http://packages.debian.org/fuzzyocr

wget -c http://ftp.us.debian.org/debian/pool/main/f/fuzzyocr/fuzzyocr_3.5.1+svn135-1_all.deb
dpkg -i fuzzyocr_3.5.1+svn135-1_all.deb
apt-get -f install

This version has a bug discussed here:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=522285

You will want to make the modifications discussed there after installation.

vi /usr/share/perl5/FuzzyOcr/Preprocessor.pm
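
Since this edits a file owned by the package, I would keep a pristine copy first so the change can be compared or reverted later. A small sketch, where backup_once is a hypothetical helper name:

```shell
# backup_once (hypothetical helper): copy FILE to FILE.orig unless a backup
# already exists, so repeated runs never overwrite the pristine copy.
backup_once() {
    [ -f "$1" ] || { echo "no such file: $1" >&2; return 1; }
    [ -e "$1.orig" ] || cp -p "$1" "$1.orig"
}

pm=/usr/share/perl5/FuzzyOcr/Preprocessor.pm
if backup_once "$pm"; then
    echo "backup kept at $pm.orig"
else
    echo "no backup made for $pm"
fi
```

After editing, diff -u "$pm.orig" "$pm" shows exactly what was changed.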

Create a FuzzyOCR home

I wanted to keep the FuzzyOCR log files and image hash databases in one place, so I created a directory for them.

mkdir /var/lib/spamassassin/fuzzyocr
touch /var/lib/spamassassin/fuzzyocr/FuzzyOcr.log
chown -R spamd: /var/lib/spamassassin/fuzzyocr

Then make a few configuration changes:

@@ -34,7 +34,7 @@                                                     
 # Level 2 - Errors, Warnings and Info Messages                       
 # Level 3 - Full debug output                                        
 # Default value: 1                                                   
-#focr_verbose 3                                                      
+focr_verbose 2                                                       
                                                                      
 # Log Message-Id, From, To                                           
 # Default: 1                                                         
@@ -42,6 +42,6 @@                                                   
                                                                      
 # Send logging output to stderr.                                     
 # Default value: 1                                                   
-#focr_log_stderr 0                                                   
+focr_log_stderr 1                                                    

 # Logfile (make sure it is writable by the plugin)
 # Default value: none
@@ -179,7 +179,7 @@

 # Timeout for the plugin, in seconds. (Maximum runtime of the plugin)
 # Default value: 10
-#focr_timeout 15
+focr_timeout 15

 # Use a global timeout value instead of per helper application.
 # Default value: 0
@@ -299,7 +299,7 @@
 # skip the scans when the image is found in the database, using the score
 # from the previous scans.
 #--
-#focr_enable_image_hashing 3
+focr_enable_image_hashing 2

 # Set this to skip updating the hashing database at startup
 # Default value: 0 (update at startup)
@@ -323,16 +323,16 @@
 # If the image hash db feature is enabled (Type 2 Hashing),
 # specify the file to use as the SPAM database
 # Default value: /etc/spamassassin/FuzzyOcr.db
-#focr_db_hash /etc/spamassassin/FuzzyOcr.db
+focr_db_hash /var/lib/spamassassin/fuzzyocr/FuzzyOcr.db

 # If the image hash db feature is enabled (Type 2 Hashing),
 # specify the file to use as the HAM database
 # Default value: /etc/spamassassin/FuzzyOcr.safe.db
-#focr_db_safe /etc/spamassassin/FuzzyOcr.safe.db
+focr_db_safe /var/lib/spamassassin/fuzzyocr/FuzzyOcr.safe.db

 # Auto-prune: Expire records from hasing databases after these many days
 # Default value: 35
-#focr_db_max_days 15
+focr_db_max_days 15

 ###
 ### MySQL options (Type 3 Hashing)
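
To double-check what ended up enabled after the edit, one option is to list the uncommented focr_ lines. This sketch assumes the configuration lives at /etc/spamassassin/FuzzyOcr.cf, which is my guess and not stated above, so adjust the path if yours differs. active_settings is a hypothetical helper name.

```shell
# active_settings (hypothetical helper): print the uncommented focr_ options
# from the given configuration file.
active_settings() {
    grep -E '^[[:space:]]*focr_' "$1"
}

cf=/etc/spamassassin/FuzzyOcr.cf   # assumed location, adjust as needed
if [ -r "$cf" ]; then
    active_settings "$cf" || echo "no active focr_ settings in $cf"
else
    echo "cannot read $cf"
fi
```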

Restart SpamAssassin and watch the log to test:

/etc/init.d/spamassassin restart
tail -f /var/lib/spamassassin/fuzzyocr/FuzzyOcr.log
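
If nothing shows up in the log, it may help to confirm that SpamAssassin still parses its configuration with the plugin enabled. spamassassin --lint is the standard syntax check; the lint_config wrapper and its optional argument below are only illustrative.

```shell
# lint_config (hypothetical wrapper): run SpamAssassin's configuration check.
# The optional first argument names the binary; it defaults to "spamassassin".
lint_config() {
    sa=${1:-spamassassin}
    if ! command -v "$sa" >/dev/null 2>&1; then
        echo "$sa not found on PATH" >&2
        return 127
    fi
    "$sa" --lint
}

if lint_config; then
    echo "configuration parses cleanly"
else
    echo "lint failed or spamassassin is unavailable"
fi
```

Adding -D to the lint run prints debug output, which can be filtered for plugin messages, for example: spamassassin -D --lint 2>&1 | grep -i fuzzyocr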

Maintenance

Create a logrotate file /etc/logrotate.d/fuzzyocr:

/var/lib/spamassassin/fuzzyocr/FuzzyOcr.log {
        daily
        missingok
        rotate 10
        compress
        delaycompress
        notifempty
        create 640 spamd spamd
}
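
The new stanza can be checked without rotating anything by running logrotate in debug mode. A sketch, with dry_run_rotate as a hypothetical helper name:

```shell
# dry_run_rotate (hypothetical helper): ask logrotate to explain what it
# would do for one config file. -d is debug mode, so nothing is rotated.
dry_run_rotate() {
    [ -r "$1" ] || { echo "cannot read $1" >&2; return 2; }
    command -v logrotate >/dev/null 2>&1 || {
        echo "logrotate not on PATH" >&2
        return 127
    }
    logrotate -d "$1"
}

dry_run_rotate /etc/logrotate.d/fuzzyocr || echo "dry run did not complete"
```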

Schedule a daily cleanup in cron to remove temporary images. Edit the crontab for the spamd user:

crontab -e -u spamd

and add this entry:

@daily perl /usr/share/doc/fuzzyocr/Utils/fuzzy-clean