Diff: FuzzyOcr for SpamAssassin on Debian
 Ownership:  rw-rw-r-- ian linux
 Modified:  13 May 09, 20:48
 Modified by:  Ian Samuel (ian)
Rev.:  8 (Old)
 
 Ownership:  rw-rw-r-- ian linux
 Modified:  21 Apr 10, 10:58
 Modified by:  Ian Samuel (ian)
Rev.:  9 (Current)


+ %TITLE%

<toc>

Image spam has become more common lately. The best way to get SpamAssassin to recognize keywords in attached images is OCR (Optical Character Recognition), which the FuzzyOcr plugin adds.

+ Installation

OCRAD is the easiest OCR engine to use on Debian 4.0 because the packaged version is reasonably current.

<code>
aptitude install ocrad
</code>

Two years later, a FuzzyOcr release supporting SpamAssassin 3.2 has still not been tagged, so it is probably easiest to use the package from Debian unstable. Because it is written in Perl, it does not seem to have any unreasonable version dependencies.

You will want to check for the latest version here: http://packages.debian.org/fuzzyocr

<code>
wget -c http://ftp.us.debian.org/debian/pool/main/f/fuzzyocr/fuzzyocr_3.5.1+svn135-1_all.deb
dpkg -i fuzzyocr_3.5.1+svn135-1_all.deb
apt-get -f install
</code>

This version has a bug discussed here:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=522285

You will want to apply the modifications described in that bug report after installation.

<code>
vi /usr/share/perl5/FuzzyOcr/Preprocessor.pm
</code>
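
Since this step edits a file owned by the package, it may be worth keeping a pristine copy first. The following is only a sketch: `backup_once` is a hypothetical helper name, not part of FuzzyOcr or Debian, and the `.orig` suffix is an arbitrary choice.

```shell
# Hypothetical helper: copy a file to FILE.orig unless that backup already exists.
# Usage: backup_once FILE
backup_once() {
    src="$1"
    bak="$src.orig"
    if [ -e "$bak" ]; then
        # Keep the first backup so the pristine copy is never overwritten.
        echo "backup already exists: $bak"
        return 0
    fi
    cp -p "$src" "$bak" && echo "saved: $bak"
}
```

For example, `backup_once /usr/share/perl5/FuzzyOcr/Preprocessor.pm` before opening the file in an editor. A later package upgrade will probably replace the edited file, so the change may need to be reapplied.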

+ Create a FuzzyOCR home

I wanted to keep the FuzzyOcr log files and image hash databases in one place, so I created a directory for them.

<code>
mkdir /var/lib/spamassassin/fuzzyocr
touch /var/lib/spamassassin/fuzzyocr/FuzzyOcr.log
chown -R spamd: /var/lib/spamassassin/fuzzyocr
</code>
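
A quick sanity check of the new directory can save some debugging later, because the plugin cannot log if the file is not writable. This is a minimal sketch; `check_focr_home` is a hypothetical helper name, and it only tests what the commands above set up.

```shell
# Hypothetical helper: report whether a FuzzyOcr home directory looks usable.
# Usage: check_focr_home DIR
check_focr_home() {
    dir="$1"
    log="$dir/FuzzyOcr.log"
    [ -d "$dir" ] || { echo "missing directory: $dir"; return 1; }
    [ -f "$log" ] || { echo "missing log file: $log"; return 1; }
    # Writability is tested for the current user only.
    [ -w "$log" ] || { echo "log file not writable: $log"; return 1; }
    echo "ok: $dir"
}
```

Run it as the spamd user, for example `check_focr_home /var/lib/spamassassin/fuzzyocr`. As root the writability test always passes, so it says nothing about spamd's permissions.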

Then make a few changes to the FuzzyOcr configuration file:

<code>
@@ -34,7 +34,7 @@
# Level 2 - Errors, Warnings and Info Messages
# Level 3 - Full debug output
# Default value: 1
-#focr_verbose 3
+focr_verbose 2

# Log Message-Id, From, To
# Default: 1
@@ -42,11 +42,11 @@

# Send logging output to stderr.
# Default value: 1
-#focr_log_stderr 0
+focr_log_stderr 1

# Logfile (make sure it is writable by the plugin)
# Default value: none
-#focr_logfile /tmp/FuzzyOcr.log
+focr_logfile /var/lib/spamassassin/fuzzyocr/FuzzyOcr.log

###
### Wordlists

@@ -179,7 +179,7 @@

# Timeout for the plugin, in seconds. (Maximum runtime of the plugin)
# Default value: 10
-#focr_timeout 15
+focr_timeout 15

# Use a global timeout value instead of per helper application.
# Default value: 0
@@ -299,7 +299,7 @@
# skip the scans when the image is found in the database, using the score
# from the previous scans.
#--
-#focr_enable_image_hashing 3
+focr_enable_image_hashing 2

# Set this to skip updating the hashing database at startup
# Default value: 0 (update at startup)
@@ -323,16 +323,16 @@
# If the image hash db feature is enabled (Type 2 Hashing),
# specify the file to use as the SPAM database
# Default value: /etc/spamassassin/FuzzyOcr.db
-#focr_db_hash /etc/spamassassin/FuzzyOcr.db
+focr_db_hash /var/lib/spamassassin/fuzzyocr/FuzzyOcr.db

# If the image hash db feature is enabled (Type 2 Hashing),
# specify the file to use as the HAM database
# Default value: /etc/spamassassin/FuzzyOcr.safe.db
-#focr_db_safe /etc/spamassassin/FuzzyOcr.safe.db
+focr_db_safe /var/lib/spamassassin/fuzzyocr/FuzzyOcr.safe.db

# Auto-prune: Expire records from hasing databases after these many days
# Default value: 35
-#focr_db_max_days 15
+focr_db_max_days 15

###
### MySQL options (Type 3 Hashing)
</code>
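
After editing, it can help to list which `focr_` settings are actually active, since a directive that is still commented out is easy to miss. This is a minimal sketch; `active_focr_settings` is a hypothetical helper name, and the configuration file path is passed as an argument because it depends on the installation.

```shell
# Hypothetical helper: print the uncommented focr_ directives in a config file.
# Usage: active_focr_settings FILE
active_focr_settings() {
    # Lines starting with '#' are skipped because the pattern requires
    # focr_ to follow only optional leading whitespace.
    grep -E '^[[:space:]]*focr_' "$1"
}
```

The output should include the values from the diff above, such as `focr_verbose 2` and `focr_timeout 15`. Running `spamassassin --lint` afterwards is a reasonable extra check that the rules still parse.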

Restart SpamAssassin and watch the log to test:

<code>
/etc/init.d/spamassassin restart
tail -f /var/lib/spamassassin/fuzzyocr/FuzzyOcr.log
</code>

+ Maintenance

Create a logrotate file */etc/logrotate.d/fuzzyocr*:

<code>
/var/lib/spamassassin/fuzzyocr/FuzzyOcr.log {
daily
missingok
rotate 10
compress
delaycompress
notifempty
create 640 spamd spamd
}
</code>

Schedule a daily cleanup in cron to remove temporary images:

<code>
crontab -e -u spamd
</code>

<code>
@daily perl /usr/share/doc/fuzzyocr/Utils/fuzzy-clean
</code>

