Rspamd 2.6 has been released

We have released Rspamd 2.6 today.

This release includes several major projects: various improvements to the neural network plugin, better bitcoin scam detection, conditional regular expressions, and other code reworks such as support for shadow results. Numerous bug fixes, including some critical ones, have also been applied during this release cycle.

Here is a list of the major projects and serious bugfixes where applicable.

Neural network plugin rework

Rspamd now includes a PCA method to reduce the dimensionality of the input space in heavily customised environments with many rules. This method transforms the whole rule set into a fixed number of neural network inputs using a linear transformation. Other improvements to the neural network plugin in this release include the following:

Probabilistic learning method for cases where spam and ham samples are not balanced (useful when the amounts of spam and ham differ significantly)
A configurable maximum number of inputs for the ANN (via PCA prefiltering)
Reworked the internal structure of ANN (more hidden layers and fixed the output function)
Low-level tensor library to speed up matrix operations
BLIS algebra library support
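
To make the PCA idea above more concrete, here is a minimal sketch in plain Lua of reducing a wide input vector to a fixed number of inputs with a linear transformation. It is not Rspamd's actual implementation: every function name here (column_means, covariance, top_eigenvector, pca_fit, pca_transform) is hypothetical, and the components are found by simple power iteration with deflation rather than by an optimised algebra library.

```lua
-- Illustrative sketch only; names and algorithm choice are assumptions,
-- not the code used by the Rspamd neural plugin.

-- Per-column means of a list of sample rows
local function column_means(rows)
  local n, d = #rows, #rows[1]
  local mean = {}
  for j = 1, d do
    local s = 0
    for i = 1, n do s = s + rows[i][j] end
    mean[j] = s / n
  end
  return mean
end

-- Sample covariance matrix (d x d) of the centred rows
local function covariance(rows, mean)
  local n, d = #rows, #rows[1]
  local cov = {}
  for a = 1, d do
    cov[a] = {}
    for b = 1, d do
      local s = 0
      for i = 1, n do
        s = s + (rows[i][a] - mean[a]) * (rows[i][b] - mean[b])
      end
      cov[a][b] = s / (n - 1)
    end
  end
  return cov
end

-- Leading eigenvector and eigenvalue by power iteration.
-- Assumes the start vector is not orthogonal to the leading eigenvector.
local function top_eigenvector(m, iters)
  local d = #m
  local v, lambda = {}, 0
  for i = 1, d do v[i] = 1 / math.sqrt(d) end
  for _ = 1, iters do
    local w, norm = {}, 0
    for a = 1, d do
      local s = 0
      for b = 1, d do s = s + m[a][b] * v[b] end
      w[a] = s
      norm = norm + s * s
    end
    norm = math.sqrt(norm)
    if norm == 0 then break end
    for a = 1, d do v[a] = w[a] / norm end
    lambda = norm
  end
  return v, lambda
end

-- Fit k principal components, removing each one from the covariance
-- matrix (deflation) before searching for the next
local function pca_fit(rows, k)
  local mean = column_means(rows)
  local cov = covariance(rows, mean)
  local components = {}
  for c = 1, k do
    local v, lambda = top_eigenvector(cov, 200)
    components[c] = v
    for a = 1, #cov do
      for b = 1, #cov do
        cov[a][b] = cov[a][b] - lambda * v[a] * v[b]
      end
    end
  end
  return { mean = mean, components = components }
end

-- Linear transformation: project a centred input onto the k components
local function pca_transform(model, x)
  local out = {}
  for c, v in ipairs(model.components) do
    local s = 0
    for j = 1, #x do s = s + v[j] * (x[j] - model.mean[j]) end
    out[c] = s
  end
  return out
end

-- Usage: three rule inputs are reduced to a single network input
local samples = { {1, 2, 0}, {2, 4, 0}, {3, 6, 0}, {4, 8, 0} }
local model = pca_fit(samples, 1)
local reduced = pca_transform(model, {1, 2, 0})
```

The point of the sketch is that the number of outputs depends only on k, not on how many rules produced the input vector, which is what lets the network keep a fixed input size.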

Reworked bitcoin detection library

Rspamd now supports Lua filters for regular expressions. The idea is to combine a fast pre-filter based on regular expressions with slow Lua postprocessing for the cases where that processing is needed. Here is how it is used in the bitcoin library:

config.regexp['RE_POSTPROCESS'] = {
  description = 'Example of postprocessing for regular expressions',
  re = string.format('(%s) || (%s)', re1, re2),
  re_conditions = {
    [re1] = function(task, txt, s, e)
      if e - s <= 2 then
        return false
      end

      if check_re1(task, txt:sub(s + 1, e)) then
        return true
      end
    end,
    [re2] = function(task, txt, s, e)
      if e - s <= 2 then
        return false
      end

      if check_re2(task, txt:sub(s + 1, e)) then
        return true
      end
    end,
  },
}

This makes it possible to add accelerated rules whose expensive checks run only when some relatively rare regular expression matches. In this particular case, the feature is used to verify and validate BTC wallet addresses.
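
The two-stage idea can also be sketched outside of Rspamd in plain Lua. The example below is an illustration under stated assumptions, not the bitcoin library itself: scan_for_wallets and looks_like_legacy_wallet are hypothetical names, a Lua pattern stands in for the fast regular expression, and the slow step only checks the shape of a legacy address (length, first character, Base58 alphabet) instead of performing real checksum validation. It also uses ordinary 1-based Lua string indices rather than the offsets passed to the callbacks above.

```lua
-- Illustrative sketch only: hypothetical helpers, no checksum validation.

-- Base58 alphabet: digits and letters without 0, O, I and l
local BASE58 = '123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz'

-- Slow postprocessing step: character-by-character shape check
local function looks_like_legacy_wallet(candidate)
  if #candidate < 26 or #candidate > 35 then
    return false
  end
  local first = candidate:sub(1, 1)
  if first ~= '1' and first ~= '3' then
    return false
  end
  for i = 1, #candidate do
    -- plain find (no pattern magic) of a single character in the alphabet
    if not BASE58:find(candidate:sub(i, i), 1, true) then
      return false
    end
  end
  return true
end

-- Fast pre-filter: find alphanumeric runs, and call the slow check
-- only for the rare runs that are long enough to be a wallet
local function scan_for_wallets(text)
  local hits = {}
  local init = 1
  while true do
    local s, e = text:find('%w+', init)
    if not s then
      break
    end
    if e - s + 1 >= 26 and looks_like_legacy_wallet(text:sub(s, e)) then
      hits[#hits + 1] = text:sub(s, e)
    end
    init = e + 1
  end
  return hits
end

-- Usage: only the first candidate survives the postprocessing step
local good = '1' .. string.rep('A', 29)
local bad = '1' .. string.rep('0', 29)
local found = scan_for_wallets('pay ' .. good .. ' or ' .. bad .. ' today')
```

Most text never reaches the slow function at all, which is the same trade-off the conditional regular expressions make: the cheap match decides whether the expensive Lua code is worth running.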

IDNA bugs are fixed

Dr. Hajime Shimada and Mr. Shirakura from Nagoya University have found that it is possible to bypass Rspamd URL detection by using special Unicode characters. We have changed this behaviour, so full IDNA validation/normalisation is now performed. I would like to thank the researchers for sharing this with us.

Fuzzy module telemetry

Rspamd will now send more data when checking for fuzzy hashes: it will send the source IP address of the email being scanned and the domain name of the sender. This data is end-to-end encrypted between you and the Rspamd public fuzzy storage, and I plan to use it for better spam detection. If you don’t want this data to be shared, please stop using the public fuzzy storage or set the no_share flag to true.

Other major improvements

Use google-ced instead of libicu character detection
Rework and refactor forged recipients plugin
Added SO_REUSEPORT support for UDP sockets on Linux

Several major fixes:

Base64 detection has been fixed and improved to reduce FP rate
Query URLs are now fully processed
Bundled libev has been updated to 4.33 (fixing many issues with FD closing race conditions)
Fixed ANN normalisation
Fixed redis backend leaks

Useful features:

Added whitelisted_signers_map in ARC module
Implemented /etc/hosts files processing

Here is the list of the most important changes:

[Conf] Mark Rspamd emailbl as ignore whitelist
[Conf] RBL: Add missing emails = true option
[Feature] Add support for scripts in fuzzy storage
[Feature] Arc: Add whitelisted_signers_map option
[Feature] Implement hosts file processing
[Feature] Neural: Introduce classes bias that allows non-equal classes learning
[Feature] Update libev to 4.33
[Fix] Another brain damage html standard adoptions
[Fix] Another fix for brain damaged obs-fws state
[Fix] Fix flags that caused force_actions failure
[Fix] Fix logging issue
[Fix] Fix lua symbols scores registration when config does not define scores
[Fix] Fix opaque maps logic
[Fix] Fix parsing of the html tags with no spaces after attributes
[Fix] Fix some corner cases in urls parsing, add limits
[Fix] Fix tlds extraction if custom composition rules are used
[Fix] Fix variables replacement in mempool
[Fix] Improve base64 detection
[Fix] Normalize dynamic scores in ANN correctly
[Fix] Plug memory leak introduced by #3153
[Fix] Stat_redis_backend: Fix memory leak and simplify learn path
[Fix] Try hard to deal with ghost workers
[Fix] metadata_exporter default formatter
[Rework] Change the way to extract URLs when dealing with alternative parts
[Rework] Fix various url extraction issues
[Rework] Re cache: Load compiled hyperscan in the main process as well
[Rework] Re cache: Load hyperscan early
[Rework] Rework URL structure: adjust tld part
[Rework] Rework URL structure: host field
[Rework] Rework URL structure: more structure optimisations
[Rework] Rework URL structure: user field
[Rework] URL: Another update for urls extraction logic
[Rework] Urls: Improve query urls handling
[Rework] Urls: adopt html related stuff
[Rework] Urls: more rework of the urls sets
[Rework] Urls: process query urls in HTML urls correctly
[Rework] Urls: rework urls hash structure
[Rework] Urls: update lua libraries
[Rework] Use multiple search tries for different url extraction types