12.4  Web content filtering by word occurrence

WinRoute can also filter Web pages that include undesirable words.

The filtering principle is as follows: each forbidden word is assigned a value called its weight (a positive whole number). The weights of all forbidden words found on a requested page are summed; each word is counted only once, regardless of how many times it occurs on the page. If the total weight exceeds the defined limit (the so-called threshold value), the page is blocked.
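The principle above can be sketched in a few lines of Python. This is only an illustration, not WinRoute's implementation: the function names, the sample words and weights, and the way the page text is split into words (on non-word characters, compared case-insensitively) are all assumptions made for the example.

```python
import re

def page_weight(text, weights):
    """Sum the weights of forbidden words present in text, each counted once."""
    # Assumption: the page is split on non-word characters and compared
    # case-insensitively; the real matcher may behave differently.
    found = {w.casefold() for w in re.findall(r"\w+", text)}
    return sum(weight for word, weight in weights.items()
               if word.casefold() in found)

def is_blocked(text, weights, threshold):
    # The page is blocked only when the total weight exceeds the threshold.
    return page_weight(text, weights) > threshold

# Hypothetical forbidden words and weights.
weights = {"crack": 40, "keygen": 50, "serial": 20}
page = "Download crack here. Crack works with any serial."

print(page_weight(page, weights))     # 60 ("crack" is counted once)
print(is_blocked(page, weights, 50))  # True
```

Although "crack" occurs twice in the sample page, it contributes its weight of 40 only once; together with "serial" the total is 60, which exceeds the threshold of 50.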

Forbidden words are used to filter out web pages with undesirable content. URL rules (see chapter 12.2  URL Rules) define how pages containing forbidden words are handled.

Warning

Defining forbidden words and a threshold value has no effect unless corresponding URL rules are set!

Definition of rules filtering by word occurrence

First, suppose that some forbidden words have already been defined and a threshold value has been set (for details, see below).

On the URL Rules tab under Configuration → Content Filtering → HTTP Policy, create a rule (or a set of rules) that allows access to the group of web pages to be filtered by forbidden words. Then go to the Content Rules tab under HTTP Rule and enable the web content filter.

As an example, consider a rule that filters all web sites by occurrence of forbidden words.

  • On the General tab, allow all users to access any web site.

    Figure 12.9. A rule filtering web pages by word occurrence (allow access)


  • On the Content Rules tab, check the Deny Web pages containing... option to enable filtering by word occurrence.

    Figure 12.10. A rule filtering web pages by word occurrence (word filtering)


Word groups

To define word groups, go to the Forbidden Words tab in Configuration → Content Filtering → HTTP Policy. Words are sorted into groups. Grouping only makes the word list easier to follow: all groups have the same priority and all of them are always tested.

Figure 12.11. Groups of forbidden words


Individual groups and the words they contain are displayed as trees. To enable filtering by a particular word, check the box next to it. Unchecked words are ignored. This makes it unnecessary to remove words and define them again later.

Note: The following word groups are predefined in the default WinRoute installation:

  • Pornography — words that typically appear on pages with erotic themes,

  • Warez / Cracks — words that typically appear on pages offering downloads of illegal software, license key generators etc.

All keywords in the predefined groups are disabled by default. A WinRoute administrator can enable filtering by particular words and modify the weight of each word.
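One way to picture the group tree and its checkboxes is as a data structure. The sketch below is an assumption made for illustration (the class, field names, sample keywords and weights are hypothetical and do not reflect how WinRoute stores its configuration). It shows that groups only organize the list: every group is tested, and only checked words take part in filtering.

```python
from dataclasses import dataclass

@dataclass
class ForbiddenWord:
    keyword: str
    weight: int
    enabled: bool = False   # predefined words start unchecked
    description: str = ""

def active_weights(groups):
    """Flatten all groups into one keyword -> weight map of enabled words."""
    active = {}
    for words in groups.values():   # every group is tested; none has priority
        for word in words:
            if word.enabled:        # unchecked words are ignored
                active[word.keyword] = word.weight
    return active

# Hypothetical contents; the real predefined keywords are not listed here.
groups = {
    "Pornography": [ForbiddenWord("placeholderword", 30)],
    "Warez / Cracks": [
        ForbiddenWord("keygen", 50, enabled=True),
        ForbiddenWord("crack", 40),
    ],
}

print(active_weights(groups))  # {'keygen': 50}
```

Unchecking a word simply removes it from the active map while its definition stays in the group, so it can be re-enabled later without being entered again.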

Threshold value for Web page filtering

The value specified in Deny pages with weight over is the so-called threshold weight for each page, that is, the limit for the total weight of all forbidden words found on the page. If the total weight of the tested page exceeds this limit, access to the page is denied. Each word is counted only once, regardless of how many times it occurs on the page.
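The wording "over" and "exceeds" suggests a strict comparison, so a page whose total weight equals the threshold would still be allowed. The short sketch below illustrates that reading; the function name and the threshold of 100 are hypothetical, and the behavior exactly at the limit is an assumption based on the wording rather than a confirmed detail.

```python
def decision(total_weight, threshold):
    # Assumption: strict comparison, following the wording "weight over".
    return "deny" if total_weight > threshold else "allow"

threshold = 100  # hypothetical threshold value
for total in (99, 100, 101):
    print(total, decision(total, threshold))
# 99 allow
# 100 allow
# 101 deny
```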

Definition of forbidden words

Use the Add button to add a new word into a group or to create a new group.

Figure 12.12. Definition of a forbidden word and/or a word group


Group

Selection of the group to which the word will be added. Entering a new name creates a new group.

Keyword

The forbidden word to be scanned for. The word can be in any language and must be entered in the exact form in which it is used on websites (including diacritics and other special symbols and characters). If the word has various forms (declension, conjugation, etc.), each form must be defined as a separate word in the group. Each form can be assigned a different weight.

Weight

Weight of the word, that is, how strongly the word affects whether access to a website is blocked or allowed. The weight should reflect how frequent the word is in the language (the more common the word, the lower the weight) so that legitimate web pages are not blocked.

Description

A comment on the word or group.
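The requirement to enter each word form separately, with its exact spelling, can be illustrated as follows. This is a sketch under the assumption that keywords are compared with whole words on the page exactly as typed; the helper name, the sample words and the weights are hypothetical.

```python
def matches(keyword, page_words):
    # Assumption: exact comparison, so inflected forms and diacritics matter.
    return keyword in page_words

page_words = {"cracked", "résumé"}

print(matches("crack", page_words))    # False: a different form of the word
print(matches("cracked", page_words))  # True
print(matches("resume", page_words))   # False: diacritics are missing
print(matches("résumé", page_words))   # True

# Each form is defined as its own entry and may carry its own weight.
forms = {"crack": 40, "cracks": 40, "cracked": 30, "cracking": 30}
print(sum(w for k, w in forms.items() if matches(k, page_words)))  # 30
```

Giving common forms a lower weight than rare, more specific ones follows the advice in the Weight field: frequent words should contribute less so that legitimate pages are not blocked.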