Share and Enjoy !

MongoDB Collation Settings

Introduced in version 3.4, these MongoDB settings influence how data is selected and the order in which it is returned.

This post will demonstrate the collation settings used and the results of those settings. I have a sample set of data, queries, and results, along with some comments, to support these conclusions.

Use Case

We had the following use case on a recent project using MongoDB (version 4.2):

  • User wants to find data in a collection based on a user-entered string value.
  • The property to search in the collection is string data.
  • Data should be found regardless of character “case”; for example, searching for “e” should also match “E”.
  • Use an “includes” type search instead of an exact match; for example, searching for “e” should match both “e” and “yes”.
  • Ignore diacritics when matching data; for example, searching for “e” should also match “é”.
  • The sort order of results should be case-insensitive.

Implementation Details

To satisfy the use case presented, we did the following:

  • Used the MongoDB aggregation pipeline pattern with “allowDiskUse” to support large data sets, millions of items.
  • Used a case-insensitive regular expression query.
  • Used a collation to retrieve the results in case-insensitive order.

Below is the information we used to make these implementation decisions along with sample data and queries to demonstrate how various settings change the results.

MongoDB Information

This is the information that influenced our testing and decisions.

Collations

Collation allows users to specify language-specific rules for string comparison, such as rules for letter case and accent marks. You can specify collation for a collection or a view, an index, or specific operations that support collation.

Collation “Strength” Setting

The level of comparison to perform, which conforms to the ICU Comparison Levels.

MongoDB DocumentationInternational Components for Unicode (ICU)
Strength 1 – Primary level of comparison. Collation performs comparisons of the base characters only, ignoring other differences such as diacritics and case.Primary Level: Typically, this is used to denote differences between base characters (for example, “a” < “b”). It is the strongest difference. For example, dictionaries are divided into different sections by base character. This is also called the level-1 strength.
Strength 2 – Secondary level of comparison. Collation performs comparisons up to secondary differences, such as diacritics. That is, collation performs comparisons of base characters (primary differences) and diacritics (secondary differences). Differences between base characters take precedence over secondary differences.Secondary Level: Accents in the characters are considered secondary differences (for example, “as” < “às” < “at”). Other differences between letters can also be considered secondary differences, depending on the language. A secondary difference is ignored when there is a primary difference anywhere in the strings. This is also called the level-2 strength.
Note: In some languages (such as Danish), certain accented letters are considered to be separate base characters. In most languages, however, an accented letter only has a secondary difference from the unaccented version of that letter.
Strength 3 – Tertiary level of comparison. Collation performs comparisons up to tertiary differences, such as case and letter variants. That is, collation performs comparisons of base characters (primary differences), diacritics (secondary differences), and cases and variants (tertiary differences). Differences between base characters take precedence over secondary differences, which takes precedence over tertiary differences. This is the default strength level if strength is not specified.Tertiary Level: Upper- and lower-case differences in characters are distinguished at the tertiary level (for example, “ao” < “Ao” < “aò”). In addition, a variant of a letter differs from the base form on the tertiary level (such as “A” and “Ⓐ”). Another example is the difference between large and small Kana. A tertiary difference is ignored when there is a primary or secondary difference anywhere in the strings. This is also called the level-3 strength

 

Collation Indexes

For the purposes of this post, I have purposely disregarded any index creation and usage regarding the sample queries and results.

Collation and Index Use

See: https://docs.mongodb.com/manual/reference/collation/index.html

To use an index for string comparisons, an operation must also specify the same collation. That is, an index with a collation cannot support an operation that performs string comparisons on the indexed fields if the operation specifies a different collation.

Regular Expressions and Index Use

See: https://docs.mongodb.com/manual/reference/operator/query/regex/#index-use

For case sensitive regular expression queries, if an index exists for the field, then MongoDB matches the regular expression against the values in the index, which can be faster than a collection scan.

Case insensitive regular expression queries generally cannot use indexes effectively. The $regex implementation is not collation-aware and is unable to utilize case-insensitive indexes.

The “allowDiskUse” option

See: https://docs.mongodb.com/manual/reference/limits/#Sort-Operations

If MongoDB cannot use an index to get documents in the requested sort order, the combined size of all documents in the sort operation, plus a small overhead, must be less than 32 megabytes.

If the size is exceeded, you will get the error Overflow sort stage buffered data usage exceeds internal limit.

To allow for the handling of large datasets, you can set the allowDiskUse option in the aggregate() method. The allowDiskUse option enables most aggregation pipeline operations to write data to a temporary file.

Example Queries and Sample Data

Below is sample data and queries that demonstrate the effects of collation settings.

Query 1 – Select all rows

Query 1.1

  • No collation specified.
  • The sort order is case-sensitive.

[ps]db.getCollection(‘SortTest’)
.aggregate(
[{ $sort: { "Name": 1 } }],
{
"collation": { "locale": "en", strength: 1 },
"allowDiskUse": true
}
)[/ps]

Sample Data

AAA001
EEE003
EXTRA AAA009
EXTRA EEE015
EXTRA ZZZ021
EXTRA aaa010
EXTRA eee016
EXTRA zzz022
EXTRA ÉÉÉ017
EXTRA ééé018
ZZZ023
aaa002
eee004
extra AAA007
extra EEE011
extra ZZZ019
extra aaa008
extra eee012
extra zzz020
extra ÉÉÉ013
extra ééé014
zzz024
ÉÉÉ005
ééé006

Query 2 – Exact match of property value where Code = “eee”

Query 2.1

  • No collation specified.
  • The search is case-sensitive.
  • The search is diacritics sensitive.
  • The sort order is case-sensitive. 

[ps]db.getCollection(‘SortTest’)
.aggregate(
[
{ $match: { "Code": "eee" } },
{ $sort: { "Name": 1 } }
],
{
"allowDiskUse": true
}
)[/ps]

Sample Data

AAA001
aaa002
EEE003
eee004
ÉÉÉ005
ééé006
extra AAA007
extra aaa008
EXTRA AAA009
EXTRA aaa010
extra EEE011
extra eee012
extra ÉÉÉ013
extra ééé014
EXTRA EEE015
EXTRA eee016
EXTRA ÉÉÉ017
EXTRA ééé018
extra ZZZ019
extra zzz020
EXTRA ZZZ021
EXTRA zzz022
ZZZ023
zzz024

Query 2 – Exact match of property value where Code = “eee”

Query 2.1

  • No collation specified.
  • The search is case-sensitive.
  • The search is diacritics sensitive.
  • The sort order is case-sensitive. 

[ps] db.getCollection(‘SortTest’)
.aggregate(
[
{ $match: { "Code": "eee" } },
{ $sort: { "Name": 1 } }
],
{
"allowDiskUse": true
}
) [/ps]

Sample Data

EXTRA eee016
eee004
extra eee012

Query 2.2

  • The search is case-insensitive.
  • The search is diacritics insensitive.
  • The sort order is case-insensitive.
  • Strength 1 – Collation performs comparisons of the base characters only, ignoring other differences such as diacritics and case.

[ps]db.getCollection(‘SortTest’)
.aggregate(
[
{ $match: { "Code": "eee" } },
{ $sort: { "Name": 1 } }
],
{
"collation": { "locale": "en", strength: 1 },
"allowDiskUse": true
}
)
[/ps]

Sample Data

EEE003
eee004
ÉÉÉ005
ééé006
extra EEE011
extra eee012
extra ÉÉÉ013
extra ééé014
EXTRA EEE015
EXTRA eee016
EXTRA ÉÉÉ017
EXTRA ééé018

Query 2.3

  • The search is case-insensitive.
  • The search is diacritics sensitive.
  • The sort order is case-insensitive.
  • Strength 2 – Collation performs comparisons up to secondary differences, such as diacritics. That is, collation performs comparisons of base characters (primary differences) and diacritics (secondary differences). Differences between base characters take precedence over secondary differences. 

[ps]db.getCollection(‘SortTest’)
.aggregate(
[
{ $match: { "Code": "eee" } },
{ $sort: { "Name": 1 } }
],
{
"collation": { "locale": "en", strength: 2 },
"allowDiskUse": true
}[/ps]

Sample Data

EEE003
eee004
extra EEE011
extra eee012
EXTRA EEE015
EXTRA eee016

Query 2.4

  • The search is case-sensitive.
  • The search is diacritics sensitive.
  • The sort order is case-insensitive.

Strength 3 – This is the default strength level if strength level not specified. Collation performs comparisons up to tertiary differences, such as case and letter variants. That is, collation performs comparisons of base characters (primary differences), diacritics (secondary differences), and cases and variants (tertiary differences). Differences between base characters take precedence over secondary differences, which takes precedence over tertiary differences.

[ps]db.getCollection(‘SortTest’)
.aggregate(
[
{ $match: { "Code": "eee" } },
{ $sort: { "Name": 1 } }
],
{
"collation": { "locale": "en", strength: 3 },
"allowDiskUse": true
}
)
[/ps]

Sample Data

eee004
extra eee012
EXTRA eee016

Query 3 – Case-insensitive regular expression match of property value where Code contains “eee”

Query 3.1

    • The sort order is case-sensitive.

[ps]db.getCollection(‘SortTest’)
.aggregate(
[
{ $match: { "Code": /eee/i } },
{ $sort: { "Name": 1 } }
],
{
"allowDiskUse": true
}
)
[/ps]

Sample Data

EEE003
EXTRA EEE015
EXTRA eee016
eee004
extra EEE011
extra eee012

Query 3.2

  • Using collation. Same results for strength values 2 and 3. 

[ps]db.getCollection(‘SortTest’)
.aggregate(
[
{ $match: { "Code": /eee/i } },
{ $sort: { "Name": 1 } }
],
{
"collation": { "locale": "en", strength: 1 },
"allowDiskUse": true
}
)
[/ps]

Sample Data

EEE003
eee004
extra EEE011
extra eee012
EXTRA EEE015
EXTRA eee016

Query 3.3

  • Note that additional “equivalent” characters need to be included in regular expression.
  • Using collation. Same results for strength values 2 and 3.

[ps]db.getCollection(‘SortTest’)
.aggregate(
[
{ $match: { "Code": /[e,è,é,ê,ë][e,è,é,ê,ë][e,è,é,ê,ë]/i } },
{ $sort: { "Name": 1 } }
],
{
"collation": { "locale": "en", strength: 1 },
"allowDiskUse": true
}
)[/ps]

Sample Data

EEE003
eee004
ÉÉÉ005
ééé006
extra EEE011
extra eee012
extra ÉÉÉ013
extra ééé014
EXTRA EEE015
EXTRA eee016
EXTRA ÉÉÉ017
EXTRA ééé018

Building the regular expression

This helper method used to support finding data disregarding diacritic marks:

  • Parameter: “eee”
  • Returns: “[e,è,é,ê,ë][e,è,é,ê,ë][e,è,é,ê,ë]”

[ps]protected string DiacriticSensitiveRegex(string value = "")
{
string after = value.ToLower()
.Replace("a", "[a,à,á,â,ã,ä,å]")
.Replace("b", "[b,þ]")
.Replace("c", "[c,¢,Ç]")
.Replace("e", "[e,è,é,ê,ë]")
.Replace("f", "[f,ƒ]")
.Replace("i", "[i,ì,î,í,ï]")
.Replace("n", "[n,ñ]")
.Replace("o", "[o,ð,ò,ó,ô,õ,ö,ø]")
.Replace("s", "[s,š]")
.Replace("u", "[u,µ,ù,ú,û,ü]")
.Replace("y", "[y,ý]")
.Replace("z", "[z,ž]");
return after;
}[/ps]

Example Data

Below is the data used to populate the “SortTest” collection used in the examples above.

[ps]var myData = [
{ "_id": 1, "Name": "AAA001", "Code": "AAA" }
, { "_id": 2, "Name": "aaa002", "Code": "aaa" }
, { "_id": 3, "Name": "EEE003", "Code": "EEE" }
, { "_id": 4, "Name": "eee004", "Code": "eee" }
, { "_id": 5, "Name": "ÉÉÉ005", "Code": "ÉÉÉ" }
, { "_id": 6, "Name": "ééé006", "Code": "ééé" }
, { "_id": 7, "Name": "extra AAA007", "Code": "AAA" }
, { "_id": 8, "Name": "extra aaa008", "Code": "aaa" }
, { "_id": 9, "Name": "EXTRA AAA009", "Code": "AAA" }
, { "_id": 10, "Name": "EXTRA aaa010", "Code": "aaa" }
, { "_id": 11, "Name": "extra EEE011", "Code": "EEE" }
, { "_id": 12, "Name": "extra eee012", "Code": "eee" }
, { "_id": 13, "Name": "extra ÉÉÉ013", "Code": "ÉÉÉ" }
, { "_id": 14, "Name": "extra ééé014", "Code": "ééé" }
, { "_id": 15, "Name": "EXTRA EEE015", "Code": "EEE" }
, { "_id": 16, "Name": "EXTRA eee016", "Code": "eee" }
, { "_id": 17, "Name": "EXTRA ÉÉÉ017", "Code": "ÉÉÉ" }
, { "_id": 18, "Name": "EXTRA ééé018", "Code": "ééé" }
, { "_id": 19, "Name": "extra ZZZ019", "Code": "ZZZ" }
, { "_id": 20, "Name": "extra zzz020", "Code": "zzz" }
, { "_id": 21, "Name": "EXTRA ZZZ021", "Code": "ZZZ" }
, { "_id": 22, "Name": "EXTRA zzz022", "Code": "zzz" }
, { "_id": 23, "Name": "ZZZ023", "Code": "ZZZ" }
, { "_id": 24, "Name": "zzz024", "Code": "zzz" }
];

var myTable = db.getCollection(‘SortTest’);
myTable.remove({});
myTable.insert(myData);[/ps]


Want to learn more about application development? Check out our resources here.

Share and Enjoy !

Related Content: