Matthew Chestnut is a Senior Consultant at ThreeWill. He has over 25 years of software development experience around enterprise and departmental business productivity applications. He has a proven track record of quality software development, on-budget project management and management of successful software development teams.
MongoDB Collation Settings
Introduced in version 3.4, these MongoDB settings influence how data is selected and the order in which it is returned.
This post will demonstrate the collation settings used and the results of those settings. I have a sample set of data, queries, and results, along with some comments, to support these conclusions.
Use Case
We had the following use case on a recent project using MongoDB (version 4.2):
- User wants to find data in a collection based on a user-entered string value.
- The property to search in the collection is string data.
- Data should be found regardless of character “case”; for example, searching for “e” should also match “E”.
- Use an “includes” type search instead of an exact match; for example, searching for “e” should match both “e” and “yes”.
- Ignore diacritics when matching data; for example, searching for “e” should also match “é”.
- The sort order of results should be case-insensitive.
Implementation Details
To satisfy the use case presented, we did the following:
- Used the MongoDB aggregation pipeline pattern with “allowDiskUse” to support large data sets, millions of items.
- Used a case-insensitive regular expression query.
- Used a collation to retrieve the results in case-insensitive order.
Below is the information we used to make these implementation decisions along with sample data and queries to demonstrate how various settings change the results.
MongoDB Information
This is the information that influenced our testing and decisions.
Collations
Collation allows users to specify language-specific rules for string comparison, such as rules for letter case and accent marks. You can specify collation for a collection or a view, an index, or specific operations that support collation.
Collation “Strength” Setting
The level of comparison to perform, which conforms to the ICU Comparison Levels.
- MongoDB Documentation: https://docs.mongodb.com/manual/reference/collation/index.html
- International Components for Unicode (ICU) Standards: https://docs.mongodb.com/manual/reference/collation/
MongoDB Documentation | International Components for Unicode (ICU) |
Strength 1 – Primary level of comparison. Collation performs comparisons of the base characters only, ignoring other differences such as diacritics and case. | Primary Level: Typically, this is used to denote differences between base characters (for example, “a” < “b”). It is the strongest difference. For example, dictionaries are divided into different sections by base character. This is also called the level-1 strength. |
Strength 2 – Secondary level of comparison. Collation performs comparisons up to secondary differences, such as diacritics. That is, collation performs comparisons of base characters (primary differences) and diacritics (secondary differences). Differences between base characters take precedence over secondary differences. | Secondary Level: Accents in the characters are considered secondary differences (for example, “as” < “às” < “at”). Other differences between letters can also be considered secondary differences, depending on the language. A secondary difference is ignored when there is a primary difference anywhere in the strings. This is also called the level-2 strength.Note: In some languages (such as Danish), certain accented letters are considered to be separate base characters. In most languages, however, an accented letter only has a secondary difference from the unaccented version of that letter. |
Strength 3 – Tertiary level of comparison. Collation performs comparisons up to tertiary differences, such as case and letter variants. That is, collation performs comparisons of base characters (primary differences), diacritics (secondary differences), and cases and variants (tertiary differences). Differences between base characters take precedence over secondary differences, which takes precedence over tertiary differences. This is the default strength level if strength is not specified. | Tertiary Level: Upper- and lower-case differences in characters are distinguished at the tertiary level (for example, “ao” < “Ao” < “aò”). In addition, a variant of a letter differs from the base form on the tertiary level (such as “A” and “Ⓐ”). Another example is the difference between large and small Kana. A tertiary difference is ignored when there is a primary or secondary difference anywhere in the strings. This is also called the level-3 strength |
Collation Indexes
For the purposes of this post, I have purposely disregarded any index creation and usage regarding the sample queries and results.
Collation and Index Use
See: https://docs.mongodb.com/manual/reference/collation/index.html
To use an index for string comparisons, an operation must also specify the same collation. That is, an index with a collation cannot support an operation that performs string comparisons on the indexed fields if the operation specifies a different collation.
Regular Expressions and Index Use
See: https://docs.mongodb.com/manual/reference/operator/query/regex/#index-use
For case sensitive regular expression queries, if an index exists for the field, then MongoDB matches the regular expression against the values in the index, which can be faster than a collection scan.
Case insensitive regular expression queries generally cannot use indexes effectively. The $regex implementation is not collation-aware and is unable to utilize case-insensitive indexes.
The “allowDiskUse” option
See: https://docs.mongodb.com/manual/reference/limits/#Sort-Operations
If MongoDB cannot use an index to get documents in the requested sort order, the combined size of all documents in the sort operation, plus a small overhead, must be less than 32 megabytes.
If the size is exceeded, you will get the error Overflow sort stage buffered data usage exceeds internal limit.
To allow for the handling of large datasets, you can set the allowDiskUse option in the aggregate() method. The allowDiskUse option enables most aggregation pipeline operations to write data to a temporary file.
Example Queries and Sample Data
Below is sample data and queries that demonstrate the effects of collation settings.
Query 1 – Select all rows
Query 1.1
- No collation specified.
- The sort order is case-sensitive.
[ps]db.getCollection(‘SortTest’)
.aggregate(
[{ $sort: { "Name": 1 } }],
{
"collation": { "locale": "en", strength: 1 },
"allowDiskUse": true
}
)[/ps]
Sample Data
AAA001 EEE003 EXTRA AAA009 EXTRA EEE015 EXTRA ZZZ021 EXTRA aaa010 EXTRA eee016 EXTRA zzz022 EXTRA ÉÉÉ017 EXTRA ééé018 ZZZ023 aaa002 eee004 extra AAA007 extra EEE011 extra ZZZ019 extra aaa008 extra eee012 extra zzz020 extra ÉÉÉ013 extra ééé014 zzz024 ÉÉÉ005 ééé006
Query 2 – Exact match of property value where Code = “eee”
Query 2.1
- No collation specified.
- The search is case-sensitive.
- The search is diacritics sensitive.
- The sort order is case-sensitive.
[ps]db.getCollection(‘SortTest’)
.aggregate(
[
{ $match: { "Code": "eee" } },
{ $sort: { "Name": 1 } }
],
{
"allowDiskUse": true
}
)[/ps]
Sample Data
AAA001 aaa002 EEE003 eee004 ÉÉÉ005 ééé006 extra AAA007 extra aaa008 EXTRA AAA009 EXTRA aaa010 extra EEE011 extra eee012 extra ÉÉÉ013 extra ééé014 EXTRA EEE015 EXTRA eee016 EXTRA ÉÉÉ017 EXTRA ééé018 extra ZZZ019 extra zzz020 EXTRA ZZZ021 EXTRA zzz022 ZZZ023 zzz024
Query 2 – Exact match of property value where Code = “eee”
Query 2.1
- No collation specified.
- The search is case-sensitive.
- The search is diacritics sensitive.
- The sort order is case-sensitive.
[ps] db.getCollection(‘SortTest’)
.aggregate(
[
{ $match: { "Code": "eee" } },
{ $sort: { "Name": 1 } }
],
{
"allowDiskUse": true
}
) [/ps]
Sample Data
EXTRA eee016 eee004 extra eee012
Query 2.2
- The search is case-insensitive.
- The search is diacritics insensitive.
- The sort order is case-insensitive.
- Strength 1 – Collation performs comparisons of the base characters only, ignoring other differences such as diacritics and case.
[ps]db.getCollection(‘SortTest’)
.aggregate(
[
{ $match: { "Code": "eee" } },
{ $sort: { "Name": 1 } }
],
{
"collation": { "locale": "en", strength: 1 },
"allowDiskUse": true
}
)
[/ps]
Sample Data
EEE003 eee004 ÉÉÉ005 ééé006 extra EEE011 extra eee012 extra ÉÉÉ013 extra ééé014 EXTRA EEE015 EXTRA eee016 EXTRA ÉÉÉ017 EXTRA ééé018
Query 2.3
- The search is case-insensitive.
- The search is diacritics sensitive.
- The sort order is case-insensitive.
- Strength 2 – Collation performs comparisons up to secondary differences, such as diacritics. That is, collation performs comparisons of base characters (primary differences) and diacritics (secondary differences). Differences between base characters take precedence over secondary differences.
[ps]db.getCollection(‘SortTest’)
.aggregate(
[
{ $match: { "Code": "eee" } },
{ $sort: { "Name": 1 } }
],
{
"collation": { "locale": "en", strength: 2 },
"allowDiskUse": true
}[/ps]
Sample Data
EEE003 eee004 extra EEE011 extra eee012 EXTRA EEE015 EXTRA eee016
Query 2.4
- The search is case-sensitive.
- The search is diacritics sensitive.
- The sort order is case-insensitive.
Strength 3 – This is the default strength level if strength level not specified. Collation performs comparisons up to tertiary differences, such as case and letter variants. That is, collation performs comparisons of base characters (primary differences), diacritics (secondary differences), and cases and variants (tertiary differences). Differences between base characters take precedence over secondary differences, which takes precedence over tertiary differences.
[ps]db.getCollection(‘SortTest’)
.aggregate(
[
{ $match: { "Code": "eee" } },
{ $sort: { "Name": 1 } }
],
{
"collation": { "locale": "en", strength: 3 },
"allowDiskUse": true
}
)
[/ps]
Sample Data
eee004 extra eee012 EXTRA eee016
Query 3 – Case-insensitive regular expression match of property value where Code contains “eee”
Query 3.1
- The sort order is case-sensitive.
[ps]db.getCollection(‘SortTest’)
.aggregate(
[
{ $match: { "Code": /eee/i } },
{ $sort: { "Name": 1 } }
],
{
"allowDiskUse": true
}
)
[/ps]
Sample Data
EEE003 EXTRA EEE015 EXTRA eee016 eee004 extra EEE011 extra eee012
Query 3.2
- Using collation. Same results for strength values 2 and 3.
[ps]db.getCollection(‘SortTest’)
.aggregate(
[
{ $match: { "Code": /eee/i } },
{ $sort: { "Name": 1 } }
],
{
"collation": { "locale": "en", strength: 1 },
"allowDiskUse": true
}
)
[/ps]
Sample Data
EEE003 eee004 extra EEE011 extra eee012 EXTRA EEE015 EXTRA eee016
Query 3.3
- Note that additional “equivalent” characters need to be included in regular expression.
- Using collation. Same results for strength values 2 and 3.
[ps]db.getCollection(‘SortTest’)
.aggregate(
[
{ $match: { "Code": /[e,è,é,ê,ë][e,è,é,ê,ë][e,è,é,ê,ë]/i } },
{ $sort: { "Name": 1 } }
],
{
"collation": { "locale": "en", strength: 1 },
"allowDiskUse": true
}
)[/ps]
Sample Data
EEE003 eee004 ÉÉÉ005 ééé006 extra EEE011 extra eee012 extra ÉÉÉ013 extra ééé014 EXTRA EEE015 EXTRA eee016 EXTRA ÉÉÉ017 EXTRA ééé018
Building the regular expression
This helper method used to support finding data disregarding diacritic marks:
- Parameter: “eee”
- Returns: “[e,è,é,ê,ë][e,è,é,ê,ë][e,è,é,ê,ë]”
[ps]protected string DiacriticSensitiveRegex(string value = "")
{
string after = value.ToLower()
.Replace("a", "[a,à,á,â,ã,ä,å]")
.Replace("b", "[b,þ]")
.Replace("c", "[c,¢,Ç]")
.Replace("e", "[e,è,é,ê,ë]")
.Replace("f", "[f,ƒ]")
.Replace("i", "[i,ì,î,í,ï]")
.Replace("n", "[n,ñ]")
.Replace("o", "[o,ð,ò,ó,ô,õ,ö,ø]")
.Replace("s", "[s,š]")
.Replace("u", "[u,µ,ù,ú,û,ü]")
.Replace("y", "[y,ý]")
.Replace("z", "[z,ž]");
return after;
}[/ps]
Example Data
Below is the data used to populate the “SortTest” collection used in the examples above.
[ps]var myData = [
{ "_id": 1, "Name": "AAA001", "Code": "AAA" }
, { "_id": 2, "Name": "aaa002", "Code": "aaa" }
, { "_id": 3, "Name": "EEE003", "Code": "EEE" }
, { "_id": 4, "Name": "eee004", "Code": "eee" }
, { "_id": 5, "Name": "ÉÉÉ005", "Code": "ÉÉÉ" }
, { "_id": 6, "Name": "ééé006", "Code": "ééé" }
, { "_id": 7, "Name": "extra AAA007", "Code": "AAA" }
, { "_id": 8, "Name": "extra aaa008", "Code": "aaa" }
, { "_id": 9, "Name": "EXTRA AAA009", "Code": "AAA" }
, { "_id": 10, "Name": "EXTRA aaa010", "Code": "aaa" }
, { "_id": 11, "Name": "extra EEE011", "Code": "EEE" }
, { "_id": 12, "Name": "extra eee012", "Code": "eee" }
, { "_id": 13, "Name": "extra ÉÉÉ013", "Code": "ÉÉÉ" }
, { "_id": 14, "Name": "extra ééé014", "Code": "ééé" }
, { "_id": 15, "Name": "EXTRA EEE015", "Code": "EEE" }
, { "_id": 16, "Name": "EXTRA eee016", "Code": "eee" }
, { "_id": 17, "Name": "EXTRA ÉÉÉ017", "Code": "ÉÉÉ" }
, { "_id": 18, "Name": "EXTRA ééé018", "Code": "ééé" }
, { "_id": 19, "Name": "extra ZZZ019", "Code": "ZZZ" }
, { "_id": 20, "Name": "extra zzz020", "Code": "zzz" }
, { "_id": 21, "Name": "EXTRA ZZZ021", "Code": "ZZZ" }
, { "_id": 22, "Name": "EXTRA zzz022", "Code": "zzz" }
, { "_id": 23, "Name": "ZZZ023", "Code": "ZZZ" }
, { "_id": 24, "Name": "zzz024", "Code": "zzz" }
];
var myTable = db.getCollection(‘SortTest’);
myTable.remove({});
myTable.insert(myData);[/ps]
Want to learn more about application development? Check out our resources here.