Full text search in MongoDB

Full text search was demoed live at the MongoSV conference in December 2012 and, as of 25th December 2012, is available in the nightly builds. With the 2.3.2 unstable development release due in a few days, it’ll soon be available in a packaged release.
This is a custom implementation created by the MongoDB developers as a specific index type, and is due to be launched as an experimental feature in MongoDB 2.4. It has features such as:
- Full text search as an index type when creating new indexes, just like any other.
- Indexing of multiple fields, with weighting to give different fields higher priority.
- Support for Latin-based languages initially, with plans for other character sets later. The initial list is: Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish and Turkish.
- Support for advanced queries, similar to the Google search syntax e.g. negation and phrase matching.
- Stemming, to deal with plurals.
- Stop words, so that very common words are ignored.
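To make the query syntax concrete, here is a rough sketch in plain JavaScript of how a search string with negation and a quoted phrase could be split into its parts. The parseSearch function is a made-up helper for illustration only, not MongoDB's parser, and the leading-minus and double-quote conventions are assumed from the Google-style syntax mentioned in the list.

```javascript
// Toy parser for a Google-style search string.
// Illustration only: parseSearch is hypothetical and is NOT MongoDB's parser.
function parseSearch(search) {
  const parsed = { terms: [], negatedTerms: [], phrases: [] };
  // Match either a "quoted phrase" or a run of non-space characters.
  const tokenPattern = /"([^"]+)"|(\S+)/g;
  let match;
  while ((match = tokenPattern.exec(search)) !== null) {
    if (match[1] !== undefined) {
      // Quoted text is kept together as one phrase.
      parsed.phrases.push(match[1].toLowerCase());
    } else if (match[2].startsWith("-") && match[2].length > 1) {
      // A leading minus marks a term that must not appear.
      parsed.negatedTerms.push(match[2].slice(1).toLowerCase());
    } else {
      parsed.terms.push(match[2].toLowerCase());
    }
  }
  return parsed;
}

const parsedQuery = parseSearch('novelist -handball "patron of the arts"');
console.log(parsedQuery);
```

With the two test documents used later in this post, a query like this would be expected to favour the Olivia Shakespear document and exclude anything mentioning handball, assuming the real syntax behaves the way the tests suggest.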
This looks like a good, general purpose full text search engine, which fits well with how MongoDB is developing into a good multi-purpose database. It may well never reach the complexity of “proper” search products like Elasticsearch or Solr, but that is probably not the goal.
We’re planning to use it with Server Density in an upcoming release for quickly searching server inventories, which will mean we don’t have to implement our own code or deploy another product. We’ll know a lot more when the official documentation is out.
Examples
Even though the documentation is not yet ready, you can infer how it works from the tests available in the public GitHub repo.
First we enable full text search in the latest unstable nightly and insert some test documents:
use test
db.adminCommand( { setParameter : "*", textSearchEnabled : true } );
tc = db.test
tc.save( { _id: 1, title: "Olivia Shakespear", text: "Olivia Shakespear (born Olivia Tucker; 17 March 1863 – 3 October 1938) was a British novelist, playwright, and patron of the arts. She wrote six books that are described as \"marriage problem\" novels. Her works sold poorly, sometimes only a few hundred copies. Her last novel, Uncle Hilary, is considered her best. She wrote two plays in collaboration with Florence Farr." } );
tc.save( { _id: 2, title: "Linn-Kristin Riegelhuth Koren", text: "Linn-Kristin Riegelhuth Koren (born 1 August 1984, in Ski) is a Norwegian handballer playing for Larvik HK and the Norwegian national team. She is commonly known as Linka. Outside handball she is a qualified nurse." } );
Then we can create a new index on the title field:
tc.ensureIndex( { "title": "text" } );
and we can now search:
> res = tc.runCommand( "text", { search: "Olivia" } );
{
    "queryDebugString" : "olivia||||||",
    "language" : "english",
    "results" : [
        {
            "score" : 0.75,
            "obj" : {
                "_id" : 1,
                "title" : "Olivia Shakespear",
                "text" : "Olivia Shakespear (born Olivia Tucker; 17 March 1863 – 3 October 1938) was a British novelist, playwright, and patron of the arts. She wrote six books that are described as \"marriage problem\" novels. Her works sold poorly, sometimes only a few hundred copies. Her last novel, Uncle Hilary, is considered her best. She wrote two plays in collaboration with Florence Farr."
            }
        }
    ],
    "stats" : {
        "nscanned" : 1,
        "nscannedObjects" : 0,
        "n" : 1,
        "timeMicros" : 128
    },
    "ok" : 1
}
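The result document is fairly verbose, so in application code you would probably want to boil it down. Here is a minimal sketch in plain JavaScript that assumes the result shape shown in the output above (results, score, obj, ok). The summarizeTextResults name is made up for this post, and the sample object is a trimmed copy of that output rather than a live query.

```javascript
// Hypothetical helper: reduce a text command result to id, title and score.
// Assumes the result shape seen in the nightly build output; it may change.
function summarizeTextResults(res) {
  if (!res || res.ok !== 1) {
    throw new Error("text command did not succeed");
  }
  return res.results
    .slice() // copy so the original result is not reordered
    .sort((a, b) => b.score - a.score) // highest score first
    .map((r) => ({ _id: r.obj._id, title: r.obj.title, score: r.score }));
}

// Trimmed copy of the output from the search for "Olivia".
const sampleResult = {
  queryDebugString: "olivia||||||",
  language: "english",
  results: [
    { score: 0.75, obj: { _id: 1, title: "Olivia Shakespear" } }
  ],
  stats: { nscanned: 1, nscannedObjects: 0, n: 1, timeMicros: 128 },
  ok: 1
};

console.log(summarizeTextResults(sampleResult));
```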
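We can then add the text field to the index. Note that a collection can only have one full text index, so I have to drop the original one first, then recreate it across both fields (dropIndexes() removes every index except the one on _id, which is fine for this test collection):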
tc.dropIndexes()
tc.ensureIndex( { "title": "text", "text": "text" } );
and test stemming:
> res = tc.runCommand( "text", { search: "novelists" } );
{
    "queryDebugString" : "novelist||||||",
    "language" : "english",
    "results" : [
        {
            "score" : 0.5116279069767442,
            "obj" : {
                "_id" : 1,
                "title" : "Olivia Shakespear",
                "text" : "Olivia Shakespear (born Olivia Tucker; 17 March 1863 – 3 October 1938) was a British novelist, playwright, and patron of the arts. She wrote six books that are described as \"marriage problem\" novels. Her works sold poorly, sometimes only a few hundred copies. Her last novel, Uncle Hilary, is considered her best. She wrote two plays in collaboration with Florence Farr."
            }
        }
    ],
    "stats" : {
        "nscanned" : 1,
        "nscannedObjects" : 0,
        "n" : 1,
        "timeMicros" : 90
    },
    "ok" : 1
}
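Notice that the queryDebugString shows "novelists" reduced to "novelist" before matching. To illustrate the general idea of stemming and stop words, here is a deliberately naive sketch in plain JavaScript. The suffix rules and the tiny stop word list are my own assumptions for the example; MongoDB's real stemmer and stop word lists are far more complete.

```javascript
// Tiny, hypothetical stop word list for illustration only.
const TOY_STOP_WORDS = new Set(["a", "and", "of", "the", "was"]);

// Naive plural stripping. NOT the stemmer MongoDB uses.
function toyStem(word) {
  const w = word.toLowerCase();
  if (w.length > 4 && w.endsWith("ies")) {
    return w.slice(0, -3) + "y"; // "copies" -> "copy"
  }
  if (w.length > 3 && w.endsWith("s") && !w.endsWith("ss")) {
    return w.slice(0, -1); // "novelists" -> "novelist"
  }
  return w;
}

// Split text into lowercase words, drop stop words, stem what is left.
function toyIndexTerms(text) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 0 && !TOY_STOP_WORDS.has(t))
    .map(toyStem);
}

console.log(toyStem("novelists"));
console.log(toyIndexTerms("The novelists and the novels"));
```

Because both the indexed text and the search string go through the same reduction, a search for the plural can match a document that only contains the singular, which is what the query above demonstrates.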
We can inspect the index we created. The output also shows the language settings, which can be overridden:
> tc.getIndexes()
[
    {
        "v" : 1,
        "key" : {
            "_id" : 1
        },
        "ns" : "test.test",
        "name" : "_id_"
    },
    {
        "v" : 0,
        "key" : {
            "_fts" : "text",
            "_ftsx" : 1
        },
        "ns" : "test.test",
        "name" : "title_text_text_text",
        "weights" : {
            "text" : 1,
            "title" : 1
        },
        "default_language" : "english",
        "language_override" : "language"
    }
]
You can specify the weights and default_language options when creating the index, e.g.
tc.ensureIndex( { "title": "text", "text": "text" }, {weights: { title: 10 }, default_language: "norwegian" } );
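To get a feel for what weighting means, here is a toy scoring sketch in plain JavaScript. I don't know the formula MongoDB actually uses, so this is purely an assumption for illustration: it counts exact word matches per field and multiplies by that field's weight. The toyScore function and the shortened sample document are made up for this example.

```javascript
// Hypothetical scoring: matches in each field multiplied by the field weight.
// MongoDB's real scoring formula is different; this only shows the idea.
function toyScore(doc, term, weights) {
  const needle = term.toLowerCase();
  let score = 0;
  for (const [field, weight] of Object.entries(weights)) {
    const words = String(doc[field] || "").toLowerCase().split(/[^a-z0-9]+/);
    const hits = words.filter((w) => w === needle).length;
    score += hits * weight;
  }
  return score;
}

// Shortened version of the first test document.
const sampleDoc = {
  title: "Olivia Shakespear",
  text: "Olivia Shakespear was a British novelist."
};

console.log(toyScore(sampleDoc, "olivia", { title: 1, text: 1 }));
console.log(toyScore(sampleDoc, "olivia", { title: 10, text: 1 }));
console.log(toyScore(sampleDoc, "novelist", { title: 10, text: 1 }));
```

In this toy model, raising the title weight to 10 makes a title match count for much more than a match in the body text, which is the kind of prioritisation the weights option is meant to provide.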
And that’s basically it (from what I can see from the tests). Nice and simple.