php - How to create a performance system that count tags across a large dynamic dataset -


overview

i have ios app people can search tags of tags pre-defined , user defined.

as user write tags he/she wants search display row shows number of results available tags (see example search picture).

note: #exercise or #routine parent tag meaning person going use 1 of first.

i using php , mongodb server side. though of creating file tags count every hour clients can fetch , minimize resource consumption.

given manipulated information tags user controlled, list expand considerably on time.

the challenge

  • i puzzled on whats best approach when taking account performance & overhead creating, manipulating & storing such list.

my first idea create 2d array (see pic) store values. transformed json can stored mongodb.

but approach make me fetch tags , load them in memory +1 or -1. thus, not think might best.

all manipulation take place on insert,update , delete of each element. there considerable ram overhead.

my second idea create document store used tags , make count queries every hour produce list used clients.

this mean on every delete, update , insert check tag exist on document , create or delete or nothing depending on condition.

every hour, fetch tags, produce array tags combinations. query db against tags combinations , count number of results returned , create file.

this approach, think, better 1 given use mongodb , not mysql. still unsure of performance.

have made similar system , advice on better approach?

example picture of search

2d array

one of important things consider when using mongodb must think in terms of application's access patterns when deciding on database design. try solve problem using example , see how works.

let's consider have following documents in collection , let see below how make simple data format work wonders:

> db.performant.find() { "_id" : objectid("522bf7166094a4e72db22827"), "name" : "abc", "tags" : [  "chest",  "bicep",  "tricep" ] } { "_id" : objectid("522bf7406094a4e72db22828"), "name" : "def", "tags" : [  "routine",  "trufala",  "tricep" ] } { "_id" : objectid("522bf75f6094a4e72db22829"), "name" : "xyz", "tags" : [  "routine",  "mytag",  "tricep",  "mytag2" ] } { "_id" : objectid("522bf7876094a4e72db2282a"), "name" : "mno", "tags" : [  "exercise",  "mytag",  "tricep",  "mytag2",  "biceps" ] } 

first , foremost absolutely must create index on tags. (you can create compound index if desire)

> db.performant.ensureindex({tags:1}) > db.performant.getindexes() [     {             "v" : 1,             "key" : {                     "_id" : 1             },             "ns" : "test.performant",             "name" : "_id_"     },     {             "v" : 1,             "key" : {                     "tags" : 1             },             "ns" : "test.performant",             "name" : "tags_1"     } ] 

to query tags data above collection, 1 use db.performant.find({tags:{$in:["bicep"]}}), not idea. let me show why:

> db.performant.find({tags:{$in:["bicep","chest","trufala"]}}).explain() {     "cursor" : "btreecursor tags_1 multi",     "ismultikey" : true,     "n" : 2,     "nscannedobjects" : 3,     "nscanned" : 5,     "nscannedobjectsallplans" : 3,     "nscannedallplans" : 5,     "scanandorder" : false,     "indexonly" : false,     "nyields" : 0,     "nchunkskips" : 0,     "millis" : 0,     "indexbounds" : {             "tags" : [                     [                             "bicep",                             "bicep"                     ],                     [                             "chest",                             "chest"                     ],                     [                             "trufala",                             "trufala"                     ]             ]     },     "server" : "none-6674b8f4f2:27017" } 

as might have noticed, query performing entire collection scan. may make wonder, why hack added index, if not used , wondering too. unfortunately, issue yet resolved mongodb (at least in knowledge)

fortunately though, can around problem , still use index created on tags collection. here how:

> db.performant.find({$or:[{tags:"bicep"},{tags:"chest"},{tags:"trufala"}]}).explain() {     "clauses" : [             {                     "cursor" : "btreecursor tags_1",                     "ismultikey" : true,                     "n" : 1,                     "nscannedobjects" : 1,                     "nscanned" : 1,                     "nscannedobjectsallplans" : 1,                     "nscannedallplans" : 1,                     "scanandorder" : false,                     "indexonly" : false,                     "nyields" : 0,                     "nchunkskips" : 0,                     "millis" : 10,                     "indexbounds" : {                             "tags" : [                                     [                                             "bicep",                                             "bicep"                                     ]                             ]                     }             },             {                     "cursor" : "btreecursor tags_1",                     "ismultikey" : true,                     "n" : 0,                     "nscannedobjects" : 1,                     "nscanned" : 1,                     "nscannedobjectsallplans" : 1,                     "nscannedallplans" : 1,                     "scanandorder" : false,                     "indexonly" : false,                     "nyields" : 0,                     "nchunkskips" : 0,                     "millis" : 0,                     "indexbounds" : {                             "tags" : [                                     [                                             "chest",                                             "chest"                                     ]                             ]                     }             },             {                     "cursor" : "btreecursor tags_1",                     "ismultikey" : true,                     "n" : 1,                     "nscannedobjects" : 1,                     "nscanned" : 1,                     "nscannedobjectsallplans" : 1,                     "nscannedallplans" : 1,                     "scanandorder" : false,                     "indexonly" : false,                     "nyields" : 0,                     "nchunkskips" : 0,                     "millis" : 0,                     "indexbounds" : {                             "tags" : [                                     [                                             "trufala",                                             "trufala"                                     ]                             ]                     }             }     ],     "n" : 2,     "nscannedobjects" : 3,     "nscanned" : 3,     "nscannedobjectsallplans" : 3,     "nscannedallplans" : 3,     "millis" : 10,     "server" : "none-6674b8f4f2:27017" } 

as can see, n very close nscanned. there 3 records scanned 1 each corresponding "bicep","chest","trufala". since "bicep" , "chest" belong same document, 1 result returned corresponding it. in general, both count() , find() limited scans , efficient. also, never have serve user stale data. can avoid running sorts of batch jobs whatsoever!!!

so, using approach can derive following conclusion: if search n tags, , each tag present m times, total documents scanned n * m. considering have huge number of tags , huge number of documents, , scan few tags (which in turn correspond few documents - although not 1:1), results super fast, since 1 document scan occurs per tag , document combination.

note: index can never covered approach, since there index on array i.e. "ismultikey" : true. can read more on covered indexes here

limitations: every approach has limitations, , 1 has too!! sorting on results yield extremely poor performance scan entire collection number of times equal tags passed query plus scans additional records corresponding each argument of $or.

> db.performant.find({$or:[{tags:"bicep"},{tags:"chest"},{tags:"trufala"}]}).sort({tags:1}).explain() {     "cursor" : "btreecursor tags_1",     "ismultikey" : true,     "n" : 2,     "nscannedobjects" : 15,     "nscanned" : 15,     "nscannedobjectsallplans" : 15,     "nscannedallplans" : 15,     "scanandorder" : false,     "indexonly" : false,     "nyields" : 0,     "nchunkskips" : 0,     "millis" : 0,     "indexbounds" : {             "tags" : [                     [                             {                                     "$minelement" : 1                             },                             {                                     "$maxelement" : 1                             }                     ]             ]     },     "server" : "none-6674b8f4f2:27017" }  

in case, scans 15 times equal 3 full collection scans of 4 records each plus 3 records scanned each parameter $or

final verdict: use approach efficient results if ok unsorted results or willing make effort @ front end sort them yourself.


Comments

Popular posts from this blog

html - How to style widget with post count different than without post count -

How to remove text and logo OR add Overflow on Android ActionBar using AppCompat on API 8? -

javascript - storing input from prompt in array and displaying the array -