《6 Rules of Thumb for MongoDB Schema Design》 Part 1 翻译和笔记

原帖位于IT老兵博客,沉淀着一个IT老兵对于这个行业的多年的认知。

MongoDB如何设计数据模型。

前言

在工作中遇到了要使用MongoDB,学习MongoDB,肯定不能仅仅停留于对一些指令的简单操作的掌握,就像当初学习MySQL一样,要了解一下如何使用MongoDB来设计数据库。这里,找到一篇很好的文章,转载在下面,配上一定的翻译和学习笔记,原文也不是很复杂,贴上原文,是为了不误导读者,也不误导自己,将来可以不断再纠正其中理解不准确的地方。

正文

By William Zola, Lead Technical Support Engineer at MongoDB

“I have lots of experience with SQL, but I’m just a beginner with MongoDB. How do I model a one-to-N relationship?” This is one of the more common questions I get from users attending MongoDB office hours.

I don’t have a short answer to this question, because there isn’t just one way, there’s a whole rainbow’s worth of ways. MongoDB has a rich and nuanced vocabulary for expressing what, in SQL, gets flattened into the term “One-to-N”. Let me take you on a tour of your choices in modeling One-to-N relationships.

笔记: MongoDB的新手往往会遇到一个问题,我应该怎么去定义一个one-to-N的关系呢?“there’s a whole rainbow’s worth of ways. ”这句应该怎么理解呢?

There’s so much to talk about here, I’m breaking this up into three parts. In this first part, I’ll talk about the three basic ways to model One-to-N relationships. In the second part I’ll cover more sophisticated schema designs, including denormalization and two-way referencing. And in the final part, I’ll review the entire rainbow of choices, and give you some suggestions for choosing among the thousands (really – thousands) of choices that you may consider when modeling a single One-to-N relationship.

笔记: 这里有很多需要讨论,笔记会将它分为三个部分来讨论。第一部分,也就是本篇文章,来讨论三种建立One-to-N关系模型的基本的方法;第二部分,讨论更复杂的模型设计,包括反范式(denormalization)和双向参考(two-way referencing);最后一部分,将会复习整个选择的过程,并且给你们一些建立,来在上千的建立一个One-to-N关系的选择中做出判断。

Many beginners think that the only way to model “One-to-N” in MongoDB is to embed an array of sub-documents into the parent document, but that’s just not true. Just because you can embed a document, doesn’t mean you should embed a document.

笔记: 很多初学者会认为在MongoDB中建立一个“One-to-N”的模型只有一种方法,就是嵌入一个子文档的数组(array),这不是事实。确实是这样,看到的很多帖子就是这么去误导别人。

When designing a MongoDB schema, you need to start with a question that you’d never consider when using SQL: what is the cardinality of the relationship? Put less formally: you need to characterize your “One-to-N” relationship with a bit more nuance: is it “one-to-few”, “one-to-many”, or “one-to-squillions”? Depending on which one it is, you’d use a different format to model the relationship.

笔记: 在开始设计一个MongoDB的模式时,你需要考虑一个在使用SQL从来不需要考虑的问题:关系的基数是什么?具体来说,就是要考虑“one-to-few”,“one-to-many”, 或者“one-to-squillions”,这个基数不同,设计的格式也不同。

Basics: Modeling One-to-Few

An example of “one-to-few” might be the addresses for a person. This is a good use case for embedding – you’d put the addresses in an array inside of your Person object:

1
2
3
4
5
6
7
8
9
db.person.findOne()
{
name: 'Kate Monster',
ssn: '123-456-7890',
addresses : [
{ street: '123 Sesame St', city: 'Anytown', cc: 'USA' },
{ street: '123 Avenue Q', city: 'New York', cc: 'USA' }
]
}

This design has all of the advantages and disadvantages of embedding. The main advantage is that you don’t have to perform a separate query to get the embedded details; the main disadvantage is that you have no way of accessing the embedded details as stand-alone entities.

笔记: 上面这是一个常见One-to-Few的例子,个人信息和地址的关系。好处在于你不用单独执行一个查询去获取嵌入的信息;坏处在于你无法根据作为一个单独的条目去访问一个嵌入的内容。这个例子很形象,在那本MySQL实例中,也涉及到人和地址的关系处理。就是说大千世界的一对多的关系其实不是那么一刀切的,而SQL对这个的处理能力是有限的,或者说SQL原本的设计是没有太多考虑这个因素的。这个应该结合那本书一起来讨论,待完成……

For example, if you were modeling a task-tracking system, each Person would have a number of Tasks assigned to them. Embedding Tasks inside the Person document would make queries like “Show me all Tasks due tomorrow” much more difficult than they need to be. I will cover a more appropriate design for this use case in the next post.

Basics: One-to-Many

An example of “one-to-many” might be parts for a product in a replacement parts ordering system. Each product may have up to several hundred replacement parts, but never more than a couple thousand or so. (All of those different-sized bolts, washers, and gaskets add up.) This is a good use case for referencing – you’d put the ObjectIDs of the parts in an array in product document. (For these examples I’m using 2-byte ObjectIDs because they’re easier to read: real-world code would use 12-byte ObjectIDs.)

Each Part would have its own document:

1
2
3
4
5
6
7
8
db.parts.findOne()
{
_id : ObjectID('AAAA'),
partno : '123-aff-456',
name : '#4 grommet',
qty: 94,
cost: 0.94,
price: 3.99

Each Product would have its own document, which would contain an array of ObjectID references to the Parts that make up that Product:

1
2
3
4
5
6
7
8
9
10
11
db.products.findOne()
{
name : 'left-handed smoke shifter',
manufacturer : 'Acme Corp',
catalog_number: 1234,
parts : [ // array of references to Part documents
ObjectID('AAAA'), // reference to the #4 grommet above
ObjectID('F17C'), // reference to a different Part
ObjectID('D2AA'),
// etc
]

You would then use an application-level join to retrieve the parts for a particular product:

1
2
3
4
// Fetch the Product document identified by this catalog number
product = db.products.findOne({catalog_number: 1234});
// Fetch all the Parts that are linked to this Product
product_parts = db.parts.find({_id: { $in : product.parts } } ).toArray() ;

笔记: 这个例子是产品和配件的关系,是One-to-Many的关系。产品会有很多的配件,所以这里使用ObjectID来关联,这是一个单项关联。这个例子也是很常见的用来描述One-to-Many关系的。

For efficient operation, you’d need to have an index on ‘products.catalog_number’. Note that there will always be an index on ‘parts._id’, so that query will always be efficient.

This style of referencing has a complementary set of advantages and disadvantages to embedding. Each Part is a stand-alone document, so it’s easy to search them and update them independently. One trade off for using this schema is having to perform a second query to get details about the Parts for a Product. (But hold that thought until we get to denormalizing in part 2.)

笔记: 好处在于每一个配件都有一个独立的文档,很容易查询和更新。交换就是需要单独执行一个查询去获取配件信息。

As an added bonus, this schema lets you have individual Parts used by multiple Products, so your One-to-N schema just became an N-to-N schema without any need for a join table!

Basics: One-to-Squillions

An example of “one-to-squillions” might be an event logging system that collects log messages for different machines. Any given host could generate enough messages to overflow the 16 MB document size, even if all you stored in the array was the ObjectID. This is the classic use case for “parent-referencing” – you’d have a document for the host, and then store the ObjectID of the host in the documents for the log messages.

1
2
3
4
5
6
7
8
9
10
11
12
13
db.hosts.findOne()
{
_id : ObjectID('AAAB'),
name : 'goofy.example.com',
ipaddr : '127.66.66.66'
}

db.logmsg.findOne()
{
time : ISODate("2014-03-28T09:42:41.382Z"),
message : 'cpu is on fire!',
host: ObjectID('AAAB') // Reference to the Host document
}

You’d use a (slightly different) application-level join to find the most recent 5,000 messages for a host:

1
2
3
4
// find the parent ‘host’ document
host = db.hosts.findOne({ipaddr : '127.66.66.66'}); // assumes unique index
// find the most recent 5000 log message documents linked to that host
last_5k_msg = db.logmsg.find({host: host._id}).sort({time : -1}).limit(5000).toArray()

笔记: 主机和日志的关系来体现One-to-Squillions,区别在于关系建立在了孩子身上,孩子指向了父亲。

Recap

So, even at this basic level, there is more to think about when designing a MongoDB schema than when designing a comparable relational schema. You need to consider two factors:

Will the entities on the “N” side of the One-to-N ever need to stand alone?
What is the cardinality of the relationship: is it one-to-few; one-to-many; or one-to-squillions?

笔记:

在设计关系时,你需要考虑两个因素:

One-to-N的“N”这边需要单独作为一个条目吗?
关系的基数是什么:one-to-few;one-to-many;或者 one-to-squillions?
Based on these factors, you can pick one of the three basic One-to-N schema designs:

Embed the N side if the cardinality is one-to-few and there is no need to access the embedded object outside the context of the parent object
Use an array of references to the N-side objects if the cardinality is one-to-many or if the N-side objects should stand alone for any reasons
Use a reference to the One-side in the N-side objects if the cardinality is one-to-squillions
笔记:

基于这些因素,你可以考虑这三个基本模式设计:

如果基数是one-to-few,并且在父对象的上下文之外没有访问嵌入的对象的需求,那么嵌入N边。
如果基数是one-to-many,或者N边的对象基于一些原因需要单独展示,那么使用一个数组来指向N边的对象。
如果基数是one-to-squillions,使用一个参考去指向One那边。

总结

学习和梳理了这篇文章,感觉思路清晰了很多,MongoDB是在One-to-N这个领域做了很多设计,这可能也是跟当前的One-to-N的需求越来越多,而SQL对这个支持有限有关系。

待办的事情,配合总结一下MySQL的设计模式。