What does relational mean? Basic concepts of algorithmic language


External level: the topmost level, where each external model has its own view of the data. This level defines the database from the viewpoint of individual applications.

Conceptual level: The central control link, where the database is presented in the most general form, which combines the data used by all applications. In fact, the conceptual level reflects a generalized model of the subject area.

Physical level (the database itself): the data itself, stored in files or in page structures on external storage media.


Data Models

The following data models are distinguished:

1. Infological

2. Datalogical

3. Physical

The database design process begins with the design of an infological model. An infological data model is a generalized informal description of the database being created, written using natural language, mathematical formulas, tables, graphs and other tools that are understandable to everyone working on the database design.


The infological model reflects the real world in terms of concepts that people understand, completely independently of the data storage environment. Therefore, the infological model should not change until some change in the real world requires its definition to change so that the model continues to represent the subject area.

There are many approaches to building this model: graph models, semantic networks, the entity-relationship model and others.

Datalogical model

The infological model must be mapped to a datalogical model that the DBMS can understand. A datalogical model is a formal description of the infological model in the language of the DBMS.

Hierarchical model

This model is a collection of related elements that form a hierarchical structure. The basic concepts of hierarchy include level, node, and relationship.



A node is a collection of data attributes that describe an object. Each node is connected to exactly one node at a higher level and to any number of nodes at a lower level. The exception is the node at the highest level. The number of trees in the database is determined by the number of root records. Each database record has a single path from the root record. A simple example is the Internet domain name system or a postal address: at the first level (the root of the tree) lies our planet Earth, at the second the country, at the third the region, at the fourth the locality, followed by the street, the house and the flat. A typical representative is IMS, a DBMS from IBM.

All instances of a given descendant type that share a common instance of the ancestor type are called twins. A complete traversal order is defined for the database: from top to bottom and from left to right.
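The hierarchical structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Node` class, the helper functions and the sample place names are hypothetical and are not taken from IMS or any other DBMS.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass(eq=False)
class Node:
    """One record of a hierarchical database: a name plus its child records."""
    name: str
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, child: "Node") -> "Node":
        # Every non-root node gets exactly one parent.
        child.parent = self
        self.children.append(child)
        return child


def traverse(node: Node) -> Iterator[Node]:
    """Visit records from top to bottom and from left to right (pre-order)."""
    yield node
    for child in node.children:
        yield from traverse(child)


def path_from_root(node: Node) -> List[str]:
    """Return the single path that leads from the root to this record."""
    path = []
    current: Optional[Node] = node
    while current is not None:
        path.append(current.name)
        current = current.parent
    return list(reversed(path))


# Hypothetical address hierarchy, loosely following the example in the text.
earth = Node("Earth")
country = earth.add(Node("Country"))
region_a = country.add(Node("Region A"))
region_b = country.add(Node("Region B"))  # a twin of Region A: same parent
town = region_a.add(Node("Town"))

order = [n.name for n in traverse(earth)]
town_path = path_from_root(town)
print(order)
print(town_path)
```

Because each node stores a single `parent`, there is only one way to walk from any record back to the root, which is the property the text attributes to hierarchical databases.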

Physical model

A physical model is built based on the datalogical model. The physical organization of data has a major impact on database performance. DBMS developers are trying to create the most productive physical data models, offering users one or another tool to customize the model for a specific database.

For a relational database, for example, the physical model takes into account:

1. The physical aspects of storing tables in specific files.

2. The creation of indexes that speed up the data operations performed by the application.

3. The execution of various actions on the data when certain user-defined events occur, using triggers and stored procedures.
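As a rough illustration of the second and third points, the sketch below uses Python's built-in `sqlite3` module. The table, index and trigger names are invented for the example. SQLite has no stored procedures, so only an index and a trigger are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
conn.executescript("""
    CREATE TABLE payments (
        kod  INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        summ REAL NOT NULL
    );

    -- An index that can speed up lookups by name.
    CREATE INDEX idx_payments_name ON payments (name);

    -- A log table filled automatically by the trigger below.
    CREATE TABLE payments_log (kod INTEGER, old_summ REAL, new_summ REAL);

    CREATE TRIGGER trg_payments_update AFTER UPDATE OF summ ON payments
    BEGIN
        INSERT INTO payments_log VALUES (OLD.kod, OLD.summ, NEW.summ);
    END;
""")

conn.execute("INSERT INTO payments VALUES (5216, 'Kireeva', 25.50)")
conn.execute("UPDATE payments SET summ = 30.00 WHERE kod = 5216")  # fires the trigger

log_rows = conn.execute(
    "SELECT kod, old_summ, new_summ FROM payments_log"
).fetchall()
print(log_rows)
```

The application never writes to `payments_log` directly; the row appears there because the DBMS runs the trigger when the update event occurs.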



At every level, and for any method of representing the subject area, the basis is the encoding of concepts and of the relationships between concepts. A key step in the development of any information system is system analysis:

Formalization of the subject area and representation of the system as a set of components.

Decomposition, as the basis of system analysis, can be functional (building a hierarchy).

However, in most systems that involve databases, the data types are a more static element than the ways in which the data is processed. Therefore, methods of system analysis such as data flow diagrams have been developed intensively. The development of relational databases stimulated the development of data modelling methodologies, in particular ER diagrams. The relational data model directly uses the concept of a relation as a mapping. It is the model closest to the conceptual representation of data and often lies at its heart.

Unlike graph-theoretic models, in the relational model links between relations are implemented implicitly, by means of relation keys. For example, links of a hierarchical type are implemented by the mechanism of primary and foreign keys, in which the key attributes of the main relation must also be present in the subordinate relation.

Such an attribute is called a primary key in the main relation and a foreign (secondary) key in the subordinate relation.

Progress in the development of programming languages, associated primarily with data typing and the emergence of object-oriented languages, has made it possible to approach the analysis of complex systems from the point of view of hierarchical representations, that is, using classes of objects with the properties of polymorphism, inheritance, and encapsulation.

A RELATION IS A TABLE.



Relational database model

Relational data models have currently gained the greatest popularity precisely for this representation of data.

The relational model can be thought of as a special method of representing data that contains its own data (in the form of tables) and ways of working and manipulating them (in the form of relationships). The relational model assumes three conceptual elements: Structure, Integrity and Data Processing. These elements have their own mandatory concepts that need to be explained for further presentation.

The table is regarded as the direct store of data. Traditionally, in relational systems a table is called a relation, a table row is called a tuple, and a column is called an attribute. Attributes have unique names within a relation.

The number of tuples in a table is called its cardinality, and the number of attributes its degree. A unique identifier is established for a relation, that is, one or more attributes whose combined values never coincide in two tuples; this identifier is called the primary key. A domain is the set of valid homogeneous values for a particular attribute. Thus, a domain can be regarded as a named set of data whose elements are logically indivisible units (for example, a list of the surnames of an institution's employees can act as a domain, although not every surname need be present in the table).

KOD    NAME       SUMM
5216   Kireeva    25.50
…      Motyleva   17.05
…      …          …

The fields KOD, NAME, SUMM are table attributes contained in the header.

The pairs (KOD, 5216), (NAME, Kireeva), (SUMM, 25.50) together form one tuple, an element of the body of the relation.
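The terms header, body, degree, cardinality and key can be made concrete with a small Python sketch. The representation (a tuple of attribute names plus a set of rows) is an assumption chosen for brevity, and the KOD value 5217 is made up because the source table does not show it.

```python
from typing import FrozenSet, Sequence, Tuple

Header = Tuple[str, ...]
Row = Tuple[object, ...]

# Header: the attribute names. Body: the set of tuples.
header: Header = ("KOD", "NAME", "SUMM")
body: FrozenSet[Row] = frozenset({
    (5216, "Kireeva", 25.50),
    (5217, "Motyleva", 17.05),  # KOD 5217 is a made-up value for illustration
})

degree = len(header)     # number of attributes
cardinality = len(body)  # number of tuples


def is_unique(header: Header, body: FrozenSet[Row], attrs: Sequence[str]) -> bool:
    """True if no two tuples of the body agree on all of the given attributes."""
    positions = [header.index(a) for a in attrs]
    projected = {tuple(row[i] for i in positions) for row in body}
    return len(projected) == len(body)


print(degree, cardinality, is_unique(header, body, ["KOD"]))
```

Here `is_unique` only checks the sample data. Whether KOD really is a primary key is a design decision about all possible contents of the table, not something one instance can prove.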

In relational databases, unlike other models, the user specifies what data is needed, not how to obtain it. For this reason, navigating the database in relational systems is automatic; this task is performed by the DBMS optimizer. Its job is to find the most efficient way to retrieve the requested data from the database. At a minimum, therefore, the optimizer must be able to determine which tables the data is selected from, how much information those tables hold, what the physical order of the records in the tables is, and how the records are grouped.

In addition, a relational DBMS also maintains a catalog. The catalog stores a description of all the objects that make up the database: tables, indexes, triggers and so on. The catalog is clearly vital to the correct operation of the whole system, and in particular to a component such as the optimizer, which uses the information stored in the catalog. Interestingly, the catalog is itself a set of tables, so the DBMS can manipulate it in the usual ways without resorting to any special techniques or methods.
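SQLite gives a convenient way to see that the catalog is itself a table: its schema table `sqlite_master` can be queried with ordinary SELECT statements. The sketch below uses Python's `sqlite3` module; the table and index names are invented for the example, and the exact wording of the query plan differs between SQLite versions, so it is printed rather than checked.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (stud_nom INTEGER PRIMARY KEY, stud_name TEXT)")
conn.execute("CREATE INDEX idx_students_name ON students (stud_name)")

# The catalog is queried like any other table.
catalog = conn.execute(
    "SELECT type, name, tbl_name FROM sqlite_master ORDER BY name"
).fetchall()
print(catalog)

# Ask the optimizer how it would run a query; the user only stated WHAT is needed.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT stud_nom FROM students WHERE stud_name = 'Kireeva'"
).fetchall()
for row in plan:
    print(row)
```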

Domains and Relationships

Basic definitions: Domains, types of relations, predicates.

Relations have a number of basic properties:

1. In the most general case, a relation contains no identical tuples; this follows from the very definition of a relation as a set. However, some DBMSs allow deviations from this property in certain cases. As long as a relation has a primary key, identical tuples are excluded.

2. Tuples are not ordered from top to bottom: there is simply no concept of a positional number in a relation. The tuples of a relation can be arranged in any order without losing information.

3. Attributes are not ordered from left to right. The attributes in the header of a relation can be arranged in any order without compromising the integrity of the data. Therefore, the concept of a positional number does not exist for attributes either.

4. Attribute values consist of logically indivisible units; this follows from the fact that the values are taken from domains. Put another way, relations do not contain repeating groups, that is, they are normalized.
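The first three properties can be checked with Python sets, which also ignore duplicates and ordering. This is a minimal sketch: modelling a tuple as a `frozenset` of (attribute, value) pairs is an assumption made for the illustration, and the KOD value 5217 is invented.

```python
def make_tuple(**attrs):
    """A tuple of a relation as an unordered set of (attribute, value) pairs."""
    return frozenset(attrs.items())


r1 = {
    make_tuple(KOD=5216, NAME="Kireeva", SUMM=25.50),
    make_tuple(KOD=5217, NAME="Motyleva", SUMM=17.05),
}

# Same data with tuples and attributes listed in a different order,
# plus one duplicate tuple.
r2 = {
    make_tuple(SUMM=17.05, KOD=5217, NAME="Motyleva"),
    make_tuple(NAME="Kireeva", SUMM=25.50, KOD=5216),
    make_tuple(KOD=5216, NAME="Kireeva", SUMM=25.50),  # duplicate, silently dropped
}

same_relation = (r1 == r2)
print(same_relation, len(r2))
```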

Relational systems support several kinds of relations:

1. Named relations are relation variables defined in the DBMS by creation statements and, as a rule, needed for a more convenient presentation of information to the user.

2. Base relations are an immediately important part of the database, so they are given their own names at design time.

3. A derived relation is one defined through other, usually base, relations by means of DBMS tools.

4. A view is in effect a named derived relation. Views are expressed exclusively through DBMS operators applied to named relations, so they do not physically exist in the database.

5. A query result is an unnamed derived relation containing the data produced by a specific query. The result is not stored in the database; it exists only as long as the user needs it.

6. A stored relation is one that is physically maintained in storage. Stored relations are most often base relations.

Based on the above, a relational database can be defined as a set of interconnected relations.
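The difference between a base relation, a view and a query result can be seen in a short `sqlite3` session. All table, view and column names and the sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Base (stored) relation.
    CREATE TABLE students (
        stud_nom  INTEGER PRIMARY KEY,
        stud_name TEXT,
        stud_stip REAL
    );
    INSERT INTO students VALUES (1, 'Kireeva', 25.50), (2, 'Motyleva', 17.05);

    -- View: a named derived relation, stored only as a definition.
    CREATE VIEW well_paid AS
        SELECT stud_nom, stud_name FROM students WHERE stud_stip > 20;
""")

# Query result: an unnamed derived relation that exists only in this variable.
view_rows = conn.execute("SELECT * FROM well_paid ORDER BY stud_nom").fetchall()

# The view holds no data of its own, so it reflects later changes to the base relation.
conn.execute("INSERT INTO students VALUES (3, 'Ivanova', 40.00)")  # made-up row
view_rows_after = conn.execute("SELECT * FROM well_paid ORDER BY stud_nom").fetchall()
print(view_rows, view_rows_after)
```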


A link in this case is an association between two or more relations.

A one-to-many link means that at any given time each element (tuple) of A corresponds to several elements (tuples) of B.

[Figure: a binary link (1 to ∞) between relations with attributes KOD and ADRES; a ternary link between Students, Teachers and Timetable of classes.]


Data integrity

In relational models, data integrity is given a special place. Recall that a key, or candidate key, is a minimal set of attributes whose values uniquely identify the required tuple; minimality means that if any attribute is removed from the set, the remaining attributes no longer identify the tuple.

Every relation has at least one candidate key. One of them is chosen as the primary key.

When choosing a primary key, preference should be given to non-composite keys or to keys made up of the fewest attributes. It is also undesirable to use keys with long text values (integer attributes are preferable as keys). For example, to identify an employee you can use a unique personnel number, a passport number, or the combination of last name, middle name and department number. No attribute that is part of the primary key may take an undefined value. Otherwise a contradictory situation (a collision) arises: a non-unique primary key value appears. This should therefore be watched carefully when designing a database.
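The minimality requirement can be illustrated with a small search over attribute subsets. This is a sketch under assumptions: the employee rows and attribute names are invented, and the function only reports which attribute sets happen to be unique and minimal in the sample, which is weaker than a real design decision about every possible state of the relation.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def is_unique(rows: List[Row], attrs: Sequence[str]) -> bool:
    """True if no two rows agree on all of the given attributes."""
    seen = {tuple(row[a] for a in attrs) for row in rows}
    return len(seen) == len(rows)


def candidate_keys(rows: List[Row], all_attrs: Sequence[str]) -> List[Tuple[str, ...]]:
    """Attribute sets that are unique in the sample and have no unique proper subset."""
    keys: List[Tuple[str, ...]] = []
    for size in range(1, len(all_attrs) + 1):
        for attrs in combinations(all_attrs, size):
            if any(set(k) <= set(attrs) for k in keys):
                continue  # contains a smaller key, so it is not minimal
            if is_unique(rows, attrs):
                keys.append(attrs)
    return keys


# Hypothetical employee data.
employees = [
    {"personnel_no": 1, "passport_no": "A1", "last_name": "Kireeva", "department": 10},
    {"personnel_no": 2, "passport_no": "B2", "last_name": "Kireeva", "department": 20},
    {"personnel_no": 3, "passport_no": "C3", "last_name": "Motyleva", "department": 10},
]
attrs = ("personnel_no", "passport_no", "last_name", "department")
keys = candidate_keys(employees, attrs)
print(keys)
```

Following the advice in the text, the single integer attribute `personnel_no` would be the natural choice of primary key among the candidates found.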

Now for foreign keys. Note that since relation C links relations A and B, it must include foreign keys corresponding to the primary keys of relations A and B.

The foreign keys of such a linking table are formed from the primary keys of the other tables.

Thus, when considering how to link relations in a database, the question arises of what the foreign keys should be. In addition, for every foreign key you must decide whether it may take undefined values (NULL values, that is, attribute values that stand for missing information). In other words, can a relation contain a tuple for which the corresponding tuple in the associated relation is unknown?

On the other hand, you need to think in advance about what will happen when tuples are removed from a relation that is referenced by a foreign key. The following options exist:

· The operation is cascaded: deleting tuples in one relation leads to the deletion of the associated tuples in the related relation. For example, deleting an employee's last name, first name and so on from one relation leads to the deletion of his salary from another relation.

· The operation is restricted: only tuples that have no associated information elsewhere are removed. If related information exists, the deletion cannot be carried out, because removing information that is used in another relation would violate data integrity. For example, deleting an employee's first name, last name and so on is possible only if there is no related information about his salary.

You also need to decide what will happen on an attempt to update the primary key of a relation that is referenced by a foreign key. The options are the same as for deletion:

· The operation is cascaded: when the primary key is updated, the foreign key in the related relation is updated too. For example, updating the primary key in the relation that stores employee information leads to an update of the foreign key in the relation that contains salary information.

· The operation is restricted: only primary keys that have no associated information elsewhere are updated. If such information exists, the update cannot be made. For example, updating the primary key in the relation that stores employee information is possible only if the related relation contains no information about his salary.
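Both behaviours can be tried out in SQLite through Python's `sqlite3` module. The sketch below is an illustration with invented table names: `salaries` uses cascading actions and `bonuses` uses restricting ones. Note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when asked
conn.executescript("""
    CREATE TABLE employees (emp_no INTEGER PRIMARY KEY, name TEXT NOT NULL);

    CREATE TABLE salaries (
        emp_no INTEGER NOT NULL REFERENCES employees (emp_no)
               ON DELETE CASCADE ON UPDATE CASCADE,
        amount REAL NOT NULL
    );

    CREATE TABLE bonuses (
        emp_no INTEGER NOT NULL REFERENCES employees (emp_no)
               ON DELETE RESTRICT ON UPDATE RESTRICT,
        amount REAL NOT NULL
    );

    INSERT INTO employees VALUES (1, 'Kireeva'), (2, 'Motyleva');
    INSERT INTO salaries VALUES (1, 25.50), (2, 17.05);
    INSERT INTO bonuses VALUES (2, 5.00);
""")

# Cascaded update: the foreign key in salaries follows the new primary key value.
conn.execute("UPDATE employees SET emp_no = 10 WHERE emp_no = 1")
after_update = conn.execute("SELECT emp_no FROM salaries ORDER BY emp_no").fetchall()

# Cascaded delete: the salary row disappears together with the employee.
conn.execute("DELETE FROM employees WHERE emp_no = 10")
after_delete = conn.execute("SELECT emp_no FROM salaries ORDER BY emp_no").fetchall()

# Restricted delete: employee 2 still has a bonus row, so the DBMS refuses.
try:
    conn.execute("DELETE FROM employees WHERE emp_no = 2")
    restricted = False
except sqlite3.IntegrityError:
    restricted = True

remaining = conn.execute("SELECT emp_no FROM employees").fetchall()
print(after_update, after_delete, restricted, remaining)
```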


Relational algebra

The formal basis of the relational data model is relational algebra, which is based on set theory and defines special operators over relations, and relational calculus, which is based on mathematical logic.


It should be noted that relational algebra is very powerful: complex queries to the database can be expressed in a single expression. It is for this reason that these mechanisms are included in the relational data model. A specific language for manipulating relational databases is called relationally complete if any query that can be expressed by one relational algebra expression or one relational calculus formula can be expressed by a single statement of that language.

Relational algebra has an important property: it is closed with respect to the concept of a relation. This means that relational algebra expressions are evaluated over the relations of a relational database, and the results of their evaluation are also relations.

The main idea of relational algebra is that, since relations are sets, the means of manipulating relations can be based on the traditional set-theoretic operations, supplemented by some operations specific to databases.

Let us describe the version of the algebra that was proposed by Codd. It consists of eight main operations:

· Restriction (selection) of a relation (unary operation)

· Projection of a relation (unary operation)

· Union of relations (binary operation)

· Intersection of relations (binary operation)

· Difference of relations (binary operation)

· Product of relations (binary operation)

· Join of relations (binary operation)

· Division of relations (binary operation)

These operations can be explained as follows:

· The result of restricting a relation by some condition is a relation that includes only those tuples of the original relation that satisfy the condition.

· Projecting a relation onto a given set of its attributes yields a relation whose tuples are taken from the corresponding tuples of the original relation, keeping only those attributes.

· The union of two relations yields a relation that includes all tuples that belong to at least one of the operand relations.

· The intersection of two relations yields a relation that includes all tuples that belong to both operand relations.

· The difference of two relations yields a relation that includes all tuples that belong to the first relation, except those that also belong to the second relation.

· The direct product of two relations yields a relation whose tuples are concatenations of tuples of the first and second relations.

· Joining two relations by some condition yields a relation whose tuples are concatenations of tuples of the first and second relations that satisfy the condition.

· Relational division has two operands: a binary relation (with two attributes) and a unary relation (with one attribute). The result is a relation made up of one-attribute tuples containing those values of the first attribute of the first relation for which the set of values of the second attribute (with the first attribute fixed) includes every value in the second relation.

In addition to the above, there are a number of special operations specific to working with databases:

· The result of the rename operation is a relation whose body coincides with the body of the original relation but whose attribute names are changed.
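Most of the operations listed above fit in a few lines of Python if a relation is modelled as a set of tuples and each tuple as a `frozenset` of (attribute, value) pairs. This representation and the function names are assumptions made for the sketch, and the sample rows are invented. Union, intersection and difference are simply the built-in set operators `|`, `&` and `-`.

```python
def rel(*rows):
    """Build a relation from dictionaries that all use the same attribute names."""
    return {frozenset(row.items()) for row in rows}


def select(r, pred):
    """Restriction: keep the tuples for which the predicate holds."""
    return {t for t in r if pred(dict(t))}


def project(r, attrs):
    """Projection: keep only the given attributes; duplicates vanish automatically."""
    return {frozenset((a, v) for a, v in t if a in attrs) for t in r}


def rename(r, old, new):
    """Rename one attribute; the values stay the same."""
    return {frozenset((new if a == old else a, v) for a, v in t) for t in r}


def product(r, s):
    """Extended direct product; the attribute names of r and s must not overlap."""
    return {t | u for t in r for u in s}


a = rel({"KOD": 1, "NAME": "Kireeva"}, {"KOD": 2, "NAME": "Motyleva"})
b = rel({"KOD": 2, "NAME": "Motyleva"}, {"KOD": 3, "NAME": "Ivanova"})

union = a | b
intersection = a & b
difference = a - b
names = project(a, {"NAME"})
late = select(a, lambda t: t["KOD"] > 1)

# Rename b's attributes first so the two headers do not intersect.
b_renamed = rename(rename(b, "KOD", "KOD2"), "NAME", "NAME2")
pairs = product(a, b_renamed)
print(len(union), len(intersection), len(difference), len(pairs))
```

Every function takes relations and returns a relation, so the calls can be nested freely, which is the closure property in practice.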

Since the result of any relational operation is itself a relation, relational expressions can be formed in which a nested relational expression is used in place of an original relation (operand). This is possible because the operations of relational algebra are indeed closed with respect to the concept of a relation.

Let us start with the union operation (what follows applies equally to intersection and difference). In relational algebra, the result of a union must be a relation. If relational algebra allowed the union of two arbitrary relations with different sets of attributes, the result of such an operation would be a set, but a set of tuples of different types, which is, generally speaking, not a relation. If we require relational algebra to be closed with respect to the concept of a relation, such a union is meaningless. This leads to the concept of union compatibility: two relations are union-compatible only if they have the same headers, that is, the same set of attribute names, with attributes of the same name defined on the same domain.

If two relations are union-compatible, then the usual union, intersection and difference operations on them yield a relation with a correctly defined header that matches the header of each operand. If two relations are not fully union-compatible, that is, they are compatible in everything except attribute names, then they can be made fully union-compatible by applying a rename operation before the union-type operation is performed.

The direct product of two relations raises new problems. In set theory, the direct product can be taken for any sets, and the elements of the resulting set are pairs made up of elements of the first and second sets. Since relations are sets, a direct product can be obtained for any two relations. However, the result will not be a relation: its elements will be not tuples but pairs of tuples. Therefore, relational algebra uses a special form of the operation, the extended direct product of relations. In the extended direct product of two relations, each element of the result is a tuple formed by concatenating one tuple of the first relation with one tuple of the second relation. A second problem arises immediately: obtaining a correctly formed header for the resulting relation. This makes it necessary to introduce the concept of compatibility of relations with respect to the extended direct product.

Two relations are compatible with respect to the direct product only if their sets of attribute names do not intersect. Any two relations can be made compatible with respect to the direct product by applying a rename operation to one of them.

The restriction operation requires two operands: the relation being restricted and a simple restriction condition. The result is a relation whose header coincides with the header of the operand relation and whose body contains those tuples of the operand relation whose attribute values satisfy the restriction condition.

Let us introduce some notation.

Let union denote the union operation, intersect the intersection operation, and minus the difference operation. To denote restriction, we will use the construction A where C, where A is the operand relation and C is a simple comparison condition. Let C1 and C2 be two simple restriction conditions. Then:

A where C1 AND C2 is identical to (A where C1) intersect (A where C2)

A where C1 OR C2 is identical to (A where C1) union (A where C2)

A where C1 AND NOT C2 is identical to (A where C1) minus (A where C2)
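These identities are easy to check on sample data. In the sketch below a relation is a Python set of plain tuples, `where` plays the role of the restriction operation, and the two conditions and the rows are invented for the illustration.

```python
# A small relation with attributes (KOD, NAME, SUMM); the rows are made up.
A = {
    (1, "Kireeva", 25.50),
    (2, "Motyleva", 17.05),
    (3, "Ivanova", 40.00),
}


def where(relation, condition):
    """Restriction: the tuples of the relation that satisfy the condition."""
    return {t for t in relation if condition(t)}


def c1(t):
    return t[2] > 20  # SUMM > 20


def c2(t):
    return t[0] > 2   # KOD > 2


and_holds = where(A, lambda t: c1(t) and c2(t)) == where(A, c1) & where(A, c2)
or_holds = where(A, lambda t: c1(t) or c2(t)) == where(A, c1) | where(A, c2)
minus_holds = where(A, lambda t: c1(t) and not c2(t)) == where(A, c1) - where(A, c2)
print(and_holds, or_holds, minus_holds)
```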

Using these definitions, you can implement restriction operations in which the condition is an arbitrary logical expression made up of simple conditions using the logical connectives AND, OR and NOT.

The projection of relation A onto the list of attributes a1, a2, …, an is a relation whose header is the set of attributes a1, a2, …, an. The body of the result consists of tuples <b1, b2, …, bn> such that relation A contains a tuple in which attribute a1 has the value b1, attribute a2 has the value b2, and so on up to attribute an with the value bn. In essence, projection cuts a "vertical" slice out of the operand relation and removes any duplicate tuples that arise.

The join operation, sometimes called a conditional join, requires two operands, the relations being joined, and a third operand, a simple condition. Let relations A and B be joined. As with the restriction operation, the join condition C has the form (a comp-op b) or (a comp-op const), where a and b are names of attributes of relations A and B, const is a literal constant, and comp-op is a comparison operation that is valid in this context. Then, by definition, the result of the join is the relation obtained by restricting the direct product of A and B by condition C.

There is an important special case of the join, the equijoin, and a simple but important extension of it, the natural join. A join is called an equijoin if the join condition has the form (a = b), where a and b are attributes of different operands. This case is important because it is particularly common in practice and efficient algorithms exist for implementing it in a DBMS. The natural join is applied to a pair of relations A and B that have a common attribute P, that is, an attribute with the same name defined on the same domain. Let ab denote the union of the headers of relations A and B. Then the natural join of A and B is the result of joining A and B on the condition A.P = B.P, projected onto ab. The natural join is not included directly in the set of relational algebra operations, but it is of great practical importance.
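The definition of a join as a restricted direct product translates directly into code. The sketch below is a simplification under stated assumptions: rows are dictionaries kept in lists, so duplicate tuples are not removed as they would be in a true relation, and the function names and sample data are invented.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def product(r: List[Row], s: List[Row]) -> List[Row]:
    """Extended direct product; attribute names of r and s must not overlap."""
    return [{**t, **u} for t in r for u in s]


def join(r: List[Row], s: List[Row], cond: Callable[[Row], bool]) -> List[Row]:
    """Conditional join: restrict the direct product by the condition."""
    return [t for t in product(r, s) if cond(t)]


def natural_join(r: List[Row], s: List[Row], common: str) -> List[Row]:
    """Equijoin on one common attribute, keeping a single copy of it."""
    alias = common + "_2"  # rename so that the two headers do not intersect
    renamed = [{(alias if k == common else k): v for k, v in u.items()} for u in s]
    joined = join(r, renamed, lambda t: t[common] == t[alias])
    # Project away the renamed duplicate of the common attribute.
    return [{k: v for k, v in t.items() if k != alias} for t in joined]


students = [
    {"stud_nom": 1, "stud_name": "Kireeva", "gr_nom": 101},
    {"stud_nom": 2, "stud_name": "Motyleva", "gr_nom": 102},
]
groups = [
    {"gr_nom": 101, "gr_col": 30},
    {"gr_nom": 102, "gr_col": 20},
    {"gr_nom": 103, "gr_col": 15},  # no students here, so it drops out of the join
]

result = natural_join(students, groups, "gr_nom")
print(result)
```

Building the full product first is the textbook definition rather than an efficient method; a real DBMS would normally use an index or a hash table instead.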

The division of relations needs a more detailed explanation because it is difficult to understand. Let two relations be given: A (a1, a2, …, an, b1, b2, …, bm) and B (b1, b2, …, bm). We assume that attribute bi of relation A and attribute bi of relation B are defined on the same domain. Let us call the set of attributes {aj} the composite attribute a, and the set of attributes {bj} the composite attribute b. We can then speak of the relational division of the binary relation A (a, b) by the unary relation B (b).

The result of dividing A by B is a unary relation C (a) consisting of tuples v such that relation A contains tuples <v, w> whose set of values {w} includes the set of values of b in relation B.

Since division is the most difficult operation, let us illustrate it with an example. Suppose the student database contains two relations, STUDENTS (FULL NAME, NUMBER) and NAMES (FULL NAME), where the unary relation NAMES contains all the surnames that students of the institute have. Then dividing the STUDENTS relation by the NAMES relation yields a unary relation containing the student card numbers that are paired with every surname found at this institute.
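Division is easier to follow in code. In this sketch the binary relation is a set of (number, name) pairs and the unary relation is a set of names. The sample data deliberately treats NUMBER as a value shared by several students (for instance a group number); that is an assumption made only so that the result is not empty.

```python
from typing import Hashable, Set, Tuple


def divide(a: Set[Tuple[Hashable, Hashable]], b: Set[Hashable]) -> Set[Hashable]:
    """Return every v such that (v, w) is in a for each w in b."""
    candidates = {v for v, _ in a}
    return {
        v
        for v in candidates
        if b <= {w for v2, w in a if v2 == v}  # the values paired with v cover all of b
    }


# (NUMBER, FULL NAME) pairs; the data is made up for the illustration.
students = {
    (101, "Kireeva"),
    (101, "Motyleva"),
    (102, "Kireeva"),
}
names = {"Kireeva", "Motyleva"}

quotient = divide(students, names)
print(quotient)
```

Number 102 is left out because it is paired only with Kireeva, not with every name in `names`.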


Relational notation

Suppose there is a database containing the relation STUDENTS (student_number, student_name, scholarship, group_code) and the relation GROUPS (gr_nom, gr_col, gr_star), where gr_col is the number of students in the group and gr_star is the student number of the group's prefect. Suppose you need to find the names and student card numbers of the students who are prefects of groups with more than 25 people. In relational algebra, a query like this requires the following steps:

1. Join the relations STUDENTS and GROUPS on the condition student_number = gr_star.

2. Restrict the resulting relation by the condition gr_col > 25.

3. Project the result of the previous operation onto the attributes student_name and student_number.

This is a step-by-step formulation of how the query is executed, in which each step corresponds to one relational operation. If we formulate the same query using relational calculus, we get a formula that can be read as: return student_name and student_number for those students for whom there exists a group with the same value of gr_star and a value of gr_col greater than 25. In the second formulation we stated only the characteristics of the resulting relation and said nothing about how it is formed. In this case the DBMS itself must decide which operations to perform on the STUDENTS and GROUPS relations, and in what order. The two approaches in the example are in fact equivalent, and there are fairly simple rules for converting one into the other.
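The two formulations can be placed side by side in Python. This is only an analogy under stated assumptions: the sample rows are invented, only the attributes needed for the query are included, and a Python comprehension is of course still executed step by step, so the second form merely reads like the calculus formula.

```python
students = [
    {"student_number": 1, "student_name": "Kireeva"},
    {"student_number": 2, "student_name": "Motyleva"},
    {"student_number": 3, "student_name": "Ivanova"},
]
groups = [
    {"gr_nom": 101, "gr_col": 30, "gr_star": 1},  # prefect is student 1
    {"gr_nom": 102, "gr_col": 20, "gr_star": 2},  # prefect is student 2
]

# Algebraic formulation: join, then restrict, then project.
joined = [
    {**s, **g}
    for s in students
    for g in groups
    if s["student_number"] == g["gr_star"]
]
restricted = [t for t in joined if t["gr_col"] > 25]
algebra_result = {(t["student_name"], t["student_number"]) for t in restricted}

# Calculus-style formulation: describe the wanted tuples with "there exists a group".
calculus_result = {
    (s["student_name"], s["student_number"])
    for s in students
    if any(
        g["gr_star"] == s["student_number"] and g["gr_col"] > 25
        for g in groups
    )
}
print(algebra_result, calculus_result)
```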

The basic concepts of relational calculus are the concept of a variable with a defined range of values, and the concept of a well-formed formula built from variables and special functions. Depending on what the range of a variable is, a distinction is made between tuple calculus and domain calculus. In tuple calculus, the ranges of variables are database relations, that is, a valid value of each variable is a tuple of some relation. In domain calculus, the ranges of variables are the domains on which the attributes of database relations are defined, that is, a valid value of each variable is a value of some domain.


The RANGE command is used to define tuple variables. For example, to define the variable STUDENT whose range is the relation STUDENTS, use the construction RANGE STUDENT IS STUDENTS. It follows from this definition that at any moment the variable STUDENT represents some tuple of the STUDENTS relation. When tuple variables are used in formulas, you can refer to the attribute values of a variable. For example, to refer to the value of the STUDENT_NAME attribute of the variable STUDENT, use the construction STUDENT.STUDENT_NAME.

Well-formed formulas are used to express conditions imposed on tuple variables. Such formulas are built from simple comparisons, which compare the values of attributes of variables and literal constants. For example, the construction STUDENT.STUD_NOM = 123456 is a simple comparison. More complex formulas are formed using the logical connectives AND, OR, NOT and IF … THEN. Finally, well-formed formulas can be constructed using quantifiers. If F is a well-formed formula involving the variable var, then the constructions EXISTS var (F) (existential quantifier) and FORALL var (F) (universal quantifier, for all tuples) are also well-formed formulas.

Variables in well-formed formulas can be free or bound. All variables in a formula built without quantifiers are free. This means that if the formula evaluates to true for some set of values of its free tuple variables, then those values can be included in the resulting relation. If a quantifier is applied to a variable when the formula is built, that variable is bound. When such a formula is evaluated, not a single value of the bound variable is used but its entire range.

Let STUD1 and STUD2 be two tuple variables defined on the relation STUDENTS, and consider two formulas:

1) EXISTS STUD2 (STUD1.STUD_STIP > STUD2.STUD_STIP)

2) FORALL STUD2 (STUD1.STUD_STIP > STUD2.STUD_STIP)

For the current tuple of the variable STUD1, formula 1 is true only if the STUDENTS relation contains a tuple (bound to the variable STUD2) whose STUD_STIP value satisfies the inner comparison. Formula 2 is true for the current tuple of STUD1 only if the STUD_STIP values of all tuples of the STUDENTS relation (bound to the variable STUD2) satisfy the inner comparison.
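Python's built-in `any` and `all` behave like the two quantifiers, which makes the formulas easy to try out. The scholarship values below are invented. One detail worth noticing: since STUD2 ranges over the whole relation, including the tuple currently bound to STUD1, formula 2 with a strict `>` cannot be true for any tuple.

```python
students = [
    {"stud_name": "Kireeva", "stud_stip": 25.50},
    {"stud_name": "Motyleva", "stud_stip": 17.05},
    {"stud_name": "Ivanova", "stud_stip": 40.00},
]


def formula_1(stud1):
    """EXISTS STUD2 (STUD1.STUD_STIP > STUD2.STUD_STIP)"""
    return any(stud1["stud_stip"] > stud2["stud_stip"] for stud2 in students)


def formula_2(stud1):
    """FORALL STUD2 (STUD1.STUD_STIP > STUD2.STUD_STIP)"""
    return all(stud1["stud_stip"] > stud2["stud_stip"] for stud2 in students)


# STUD1 is free: the result collects the tuples for which the formula is true.
exists_result = [s["stud_name"] for s in students if formula_1(s)]
forall_result = [s["stud_name"] for s in students if formula_2(s)]
print(exists_result, forall_result)
```

Replacing `>` with `>=` in `formula_2` would select the student with the highest scholarship, which is probably closer to what such a query is meant to find.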

Thus, well-formed formulas provide a means of expressing the conditions for selecting data from database relations. To use relational calculus for real work with a database, one more component is needed, which determines the set and names of the columns of the resulting relation. This component is called the target list.

An element of the target list takes one of the following forms:

· var.attr, where var is the name of a free variable and attr is the name of an attribute of the relation on which var is defined.

· var, which is equivalent to the list var.attr1, var.attr2, …, var.attrN that includes the names of all attributes of the defining relation.

· new_name = var.attr, where new_name is the new name of the corresponding attribute of the resulting relation.

The last option is required when the formula uses several free variables with the same range.

In domain calculus, the ranges of variables are not relations but domains. For the STUDENTS-GROUPS database, we can speak of domain variables such as NAME (whose values are valid names) or NOM_STUD (whose values are valid student numbers).

The main difference between domain calculus and tuple calculus is the presence of an additional set of predicates that make it possible to express so-called membership conditions. If R is an n-ary relation with attributes (a1, a2, …, an), then a membership condition has the form R (ai1:Vi1, ai2:Vi2, …, aim:Vim), where m <= n and each Vij is either a literal constant or the name of a domain variable. A membership condition evaluates to true only if relation R contains a tuple with the specified values of the specified attributes. If Vij is a constant, a fixed condition is imposed on attribute aij that does not depend on the current values of the domain variables. If Vij is the name of a domain variable, the membership condition can take different values for different values of that variable.

A predicate is a logical function that returns true or false for given arguments. A relation can be regarded as a predicate whose arguments are the attributes of the relation. If a given tuple is present in the relation, the predicate yields true; otherwise it yields false.

In all other respects, the formulas and expressions of domain calculus look like those of tuple calculus. Domain relational calculus is the basis of most form-based query languages.
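A membership condition, and the idea of a relation as a predicate, can be sketched as a function that asks whether a matching tuple exists. The function name `member`, the sample relation and the small name domain are assumptions made for the illustration.

```python
students = [
    {"stud_nom": 1, "stud_name": "Kireeva"},
    {"stud_nom": 2, "stud_name": "Motyleva"},
]


def member(relation, **conditions):
    """Membership condition R(a1:V1, ..., am:Vm): true if some tuple of the
    relation has exactly these values for the listed attributes."""
    return any(
        all(t[attr] == value for attr, value in conditions.items())
        for t in relation
    )


# All values given as constants: a fixed condition.
fixed_true = member(students, stud_nom=1, stud_name="Kireeva")
fixed_false = member(students, stud_nom=1, stud_name="Motyleva")

# One value supplied by a domain variable ranging over a (hypothetical) domain of names.
name_domain = ["Kireeva", "Motyleva", "Petrova"]
present = [name for name in name_domain if member(students, stud_name=name)]
print(fixed_true, fixed_false, present)
```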




As a rule, any web application can be divided into 2 main parts: the front end, where all the site information is displayed, and the back end, where this information is generated and placed. In this article we will talk about what relational databases are and how to design them.

The database stores records in a specially organized way so that information can be easily found and retrieved. Any database consists of one or more tables. A table consists of rows and columns. All rows have the same columns, and each column contains data. For a better understanding, you can think of the tables in a database as very similar to those you have seen in Excel.

Tabular data can be inserted, retrieved, updated and deleted. The abbreviation CRUD (Create-Read-Update-Delete) was coined for this set of operations.

Relational databases are databases where all information is stored in tables connected to each other by special relationships. These relationships allow us to retrieve and join data from one or more tables using a single query.

But these are all just words. To really understand what relational databases are, you need to practice more. Let's get started and see what data we have to work with.

Step 1: Data Preparation

In order for us to have something to work with, I typed the query “#databases” on Twitter and created a table of 10 records:

Table 1

full_name username text created_at following_username
Boris Hadjur _DreamLead Scootmedia, MetiersInternet
Gunnar Svalander GunnarSvalander klout, zillow
GE Software GEsoftware DayJobDoc, byosko
Adrian Burch adrianburch Cindy Crawford, Arjantim
Andy Ryder AndyRyder5 Michael Dell, Yahoo
Andy Ryder AndyRyder5 Michael Dell, Yahoo
Brett Englebert Brett_Englebert
Brett Englebert Brett_Englebert RealSkipBayless, stephenasmith
Nimbus Data Systems NimbusData dellock6, rohitkilam
SSWUG.ORG SSWUGorg drsql, steam_games

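First of all, let's look at the columns:

· full_name: the user's full name
· username: the user's Twitter login
· text: the text of the tweet
· created_at: the time the tweet was created
· following_username: the users this person follows, separated by commas

This is real data. If you want, you can find it and update it.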

Fine. Now all our data is in one place. Does this allow us to search it easily? Not really. This table is far from ideal. First, we have duplicate entries in some columns: for example, in “username” and “following_username”. Also, the “following_username” column violates the rules of the relational model, because there is more than one value in each cell (entries are separated by commas).

In addition, we come across duplicates in the rows.

Duplicate data is a real problem because it makes CRUD operations difficult. For example, when searching through this table, processing duplicates will take additional time. In addition, if the user updates a tweet, we will need to overwrite all duplicates.

The solution to this problem is to split Table 1 into several tables. Let's start with the first problem, namely eliminating duplicates in columns.

Step 2. Get rid of duplicates in columns

As stated above, the “username” and “following_username” columns contain duplicate data. They came about because I wanted to show the relationship between tweets and users. Let's improve our database structure by dividing the existing table into two: one will store information, and the other will store relationships between records.

Since @Brett_Englebert follows @RealSkipBayless, we will record this in the “following” table as follows: we will place the name @Brett_Englebert in the “from_user” column, and @RealSkipBayless in the “to_user” column. Let's see what the “following” table looks like after splitting Table 1:

Table 2. following

from_user to_user
_DreamLead Scootmedia
_DreamLead MetiersInternet
GunnarSvalander klout
GunnarSvalander zillow
GEsoftware DayJobDoc
GEsoftware byosko
adrianburch CindyCrawford
adrianburch Arjantim
AndyRyder5 MichaelDell
AndyRyder5 Yahoo
Brett_Englebert RealSkipBayless
Brett_Englebert stephenasmith
NimbusData dellock6
NimbusData rohitkilam
SSWUGorg drsql
SSWUGorg steam_games

Table 3. users

full_name username text created_at
Boris Hadjur _DreamLead What do you think about #emailing #campaigns #traffic in #USA? Is it a good market nowadays? do you have #databases? Tue, 12 Feb 2013 08:43:09 +0000
Gunnar Svalander GunnarSvalander Bill Gates Talks Databases, Free Software on Reddit http://t.co/ShX4hZlA #billgates #databases Tue, 12 Feb 2013 07:31:06 +0000
GE Software GEsoftware RT @KirkDBorne: Readings in #Databases: excellent reading list, many categories: http://t.co/S6RBUNxq via @rxin Fascinating. Tue, 12 Feb 2013 07:30:24 +0000
Adrian Burch adrianburch RT @tisakovich: @NimbusData at the @Barclays Big Data conference in San Francisco today, talking #virtualization, #databases, and #flash memory. Tue, 12 Feb 2013 06:58:22 +0000
Andy Ryder AndyRyder5 http://t.co/D3KOJIvF article about Madden 2013 using AI to predict the super bowl #databases #bus311 Tue, 12 Feb 2013 05:29:41 +0000
Andy Ryder AndyRyder5 http://t.co/rBhBXjma an article about privacy settings and facebook #databases #bus311 Tue, 12 Feb 2013 05:24:17 +0000
Brett Englebert Brett_Englebert #BUS311 University of Minnesota’s NCFPD is creating #databases to prevent “food fraud.” http://t.co/0LsAbKqJ Tue, 12 Feb 2013 01:49:19 +0000
Brett Englebert Brett_Englebert #BUS311 companies might be protecting their production #databases, but what about their backup files? http://t.co/okJjV3Bm Tue, 12 Feb 2013 01:31:52 +0000
Nimbus Data Systems NimbusData @NimbusData CEO @tisakovich @BarclaysOnline Big Data conference in San Francisco today, talking #virtualization, #databases,& #flash memory Mon, 11 Feb 2013 23:15:05 +0000
SSWUG.ORG SSWUGorg Don’t forget to sign up for our FREE expo this Friday: #Databases, #BI, and #Sharepoint: What You Need to Know! http://t.co/Ijrqrz29 Mon, 11 Feb 2013 22:15:37 +0000

Already better. Now in the “users” table (Table 3) we only store information about tweets, and in the following table (Table 2) we store user dependencies.

The founder of relational database theory, Edgar Codd, would call this process (removing duplicates from table columns) bringing the database to first normal form.
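The split of the comma-separated “following_username” cells into one row per relationship can be sketched in a few lines of Python. The function name `split_following` and the two sample rows (taken from Table 1) are chosen here for illustration only.

```python
# Two rows shaped like Table 1 (only the columns relevant here).
table1 = [
    {"username": "_DreamLead", "following_username": "Scootmedia, MetiersInternet"},
    {"username": "GunnarSvalander", "following_username": "klout, zillow"},
]


def split_following(rows):
    """Turn each comma-separated cell into one (from_user, to_user) row."""
    following = []
    for row in rows:
        for name in row["following_username"].split(","):
            name = name.strip()
            if name:  # skip empty cells
                following.append((row["username"], name))
    return following


following = split_following(table1)
for pair in following:
    print(pair)
```

After the split every cell holds exactly one value, which is the property the text above associates with first normal form.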

Step 3: Removing Duplicates from Rows

Now we will move on to fixing the other problem, namely getting rid of duplicates in the rows of the “users” table. Since @AndyRyder5 and @Brett_Englebert each posted multiple tweets, their names in the “users” table (Table 3) are duplicated in the full_name column. This problem is also solved by splitting the “users” table.

Since the tweet text and the time it was created are unique data, we will place them in the same table. We also need to indicate the relationship between tweets and users. For this purpose I created a special column username.

Table 4. tweets

username text created_at
_DreamLead What do you think about #emailing #campaigns #traffic in #USA? Is it a good market nowadays? do you have #databases? Tue, 12 Feb 2013 08:43:09 +0000
GunnarSvalander Bill Gates Talks Databases, Free Software on Reddit http://t.co/ShX4hZlA #billgates #databases Tue, 12 Feb 2013 07:31:06 +0000
GEsoftware RT @KirkDBorne: Readings in #Databases: excellent reading list, many categories: http://t.co/S6RBUNxq via @rxin Fascinating. Tue, 12 Feb 2013 07:30:24 +0000
adrianburch RT @tisakovich: @NimbusData at the @Barclays Big Data conference in San Francisco today, talking #virtualization, #databases, and #flash memory. Tue, 12 Feb 2013 06:58:22 +0000
AndyRyder5 http://t.co/D3KOJIvF article about Madden 2013 using AI to predict the super bowl #databases #bus311 Tue, 12 Feb 2013 05:29:41 +0000
AndyRyder5 http://t.co/rBhBXjma an article about privacy settings and facebook #databases #bus311 Tue, 12 Feb 2013 05:24:17 +0000
Brett_Englebert #BUS311 University of Minnesota’s NCFPD is creating #databases to prevent “food fraud.” http://t.co/0LsAbKqJ Tue, 12 Feb 2013 01:49:19 +0000
Brett_Englebert #BUS311 companies might be protecting their production #databases, but what about their backup files? http://t.co/okJjV3Bm Tue, 12 Feb 2013 01:31:52 +0000
NimbusData @NimbusData CEO @tisakovich @BarclaysOnline Big Data conference in San Francisco today, talking #virtualization, #databases,& #flash memory Mon, 11 Feb 2013 23:15:05 +0000
SSWUGorg Don’t forget to sign up for our FREE expo this Friday: #Databases, #BI, and #Sharepoint: What You Need to Know! http://t.co/Ijrqrz29 Mon, 11 Feb 2013 22:15:37 +0000

Table 5. users

full_name username
Boris Hadjur _DreamLead
Gunnar Svalander GunnarSvalander
GE Software GEsoftware
Adrian Burch adrianburch
Andy Ryder AndyRyder5
Brett Englebert Brett_Englebert
Nimbus Data Systems NimbusData
SSWUG.ORG SSWUGorg

After the split, the users table (Table 5) contains only unique (non-repeating) rows.

This process of removing duplicates from rows is called bringing the database to second normal form.
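The same step can be expressed as a short Python sketch. The rows below are shaped like Table 3, with the tweet text shortened to placeholders; the variable names are assumptions for illustration.

```python
# Rows shaped like Table 3: (full_name, username, text, created_at).
table3 = [
    ("Andy Ryder", "AndyRyder5", "tweet A", "Tue, 12 Feb 2013 05:29:41 +0000"),
    ("Andy Ryder", "AndyRyder5", "tweet B", "Tue, 12 Feb 2013 05:24:17 +0000"),
    ("SSWUG.ORG", "SSWUGorg", "tweet C", "Mon, 11 Feb 2013 22:15:37 +0000"),
]

users = {}   # username -> full_name, one entry per user
tweets = []  # (username, text, created_at), one entry per tweet

for full_name, username, text, created_at in table3:
    users[username] = full_name                  # repeated users collapse into one entry
    tweets.append((username, text, created_at))  # username links the tweet to its author

print(users)
print(len(tweets))
```

Each user now appears once, while every tweet keeps a username that points back to its author.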

Step 4. Join tables based on keys

So, as a result of our actions, Table 1 was divided into 3 parts: following (Table 2), tweets (Table 4), users (Table 5). All duplicates have been eliminated. In order for us to be able to easily extract data from this structure in the future, we must connect tables independent from each other with special relationships that will give us information about which user owns which tweet, and who follows whom.

To create relationships between records, we need to enter a unique identifier called a primary key.

Generally speaking, in Tables 4 and 5 we have already done this. In the “users” table, the primary key is the “username” column because the username must be a unique value and cannot be repeated. In the “tweets” table, we use this key to indicate the relationship between the user and the tweet. The “username” column in the “tweets” table is called a foreign key.

If you have ever worked with databases, then you may have a question: can we use the “username” column as a primary key?

On the one hand, this can simplify the search process, because we do not use any numeric IDs. On the other hand, what if the user wants to change his login? This can lead to a huge number of problems. In order to avoid getting into a similar situation, it is better to use numeric IDs. It all depends on your system. If you provide your users with the ability to change logins, then it is better to use an auto-incrementing numeric ID field as the primary key. Otherwise, the “username” column is quite suitable for this role. I'll leave it as it is.

Let's look at the tweets table (Table 4). The primary key must be unique for each row. Which column in this table can we select for this role? The “created_at” column will not work, because in principle, 2 different users can publish a post at the same time. The “text” column is the same: two different users can create a tweet with the text “Hello World”. The “username” column in this table is a foreign key to indicate the relationship between the user and the tweet. So, since all possible options do not suit us, the best solution would be to add an id column, which will be the primary key for this table.

Table 6. tweets with id column

ID username text created_at
1 _DreamLead What do you think about #emailing #campaigns #traffic in #USA? Is it a good market nowadays? do you have #databases? Tue, 12 Feb 2013 08:43:09 +0000
2 GunnarSvalander Bill Gates Talks Databases, Free Software on Reddit http://t.co/ShX4hZlA #billgates #databases Tue, 12 Feb 2013 07:31:06 +0000
3 GEsoftware RT @KirkDBorne: Readings in #Databases: excellent reading list, many categories: http://t.co/S6RBUNxq via @rxin Fascinating. Tue, 12 Feb 2013 07:30:24 +0000
4 adrianburch RT @tisakovich: @NimbusData at the @Barclays Big Data conference in San Francisco today, talking #virtualization, #databases, and #flash memory. Tue, 12 Feb 2013 06:58:22 +0000
5 AndyRyder5 http://t.co/D3KOJIvF article about Madden 2013 using AI to predict the super bowl #databases #bus311 Tue, 12 Feb 2013 05:29:41 +0000
6 AndyRyder5 http://t.co/rBhBXjma an article about privacy settings and facebook #databases #bus311 Tue, 12 Feb 2013 05:24:17 +0000
7 Brett_Englebert #BUS311 University of Minnesota’s NCFPD is creating #databases to prevent “food fraud.” http://t.co/0LsAbKqJ Tue, 12 Feb 2013 01:49:19 +0000
8 Brett_Englebert #BUS311 companies might be protecting their production #databases, but what about their backup files? http://t.co/okJjV3Bm Tue, 12 Feb 2013 01:31:52 +0000
9 NimbusData @NimbusData CEO @tisakovich @BarclaysOnline Big Data conference in San Francisco today, talking #virtualization, #databases,& #flash memory Mon, 11 Feb 2013 23:15:05 +0000
10 SSWUGorg Don’t forget to sign up for our FREE expo this Friday: #Databases, #BI, and #Sharepoint: What You Need to Know! http://t.co/Ijrqrz29 Mon, 11 Feb 2013 22:15:37 +0000

We can do the same with the following table, because no existing column can serve as a primary key. The columns “from_user” and “to_user” are foreign keys and indicate the relationship between user subscriptions.

So, at this point we have already done a lot of things. We got rid of duplicate information in columns and rows and selected suitable columns for our tables to act as primary and foreign keys to indicate dependencies between data. This process is called normalization and is designed to bring your tables under the relational model. Thanks to normalization, we can implement CRUD operations in a simpler way.
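The resulting structure might be declared as follows, using Python's built-in sqlite3 module as a stand-in DBMS. This is only a sketch: the column sizes, the `id` column of the following table and the choice to leave `to_user` without a constraint (the followed accounts in the sample data are not all present in users) are assumptions, not part of the original design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when asked
conn.executescript("""
CREATE TABLE users (
    username  VARCHAR(100) PRIMARY KEY,
    full_name VARCHAR(100)
);
CREATE TABLE tweets (
    id         INTEGER PRIMARY KEY,
    username   VARCHAR(100) REFERENCES users(username),
    text       VARCHAR(140),
    created_at VARCHAR(40)
);
CREATE TABLE following (
    id        INTEGER PRIMARY KEY,
    from_user VARCHAR(100) REFERENCES users(username),
    to_user   VARCHAR(100)  -- no constraint: followed accounts may be missing from users
);
""")

conn.execute("INSERT INTO users VALUES ('AndyRyder5', 'Andy Ryder')")
conn.executemany(
    "INSERT INTO tweets (username, text, created_at) VALUES (?, ?, ?)",
    [
        ("AndyRyder5", "tweet A", "Tue, 12 Feb 2013 05:29:41 +0000"),
        ("AndyRyder5", "tweet B", "Tue, 12 Feb 2013 05:24:17 +0000"),
    ],
)

# The foreign key lets a single query combine data from both tables.
rows = conn.execute(
    "SELECT u.full_name, t.id, t.text FROM users u "
    "JOIN tweets t ON t.username = u.username ORDER BY t.id"
).fetchall()
print(rows)
```

The join at the end is the kind of query the keys make possible: it answers "which user owns which tweet" without storing the full name in every tweet row.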

Below you can see a diagram of our tables and the relationships between them:

Database Management Systems

Now that we have a relational database, how can we implement it? To do this, we can use database management systems (DBMS). There is a whole range of similar programs, both paid and free. Paid ones include Oracle Database, IBM DB2 and Microsoft SQL Server. Free: MySQL, SQLite and PostgreSQL.

Most often, various companies use MySQL. Twitter in this sense is no exception.

SQLite is more often used when developing applications for iOS and Android, where various types of confidential information are stored. Google Chrome browser uses SQLite to store browsing history, cookies, images...

PostgreSQL is used less frequently. There is a useful PostGIS extension for it, which makes this DBMS convenient for storing geolocation data. For example, the OpenStreetMap service uses PostgreSQL.

Structured Query Language (SQL)

Once you have chosen the right DBMS for you and installed it, the next step would be to create tables and manage the data. To do this, we can use a special language called SQL.

Creating a development database:

CREATE DATABASE development;

Creating the Users table:

CREATE TABLE users (full_name VARCHAR(100), username VARCHAR(100));

When creating fields, we need to specify the type of information to be stored and its size. The “full_name” and “username” columns will be of type VARCHAR, which is designed to store character strings, with a size of 100 characters. A list of all available types can be found in the documentation of your DBMS.

Adding an entry:

INSERT INTO users (full_name, username) VALUES ('Boris Hadjur', '_DreamLead');

Retrieving all posts by user _DreamLead:

Post update:

Deleting an entry:
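The statements for the three operations just named are not shown above. As a hedged sketch of what they might look like, the following runs a SELECT, an UPDATE and a DELETE through Python's built-in sqlite3 module; the shape of the tweets table and the sample values are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tweets (id INTEGER PRIMARY KEY, username VARCHAR(100), text VARCHAR(140))"
)
conn.execute(
    "INSERT INTO tweets (username, text) VALUES (?, ?)",
    ("_DreamLead", "do you have #databases?"),
)

# Read: all posts by user _DreamLead.
selected = conn.execute(
    "SELECT id, text FROM tweets WHERE username = ?", ("_DreamLead",)
).fetchall()

# Update: change the text of the post with id 1.
conn.execute("UPDATE tweets SET text = ? WHERE id = ?", ("edited text", 1))
updated = conn.execute("SELECT text FROM tweets WHERE id = 1").fetchone()[0]

# Delete: remove the post with id 1.
conn.execute("DELETE FROM tweets WHERE id = ?", (1,))
remaining = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]

print(selected, updated, remaining)
```

The `?` placeholders are sqlite3's way of passing values separately from the SQL text; other DBMS drivers use their own placeholder syntax.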

SQL is very similar to human language (English). Each SQL DBMS has a number of its own features and differences, but in general, all varieties of SQL are similar to each other.

Bottom line

In this lesson, we looked at the process of creating a relational database, took a set of data and distributed them into tables according to the relational model. We also took a quick look at existing DBMSs and the SQL language.

Relational Database - Basic Concepts

Often, when talking about a database, they simply mean some automated data storage. This idea is not entirely correct. Why this is so will be shown below.

Indeed, in the narrow sense of the word, a database is a certain set of data necessary for work (up-to-date data). However, data is an abstraction; no one has ever seen “just data”; it does not arise or exist on its own. Data is a reflection of objects in the real world. Suppose, for example, that you want to store information about parts received at the warehouse. How will a real-world object, a part, be represented in the database? To answer this question, you need to know which features or aspects of the part are relevant and necessary for the job. These may include the name of the part, its weight, dimensions, color, date of manufacture, the material from which it is made, etc. In traditional terminology, real-world objects about which information is stored in a database are called entities (don’t let this word scare you; it is a generally accepted term), and their relevant characteristics are called attributes.

Each characteristic of a specific object is an attribute value. Thus, the “engine” part has a weight attribute value of 50, which reflects the fact that this engine weighs 50 kilograms.

It would be a mistake to think that only physical objects are reflected in the database. It is capable of absorbing information about abstractions, processes, phenomena - that is, about everything that a person encounters in his activities. For example, in a database you can store information about orders for the supply of parts to a warehouse (although it is not a physical object, but a process). The attributes of the "order" entity will be the name of the part being supplied, the number of parts, the name of the supplier, delivery time, etc.

Objects in the real world are connected to each other by many complex dependencies that must be taken into account in information activities. For example, parts are supplied to the warehouse by their manufacturers. Therefore, it is necessary to include the “manufacturer’s name” attribute among the part attributes. However, this is not enough, since additional information about the manufacturer of a particular part may be needed - his address, telephone number, etc. This means that the database must contain not only information about parts and purchase orders, but also information about their manufacturers. Moreover, the database must reflect the relationships between parts and manufacturers (each part is produced by a specific manufacturer) and between orders and parts (each order is issued for a specific part). Note that only relevant, significant connections need to be stored in the database.

Thus, in the broad sense of the word, a database is a set of descriptions of real-world objects and connections between them that are relevant for a specific application area. In what follows, we will proceed from this definition, clarifying it as we go along.

Relational data model

So now we have an idea of what is stored in the database. Now we need to understand how entities, attributes, and relationships map to data structures. This is determined by the data model.

Traditionally, all DBMSs are classified depending on the data model that underlies them. It is customary to distinguish between hierarchical, network and relational data models. Sometimes they are supplemented with a data model based on inverted lists. Accordingly, they talk about hierarchical, network, relational DBMS or DBMS based on inverted lists.

In terms of prevalence and popularity, relational DBMSs today are unrivaled. They have become a de facto industrial standard, and therefore the domestic user will have to deal with a relational DBMS in their practice. Let's briefly look at the relational data model without delving into its details.

It was developed by Codd back in 1969-70 on the basis of the mathematical theory of relations and is based on a system of concepts, the most important of which are table, relation, row, column, primary key, foreign key.

A relational database is one in which all data is presented to the user in the form of rectangular tables of data values, and all operations on the database are reduced to manipulations with tables. A table consists of rows and columns and has a name that is unique within the database. The table reflects a type of real-world object (entity), and each of its rows represents a specific object. Thus, the Part table contains information about all parts stored in the warehouse, and its rows are sets of attribute values for specific parts. Each table column is a collection of values for a specific attribute of an object. So, the Material column represents a set of values "Steel", "Tin", "Zinc", "Nickel", etc. The Quantity column contains non-negative integers. The values in the Weight column are real numbers equal to the weight of the part in kilograms.

These values don't appear out of thin air. They are selected from the set of all possible values for an object attribute, which is called the domain. Thus, the values in the Material column are selected from the set of names of all possible materials: plastics, wood, metals, etc. Therefore, it is fundamentally impossible for a value to appear in the Material column that does not exist in the corresponding domain, for example, “water” or “sand”.

Each column has a name, which is usually written at the top of the table (Fig. 1). It must be unique within the table, but different tables can have columns with the same name. Any table must have at least one column; the columns are arranged in the table in the order in which their names appeared when it was created. Unlike columns, rows do not have names; their order in the table is not defined, and their number is logically unlimited.

Figure 1. Basic database concepts.

Since the rows in the table are not ordered, it is impossible to select a row by its position: there is no "first", "second", or "last" among them. Any table has one or more columns whose values uniquely identify each of its rows. This column (or combination of columns) is called a primary key. In our example, each part in the warehouse has a single number, by which the necessary information is retrieved from the Part table. Therefore, in this table, the primary key is the Part Number column. There cannot be duplicate values in this column: no two rows in the Part table may have the same value in the Part Number column. If a table satisfies this requirement, it is called a relation.

The relationship between tables is the most important element of the relational data model. It is supported by foreign keys. Let's consider an example in which a database stores information about ordinary employees (Employee table) and managers (Manager table) in some organization (Fig. 2). The primary key of the Manager table is the Number column (for example, personnel number). The Last Name column cannot serve as a primary key, since two managers with the same last name can work in the same organization. Any employee is subordinate to a single manager, which must be reflected in the database. The Employee table contains a Manager Number column, and the values in this column are selected from the Number column of the Manager table (see Fig. 2). The Manager Number column is a foreign key in the Employee table.

Figure 2. Relationship between database tables.

Tables cannot be stored and processed if there is no "data about data" in the database, such as descriptors of tables, columns, etc. This is usually called metadata. Metadata is also presented in tabular form and stored in a data dictionary.

In addition to tables, the database can store other objects, such as displays, reports, views, and even application programs that work with the database.

For users of an information system, it is not enough for the database to simply reflect real-world objects. It is important that such a reflection is unambiguous and consistent. In this case, the database is said to satisfy the integrity condition.

In order to guarantee the correctness and mutual consistency of data, certain restrictions are imposed on the database, which are called data integrity constraints.

There are several types of integrity constraints. It is required, for example, that the values in a table column be selected only from the corresponding domain. In practice, more complex integrity constraints are also taken into account, for example, referential integrity. Its essence is that a foreign key cannot point to a non-existent row in the table. Integrity constraints are implemented using special means, which will be discussed in the section “Database server”.
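Referential integrity can be demonstrated with a small sketch that uses Python's built-in sqlite3 module as a stand-in DBMS. The Manager and Employee tables follow the example above; the column names without spaces and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when asked
conn.executescript("""
CREATE TABLE Manager (Number INTEGER PRIMARY KEY, LastName TEXT);
CREATE TABLE Employee (
    Number        INTEGER PRIMARY KEY,
    LastName      TEXT,
    ManagerNumber INTEGER NOT NULL REFERENCES Manager(Number)
);
INSERT INTO Manager VALUES (1, 'Ivanov');
INSERT INTO Employee VALUES (10, 'Petrov', 1);
""")

# Manager 99 does not exist, so the foreign key would point at a missing row.
try:
    conn.execute("INSERT INTO Employee VALUES (11, 'Sidorov', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

employee_count = conn.execute("SELECT COUNT(*) FROM Employee").fetchone()[0]
print(rejected, employee_count)
```

The DBMS itself refuses the row that would break the link, so the application does not have to check for a matching manager before every insert.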

SQL language

The data itself in computer form is of no interest to the user if there are no means of accessing it. Data is accessed in the form of database queries that are formulated in a standard query language. Today, for most DBMSs, this language is SQL.

The emergence and development of this language as a means of describing database access is associated with the creation of the theory of relational databases. The prototype of the SQL language arose in the 1970s as part of the System/R research project, work on which was carried out at the Santa Teresa laboratory of IBM. Nowadays SQL is the standard interface to relational DBMSs. Its popularity is so great that developers of non-relational DBMSs (for example, Adabas) supply their systems with an SQL interface.

The SQL language has an official standard, ANSI/ISO. Most DBMS developers adhere to this standard, but often extend it to implement new data processing capabilities. The new data management mechanisms that will be described in the section “Database server” can only be used through special SQL statements, which are generally not included in the language standard.

SQL is not a traditional programming language. It is used to write not programs but queries to the database. That is why SQL is a declarative language. This means that it can be used to formulate what needs to be obtained, but it cannot indicate how it should be done. In particular, unlike procedural programming languages (C, Pascal, Ada), the SQL language does not have operators such as if-then-else, for, while, etc.

We will not go into detail about the syntax of the language. We will touch upon it only to the extent necessary to understand simple examples. With their help, the most interesting data processing mechanisms will be illustrated.

A SQL query consists of one or more statements, one after the other, separated by a semicolon. Table 1 below lists the most important operators that are included in the ANSI/ISO SQL standard.

Table 1. Basic SQL operators.

SQL queries use names that uniquely identify database objects. In particular, these are the table name (Part), the column name (Name), as well as the names of other objects in the database that belong to additional types (for example, names of procedures and rules), which will be discussed in the section “Database server”. Along with simple names, complex names are also used: for example, a qualified column name specifies the name of the column and the name of the table to which it belongs (Part.Weight). For simplicity, in the examples, names will be written in Russian, although in practice this is not recommended.

Each column in any table stores specific types of data. There are basic data types (fixed-length character strings, integers and real numbers) and additional data types (variable-length character strings, monetary units, date and time, logical data with the two values "TRUE" and "FALSE"). In SQL, you can use numeric, string, character, and date and time constants.

Let's look at a few examples.

The query “determine the number of parts in stock for all types of parts” is implemented as follows:

SELECT Name, Quantity

FROM Part;

The result of the query will be a table with two columns, Name and Quantity, which are taken from the original Part table. Essentially, this query allows you to get a vertical projection of the original table (more strictly, a vertical subset of the set of table rows). From all rows of the Part table, rows are formed that include values taken from two columns, Name and Quantity.

The query “What steel parts are in stock?” formulated in SQL looks like this:

SELECT *

FROM Part

WHERE Material = 'Steel';

The result of this query will also be a table containing only those rows of the source table that have the value "Steel" in the Material column. This query allows you to get a horizontal projection of the Part table (the asterisk in the SELECT statement means selecting all columns from the table).

The request “to determine the name and quantity of parts in stock that are made of plastic and weigh less than five kilograms” will be written as follows:

SELECT Name, Quantity

FROM Part

WHERE Material = 'Plastic'

AND Weight < 5;

The result of the query is a table with two columns, Name and Quantity, which contains the name and number of parts made of plastic and weighing less than 5 kg. In essence, the selection operation first forms a horizontal projection (find all rows of the Part table for which Material = 'Plastic' and Weight < 5) and then a vertical projection (extract Name and Quantity from the previously selected rows).
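To see the horizontal and vertical projection at work, the query can be run against a tiny Part table with Python's built-in sqlite3 module. The part numbers, names and weights below are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Part (Number TEXT PRIMARY KEY, Name TEXT, "
    "Material TEXT, Weight REAL, Quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO Part VALUES (?, ?, ?, ?, ?)",
    [
        ("P1", "Casing", "Plastic", 2.5, 40),  # plastic and light: selected
        ("P2", "Panel", "Plastic", 7.0, 12),   # plastic but too heavy
        ("P3", "Bolt", "Steel", 0.1, 500),     # light but not plastic
    ],
)

# WHERE keeps matching rows; the SELECT list keeps two columns.
rows = conn.execute(
    "SELECT Name, Quantity FROM Part WHERE Material = 'Plastic' AND Weight < 5"
).fetchall()
print(rows)
```

Only the first row satisfies both conditions, and only its Name and Quantity values are returned.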

One of the tools that provides quick access to tables is indexes. An index is a database structure that is a pointer to a specific row in a table. A database index is used in the same way as an index in a book. It contains values taken from one or more columns of a particular table row and a reference to that row. The values in the index are ordered, allowing the DBMS to quickly search the table.

Let's assume that a query is formulated to the Warehouse database:

SELECT Name, Quantity, Material

FROM Part

WHERE Number = 'T145-A8';

If there are no indexes for a given table, then to execute this query the DBMS must scan the entire Part table, sequentially selecting rows from it and checking the selection condition for each of them. For large tables, such a query will take a very long time to complete.

If an index was previously created on the Number column of the Part table, then the search time in the table will be reduced to a minimum. The index will contain the values from the Number column and a link to the row with this value in the Part table. When executing a query, the DBMS will first find the value “T145-A8” in the index (and do this quickly, since the index is ordered and its rows are small), and then, using the link in the index, determine the physical location of the row being sought.

An index is created with the SQL CREATE INDEX statement. In this example, the statement

CREATE UNIQUE INDEX Part_index

ON Part(Number);

will create an index named Part_index on the Number column of the Part table.
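Whether a DBMS actually uses an index can usually be checked by asking it for the query plan. The sketch below does this in SQLite through Python's built-in sqlite3 module and its EXPLAIN QUERY PLAN statement; the index name Part_index (written without a space so that it is a valid identifier) and the sample row are assumptions, and the exact wording of the plan differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Part (Number TEXT, Name TEXT, Material TEXT, Quantity INTEGER)")
conn.execute("INSERT INTO Part VALUES ('T145-A8', 'Gear', 'Steel', 7)")

query = "SELECT Name, Quantity, Material FROM Part WHERE Number = 'T145-A8'"


def plan(sql):
    """Return SQLite's description of how it would execute `sql`."""
    # The last column of each EXPLAIN QUERY PLAN row holds the readable detail.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


before = plan(query)  # without an index the whole table is scanned
conn.execute("CREATE UNIQUE INDEX Part_index ON Part(Number)")
after = plan(query)   # with the index the plan mentions Part_index

print(before)
print(after)
```

Other DBMSs offer similar facilities under different names, so the exact statement should be looked up for the system in use.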

For a DBMS user, it is not the individual SQL statements that are of interest, but a certain sequence of them, designed as a single whole and making sense from his point of view. Each such sequence of SQL statements implements a specific action on the database. It is carried out in several steps, at each of which certain operations are performed on the database tables. Thus, in the banking system, the transfer of a certain amount from a short-term account to a long-term account is carried out in several operations. Among them are withdrawing an amount from a short-term account and crediting it to a long-term account.

If there is a failure during this action, for example, when the first operation is completed but the second is not, then the money will be lost. Therefore, any action on the database must be performed entirely, or not performed at all. This action is called a transaction.

Transaction processing relies on a log, which is used to roll back transactions and restore the state of the database. Transactions will be discussed in more detail in the section “Transaction Processing”.
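The all-or-nothing behaviour of the bank transfer can be sketched with Python's built-in sqlite3 module. The account table, the CHECK constraint that forbids negative balances and the `transfer` helper are assumptions made for the illustration; the credit is deliberately executed first so that a failing debit leaves something to roll back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO account VALUES (?, ?)", [("short_term", 100), ("long_term", 0)]
)
conn.commit()


def transfer(amount):
    """Move `amount` from short_term to long_term as one transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = 'long_term'",
                (amount,),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = 'short_term'",
                (amount,),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the credit made above has been rolled back


def balances():
    return dict(conn.execute("SELECT name, balance FROM account"))


print(transfer(30), balances())
print(transfer(500), balances())
```

The second call fails on the debit, and the rollback also undoes the credit that had already been applied, so no money appears or disappears.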

Concluding our discussion of the SQL language, let us once again emphasize that it is a query language. It is impossible to write any complex application program that works with a database in SQL alone. For this purpose, modern DBMSs use a fourth-generation language (Fourth Generation Language, 4GL), which has both the basic capabilities of third-generation procedural languages (3GL), such as C, Pascal and Ada, and the ability to embed SQL statements into the program text, as well as user interface controls (menus, forms, user input, etc.). Today, 4GL is one of the de facto standards for database application development tools.

A data model is a set of data structures and operations for processing them. Using a data model, you can visually represent the structure of objects and the relationships established between them. Data model terminology is built around the concepts of “data element” and “binding rules”. A data element describes any set of data, and binding rules define the algorithms for interconnecting data elements. Many different data models have been developed to date, but three main ones are used in practice: the hierarchical, network and relational data models. Accordingly, one speaks of hierarchical, network and relational DBMSs.

  • Hierarchical data model. Hierarchically organized data is very common in everyday life. For example, the structure of a higher education institution is a multi-level hierarchy. A hierarchical (tree) database consists of an ordered set of elements. In this model, initial elements give rise to other elements, and these in turn give rise to further elements. Each child element has exactly one parent element.

Organizational structures, lists of materials, tables of contents in books, project plans, and many other sets of data can be presented in a hierarchical form. The integrity of links between ancestors and descendants is automatically maintained. Basic rule: no child can exist without its parent.

The main disadvantage of this model is that all work with the data must follow the hierarchy that was chosen as the basis of the database at design time. The need for constant reorganization of data (and often the impossibility of such reorganization) led to the creation of a more general model, the network model.

  • Network data model. The network approach to data organization is an extension of the hierarchical approach. This model differs from the hierarchical one in that each child element can have more than one parent element.

Because a network database can directly represent all kinds of relationships inherent in the data of an organization, the data can be navigated, explored and queried in various ways; that is, the network model is not bound to a single hierarchy. However, to query a network database you must understand its structure in depth (have the database schema at hand) and develop a mechanism for navigating it, which is a significant drawback of this model.

  • Relational data model. The basic idea of the relational data model is to represent any set of data as a two-dimensional table. In its simplest form, a relational model describes a single two-dimensional table, but more often it describes the structure of several tables and the relationships between them.

Relational data model

So, the purpose of an information system is to process data about objects of the real world, taking into account the connections between objects. In database theory, data are often called attributes, and objects are called entities. Object, attribute and connection are the fundamental concepts of information systems.

An object (or entity) is something that exists and is distinguishable; that is, an object is any “something” that has a name and a way to tell one such object from another. For example, every school is an object. A person, a class at school, a company, an alloy and a chemical compound are also objects. Objects can be not only material things but also more abstract concepts that reflect the real world: events, regions, works of art, books (not as printed products, but as works), theatrical performances, films, legal norms, philosophical theories, etc.

An attribute (or datum) is an indicator that characterizes an object and takes a numeric, text or other value for a specific instance of the object. An information system operates on sets of objects designed for a given subject area, using specific attribute values (data) of particular objects. For example, take the classes in a school as a set of objects. The number of students in a class is a datum that takes a numerical value (one class has 28, another has 32). The class name is a datum that takes a text value (one is 10A, another is 9B, etc.).

The development of relational databases began in the late 1960s, when the first works appeared that discussed the possibility of using familiar and natural ways of presenting data, the so-called tabular datalogical models, in database design.

The founder of the theory of relational databases is considered to be an IBM employee, Dr. E. Codd, who in June 1970 published the article A Relational Model of Data for Large Shared Data Banks. This article was the first to use the term “relational data model.” The theory of relational databases, developed by Codd in the USA in the 1970s, has a powerful mathematical basis that describes the rules for organizing data effectively. The theoretical framework he developed became the basis for the theory of database design.

E. Codd, being a mathematician by training, proposed using the apparatus of set theory (union, intersection, difference, Cartesian product) for data processing. He proved that any set of data can be represented in the form of two-dimensional tables of a special kind, known in mathematics as “relations”.

A relational database is one in which all data is presented to the user in the form of rectangular tables of data values, and all operations on the database are reduced to manipulations of these tables.

A table consists of columns (fields) and rows (records) and has a name that is unique within the database. A table reflects a type of real-world object (an entity), and each of its rows is a specific object. Each table column is a collection of values of a specific attribute of the object. The values are drawn from the set of all possible values of that attribute, which is called a domain.

In its most general form, a domain is defined by specifying a base data type to which the elements of the domain belong and an arbitrary Boolean expression applied to a data element. If the Boolean expression evaluates to true for a data element, that element belongs to the domain. In the simplest case, a domain is defined as the set of admissible values of one type. For example, the birth dates of all employees constitute the “birth date domain,” and the names of all employees constitute the “employee name domain.” The birth date domain must have a date data type, and the employee name domain must have a character data type.

If two values come from the same domain, they can be compared. For example, if two values are taken from the domain of birth dates, you can compare them and determine which employee is older. If the values are taken from different domains, comparing them is not allowed, since in all likelihood it makes no sense. For example, nothing definite will come of comparing an employee's name with a date of birth.
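The definition of a domain as a base type plus a Boolean condition, and the rule that only values from the same domain may be compared, can be illustrated with a short Python sketch. The names Domain, Value and is_less and the two sample conditions are assumptions introduced for this illustration; they do not correspond to any particular DBMS.

```python
from dataclasses import dataclass
from datetime import date
from typing import Any, Callable


@dataclass(frozen=True)
class Domain:
    name: str
    base_type: type
    condition: Callable[[Any], bool]  # the Boolean expression of the domain

    def contains(self, value: Any) -> bool:
        """A value belongs to the domain if its type and the condition both hold."""
        return isinstance(value, self.base_type) and self.condition(value)


@dataclass(frozen=True)
class Value:
    domain: Domain
    data: Any

    def __post_init__(self):
        if not self.domain.contains(self.data):
            raise ValueError(f"{self.data!r} is not in domain {self.domain.name}")


def is_less(a: Value, b: Value) -> bool:
    """Compare two values, but only if they come from the same domain."""
    if a.domain is not b.domain:
        raise TypeError("values from different domains cannot be compared")
    return a.data < b.data


# Hypothetical domains: dates that are not in the future, and non-empty names.
BIRTH_DATE = Domain("birth date", date, lambda d: d <= date.today())
EMPLOYEE_NAME = Domain("employee name", str, lambda s: s.strip() != "")

first = Value(BIRTH_DATE, date(1980, 5, 17))
second = Value(BIRTH_DATE, date(1992, 1, 3))
first_is_older = is_less(first, second)
```

Here the comparison of two birth dates succeeds, while is_less applied to a birth date and an employee name raises TypeError instead of returning a meaningless answer.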

Each column (field) has a name, which is usually written at the top of the table. When designing tables within a specific DBMS, it is possible to select for each field its type, that is, to define a set of rules for its display, as well as to determine the operations that can be performed on the data stored in this field. Sets of types may vary between different DBMSs.

The field name must be unique within the table, but different tables can have fields with the same name. Any table must have at least one field; the fields are arranged in the table in the order in which their names appeared when the table was created. Unlike fields, rows do not have names; their order in the table is not defined, and their number is logically unlimited.

Since the rows in a table are not ordered, a row cannot be selected by its position: there is no “first”, “second” or “last” row. Every table has one or more columns whose values uniquely identify each of its rows. Such a column (or combination of columns) is called the primary key. An artificial field is often introduced to number the records in a table, for example a sequence-number field, which guarantees that each record is unique. A key must have the following properties.

Uniqueness. At any given time, no two distinct tuples of the relation have the same value for the combination of attributes included in the key. That is, the table cannot contain two rows with the same identification number or passport number.

Minimality. No attribute included in the key can be removed from it without violating uniqueness. This means that you should not create a key that includes both the passport number and the identification number: either of these attributes alone is enough to identify a tuple uniquely. Nor should you include a non-unique attribute in the key; that is, a combination of the identification number and the employee's name must not be used as a key, because each row can still be uniquely identified after the employee's name is removed.

Every relation has at least one possible key, since the totality of all its attributes satisfies the condition of uniqueness - this follows from the very definition of the relation.

One of the possible keys is arbitrarily chosen as the primary key. The remaining possible keys, if any, are called alternate keys. For example, if you choose the identification number as the primary key, the passport number becomes an alternate key.
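The two key properties can be checked mechanically on sample data, as in the following Python sketch. The function names and the employee rows are assumptions made for illustration. Note the limitation: a key is a rule that must hold for all possible contents of the table, so a check on the current rows can only show that a set of attributes is consistent with being a key, not prove it.

```python
from itertools import combinations


def is_unique(rows, attrs):
    """True if no two rows share the same values for the given attributes."""
    seen = set()
    for row in rows:
        key = tuple(row[a] for a in attrs)
        if key in seen:
            return False
        seen.add(key)
    return True


def is_possible_key(rows, attrs):
    """Uniqueness plus minimality: no attribute can be dropped without losing uniqueness."""
    if not is_unique(rows, attrs):
        return False
    smaller = combinations(attrs, len(attrs) - 1)
    return not any(is_unique(rows, subset) for subset in smaller)


# Hypothetical employee data; two employees share a last name.
employees = [
    {"id_number": 1, "passport": "AA100", "name": "Ivanov"},
    {"id_number": 2, "passport": "AA200", "name": "Petrov"},
    {"id_number": 3, "passport": "AA300", "name": "Ivanov"},
]

results = {
    attrs: is_possible_key(employees, attrs)
    for attrs in [
        ("id_number",),
        ("passport",),
        ("name",),
        ("id_number", "passport"),
        ("id_number", "name"),
    ]
}
```

In results, only the single attributes id_number and passport qualify: name alone is not unique, and both two-attribute combinations are unique but not minimal.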

Relationships between tables are the most important element of the relational data model. They are supported by foreign keys.
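How a foreign key ties two tables together can be shown with a minimal sketch based on Python's standard sqlite3 module. The Department and Employee tables and their columns are hypothetical. In SQLite, foreign keys are enforced only after PRAGMA foreign_keys = ON has been executed; other DBMSs enforce them by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # required in SQLite for enforcement
conn.execute("CREATE TABLE Department (DeptId INTEGER PRIMARY KEY, Name TEXT NOT NULL)")
conn.execute(
    """CREATE TABLE Employee (
           IdNumber INTEGER PRIMARY KEY,
           Name TEXT NOT NULL,
           DeptId INTEGER NOT NULL REFERENCES Department(DeptId)
       )"""
)
conn.execute("INSERT INTO Department VALUES (1, 'Accounting')")
conn.execute("INSERT INTO Employee VALUES (10, 'Ivanov', 1)")

# An employee that refers to a department which does not exist is rejected.
try:
    conn.execute("INSERT INTO Employee VALUES (11, 'Petrov', 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# The relationship is followed by matching the values of DeptId in both tables.
pairs = conn.execute(
    "SELECT e.Name, d.Name FROM Employee e JOIN Department d ON d.DeptId = e.DeptId"
).fetchall()
```

The join at the end relies only on the values stored in the DeptId columns, with no pointers between the tables.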

When a relational database model is described, different terms are often used for the same concept, depending on the level of description (theory or practice) and the system (Access, SQL Server, dBase). Table 2.3 summarizes the terms used.

Table 2.3. Database terminology

Database theory    Relational databases    SQL Server
Relation           Table                   Table
Tuple              Record                  Row
Attribute          Field                   Column

Relational Databases

A relational database is a set of relations containing all the information that must be stored in the database. That is, the database is a set of tables needed to store all the data. The tables of a relational database are logically related to each other. The requirements for designing a relational database can in general be reduced to several rules.

  • Each table has a name that is unique within the database and consists of rows of the same type.

  • Each table consists of a fixed number of columns, and each column of a given row stores a single value. For example, if a table holds information about a book's author, publication date, circulation, etc., the column with the author's name cannot store more than one last name. If the book was written by two or more authors, additional tables must be used.

  • At no time does a table contain two rows that duplicate each other. Rows must differ in at least one value so that any row in the table can be uniquely identified.

  • Each column is assigned a name that is unique within the table and a specific data type, so that the column holds homogeneous values (dates, last names, telephone numbers, monetary amounts, etc.).

  • The complete information content of a database is represented as explicit data values, and this is the only method of representation. For example, relationships between tables are based on the data stored in the corresponding columns and not on pointers that define relationships artificially.

  • When processing data, you can freely access any row or any column of the table. The values stored in the table impose no restrictions on the order in which the data is accessed.

RELATIONAL DATABASE AND ITS FEATURES. TYPES OF RELATIONS BETWEEN RELATIONAL TABLES

A relational database is a collection of interconnected tables, each of which contains information about objects of a certain type. A table row contains data about one object (for example, a product or a customer), and the table columns describe various characteristics of these objects, their attributes (for example, name, product code, customer information). Records, i.e. table rows, have the same structure: they consist of fields that store object attributes. Each field, i.e. column, describes only one characteristic of the object and has a strictly defined data type. All records have the same fields and differ only in the values stored in them.

In a relational database, each table must have a primary key, a field or combination of fields that uniquely identifies each row in the table. If a key consists of several fields, it is called composite. Using the key value, you can find a single record. Keys also serve to organize information in the database.

Relational database tables must meet the requirements of normalization. Normalization of relations is a formal set of restrictions on how tables are formed; it eliminates duplication, ensures consistency of the data stored in the database, and reduces the labor costs of maintaining the database.

Let a Student table be created containing the following fields: group number, full name, student record number, date of birth, specialty name, faculty name. Such an organization of information storage will have a number of disadvantages:

  • duplication of information (the name of the specialty and faculty is repeated for each student), therefore, the volume of the database will increase;
  • updating information in the table is complicated, because every affected record has to be edited.

Table normalization is designed to address these shortcomings. There are three normal forms of relations.

First normal form. A relational table is in first normal form if and only if none of its rows contains more than one value in any of its fields and none of its key fields is empty. So, if information needs to be retrieved from the Student table by the student's first name, the Full Name field should be split into Last Name, First Name and Patronymic.

Second normal form. A relational table is in second normal form if it satisfies the requirements of first normal form and every field not included in the primary key is fully functionally dependent on the primary key. To reduce a table to second normal form, the functional dependencies between its fields must be determined. A functional dependency between fields is one in which, in an instance of an information object, a given value of the key attribute corresponds to exactly one value of the descriptive attribute.

Third normal form. A table is in third normal form if it satisfies the requirements of second normal form and none of its non-key fields is functionally dependent on any other non-key field. For example, in the Student table (Group No., Full Name, Gradebook No., Date of Birth, Headman), three fields, Gradebook No., Group No. and Headman, are in a transitive dependency: the group number depends on the gradebook number, and the headman depends on the group number. To eliminate the transitive dependency, some fields of the Student table must be moved to a separate Group table. The tables then take the following form: Student (Group No., Full Name, Gradebook No., Date of Birth) and Group (Group No., Headman).
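The decomposition of the Student table into Student and Group can be sketched in plain Python. The sample rows, the lowercase field names and the functions decompose and join_back are assumptions used only for illustration. The point of the sketch is that the headman is stored once per group after the split, and that joining the two tables on the group number restores the original rows.

```python
# Hypothetical rows of the unnormalized Student table.
students_flat = [
    {"gradebook_no": 101, "full_name": "Ivanov I. I.", "birth_date": "2005-03-14",
     "group_no": "A-1", "headman": "Sidorov"},
    {"gradebook_no": 102, "full_name": "Petrova A. S.", "birth_date": "2005-11-02",
     "group_no": "A-1", "headman": "Sidorov"},
    {"gradebook_no": 103, "full_name": "Orlov P. N.", "birth_date": "2004-07-30",
     "group_no": "B-2", "headman": "Orlova"},
]

STUDENT_FIELDS = ("gradebook_no", "full_name", "birth_date", "group_no")


def decompose(rows):
    """Split the rows into Student records and a Group table keyed by group number."""
    students = [{field: row[field] for field in STUDENT_FIELDS} for row in rows]
    groups = {row["group_no"]: row["headman"] for row in rows}
    return students, groups


def join_back(students, groups):
    """Rebuild the wide rows by looking up each student's group."""
    return [dict(student, headman=groups[student["group_no"]]) for student in students]


students, groups = decompose(students_flat)
restored = join_back(students, groups)

# Changing the headman now means editing exactly one Group record.
groups["A-1"] = "Petrova"
updated = join_back(students, groups)
```

Before the split, replacing the headman of group A-1 would have required editing every student row of that group, which is the update problem that normalization removes.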

The following operations are possible on relational tables:

  • Union of tables with the same structure. The result is a single table containing the records of the first table together with the records of the second (a record present in both appears once).
  • Intersection of tables with the same structure. The result contains the records that are present in both tables.
  • Difference of tables with the same structure. The result contains the records of the first table that are not present in the table being subtracted.
  • Selection (horizontal subset). The result contains the records that meet specified conditions.
  • Projection (vertical subset). The result is a relation containing some of the fields of the source table.
  • Cartesian product of two tables. The records of the resulting table are obtained by combining each record of the first table with each record of the second.
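The operations listed above can be tried on small tables modeled as Python sets of tuples. This is a simplified sketch: columns are addressed by position rather than by name, the sample part data and the helper names select, project and cartesian are assumptions, and because sets hold no duplicates, a record present in both tables appears only once in the merged result.

```python
from itertools import product


def select(rows, predicate):
    """Horizontal subset: keep the rows that satisfy the condition."""
    return {row for row in rows if predicate(row)}


def project(rows, positions):
    """Vertical subset: keep only the columns at the given positions."""
    return {tuple(row[i] for i in positions) for row in rows}


def cartesian(first_rows, second_rows):
    """Combine each row of the first table with each row of the second."""
    return {a + b for a, b in product(first_rows, second_rows)}


# Two hypothetical tables with the same structure: (Number, Name).
first = {("T145-A8", "bolt"), ("T145-A9", "nut")}
second = {("T145-A9", "nut"), ("K200-B1", "washer")}

merged = first | second        # merge of the two tables
common = first & second        # intersection
only_first = first - second    # subtraction
bolts = select(merged, lambda row: row[1] == "bolt")
names = project(merged, [1])
pairs = cartesian(first, second)
```

With two rows in each input, the Cartesian product contains four rows of four columns each, which shows how quickly this operation grows with table size.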

Relational tables can be related to each other, so data can be retrieved from several tables at once. Tables are linked in order to avoid storing the same data repeatedly and so ultimately reduce the size of the database. Two tables can be linked if they have columns with matching values.

There are the following types of relationships between tables:

  • one-to-one;
  • one-to-many;
  • many-to-many.

A one-to-one relationship means that one record of the first table corresponds to only one record of the second table and vice versa.

A one-to-many relationship means that one record of the first table corresponds to several records of the second table.

A many-to-many relationship means that one record of the first table corresponds to several records of the second table and vice versa.
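A many-to-many link is usually built from two one-to-many links through an additional linking table. The sketch below uses Python's standard sqlite3 module and the book and author example mentioned earlier. All table names, column names and sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE Author (AuthorId INTEGER PRIMARY KEY, LastName TEXT NOT NULL);
    CREATE TABLE Book (BookId INTEGER PRIMARY KEY, Title TEXT NOT NULL);
    -- Linking table: one row per (book, author) pair.
    CREATE TABLE BookAuthor (
        BookId INTEGER NOT NULL REFERENCES Book(BookId),
        AuthorId INTEGER NOT NULL REFERENCES Author(AuthorId),
        PRIMARY KEY (BookId, AuthorId)
    );
    INSERT INTO Author VALUES (1, 'Ivanov'), (2, 'Petrov');
    INSERT INTO Book VALUES (1, 'Databases'), (2, 'Networks');
    INSERT INTO BookAuthor VALUES (1, 1), (1, 2), (2, 1);
    """
)


def authors_of(title):
    """Last names of all authors of the given book, in alphabetical order."""
    rows = conn.execute(
        """SELECT a.LastName FROM Author a
           JOIN BookAuthor ba ON ba.AuthorId = a.AuthorId
           JOIN Book b ON b.BookId = ba.BookId
           WHERE b.Title = ? ORDER BY a.LastName""",
        (title,),
    )
    return [last_name for (last_name,) in rows]


def books_by(last_name):
    """Titles of all books by the given author, in alphabetical order."""
    rows = conn.execute(
        """SELECT b.Title FROM Book b
           JOIN BookAuthor ba ON ba.BookId = b.BookId
           JOIN Author a ON a.AuthorId = ba.AuthorId
           WHERE a.LastName = ? ORDER BY b.Title""",
        (last_name,),
    )
    return [title for (title,) in rows]
```

In this sketch one book has several authors and one author has several books, yet every column in every row still holds a single value.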






