Chapter 01 - Key points
| Site: | Campus Virtuel - Université de Jijel |
| Course: | Données semi-structurées DS |
| Book: | Chapter 01 - Key points |
Description
Key points seen in Chapter 01.
1. Types of data organizations
- After defining the conceptual organization of the data, a physical organization must be chosen.
- This organization defines how the data will be structured and saved on disk.
- This organization is transparent to the user.
- We can define three organizations:
  - Structured Data (or databases),
  - Unstructured Data (or free-form files, text files),
  - Semi-Structured Data (which will be the subject of this module).
- Each organization meets specific needs; the right organization must be chosen for each use case.
2. Structured Data
- These are generally databases.
- A database is a large collection of structured information stored on a permanent medium.
2.1. Presentation
- Data has a well-defined structure.
- In the case of databases:
  - Data is organized into tables.
  - Each table has several columns.
  - Each table contains records, and all records of a table have the same structure (the same number, types, and order of columns).
- The DBMS checks the data for conformity before inserting it.
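The conformity check described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the `student` table, its columns, and the sample values are hypothetical and do not come from the course material, and SQLite stands in for any DBMS.

```python
import sqlite3

# Hypothetical schema for illustration: every record has the same three columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE student (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        grade REAL NOT NULL CHECK (grade BETWEEN 0 AND 20)
    )
    """
)

# A conforming record is accepted.
conn.execute(
    "INSERT INTO student (id, name, grade) VALUES (?, ?, ?)", (1, "Amina", 15.5)
)

# A record that violates a declared constraint is rejected before it is stored.
rejected = False
try:
    conn.execute(
        "INSERT INTO student (id, name, grade) VALUES (?, ?, ?)", (2, "Karim", 42.0)
    )
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM student").fetchone()[0]
print(rejected, row_count)
```

Only the first record ends up in the table: the second one fails the CHECK constraint, so the DBMS refuses it.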
2.2. Advantages of structured data
- Centralized data: no redundancy,
- Consistent data: application of constraints,
- Can handle large amounts of data,
- High-level operations.
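As a rough illustration of what "high-level operations" can mean, the sketch below computes an average per student with a single declarative SQL query instead of hand-written loops. The `enrollment` table and its contents are hypothetical, and SQLite (through Python's sqlite3 module) is just one possible DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE enrollment (student TEXT NOT NULL, module TEXT NOT NULL, grade REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO enrollment VALUES (?, ?, ?)",
    [
        ("Amina", "DS", 16.0),
        ("Amina", "BD", 12.0),
        ("Karim", "DS", 10.0),
        ("Karim", "BD", 14.0),
    ],
)

# One query expresses grouping, aggregation, and sorting; the DBMS decides how to run it.
averages = conn.execute(
    "SELECT student, AVG(grade) FROM enrollment GROUP BY student ORDER BY student"
).fetchall()
print(averages)
```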
2.3. Disadvantages of structured data
- Binary format specific to the DBMS,
- Difficult for humans to read in its native format,
- Difficult to read using another DBMS,
- Difficult to exchange.
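To make the "binary format" point concrete, the following sketch creates a small SQLite database file and reads its first bytes back as raw data. SQLite is used here only as a convenient example of a DBMS-specific file format; the file name `demo.db` and the table `t` are hypothetical.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.db")

    # Create a tiny database on disk.
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.execute("INSERT INTO t VALUES (1)")
    conn.commit()
    conn.close()

    # Read the file as raw bytes, the way a program unaware of SQLite would see it.
    with open(path, "rb") as f:
        header = f.read(16)

print(header)
```

The file starts with a format signature that ends in a NUL byte, and the rest of the file is laid out in pages that only software aware of this format can interpret.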
3. Unstructured Data
- By unstructured data, we mean text files.
- This is the most basic type of data.
3.1. Presentation
- A text file is a sequence of characters with no explicit structure.
- These files are edited using text editors like Notepad.
- Important: If a word processing program (like Microsoft Word) is used to edit a text file, you must explicitly select the "Text File" type when saving. Otherwise, Microsoft Word saves the file in its own format (with the .docx extension), which is not plain text and cannot be read in a simple text editor.
3.2. Advantages of unstructured data
- Simple,
- Easy to edit,
- Easy for humans to read,
- Easy to share.
3.3. Disadvantages of unstructured data
- Manipulation is limited to low-level operations (such as open, read, write, and close).
- Difficult to use in automated processing: the lack of structure makes writing algorithms to process them very difficult.
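The difficulty of automated processing can be seen in a small sketch. The sentence and the extraction rule below are made up for illustration: because the text carries no structure, the program has to guess how a grade is phrased, and any phrasing it did not anticipate is silently missed.

```python
import re

# Hypothetical free-form note: the same kind of fact is phrased three different ways.
note = "Amina scored 15.5 in DS. Karim got 12 out of 20 in the same module; Sara: absent."

# Heuristic rule: a capitalized word, then "scored" or "got", then a number.
pattern = re.compile(r"([A-Z][a-z]+) (?:scored|got) (\d+(?:\.\d+)?)")

grades = {name: float(value) for name, value in pattern.findall(note)}
print(grades)
```

The rule finds Amina and Karim but says nothing about Sara, and it would break as soon as someone wrote "obtained" instead of "got". Each new phrasing requires a new rule.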
4. Semi-Structured Data
- Semi-structured data aims to meet the new needs that arise in the web context:
  - The need for data exchange, which requires "open" data rather than proprietary binary formats,
  - The need for a structure in the exchanged data, so that it can be processed and displayed correctly to the user.
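As a hedged illustration of these two needs, the sketch below uses XML, one common semi-structured format (the choice of XML, the element names, and the sample values are assumptions made for this example, not taken from the text above). The document is plain text, so it is easy to exchange, yet its tags give a program enough structure to process it, even when records do not all carry the same fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: readable text, with tags that describe the data.
document = """
<students>
  <student id="1">
    <name>Amina</name>
    <grade>15.5</grade>
  </student>
  <student id="2">
    <name>Karim</name>
    <email>karim@example.org</email>
  </student>
</students>
""".strip()

root = ET.fromstring(document)

# Each record becomes a dictionary; the keys come from the tags found in that record.
records = [
    {"id": student.get("id"), **{child.tag: child.text for child in student}}
    for student in root.findall("student")
]
print(records)
```

The second record has an `email` element and no `grade`, which a fixed table schema would not allow without changing the table, while the parser handles both records the same way.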