Chapter 01 - Key points

Site: Campus Virtuel - Université de Jijel
Course: Données semi-structurées DS (Semi-Structured Data)
Book: Chapter 01 - Key points
Date: Sunday, 22 March 2026, 06:24

Description

Key points seen in Chapter 01.

1. Types of data organizations

  • Once the conceptual organization of the data has been defined, a physical organization must be chosen.
  • The physical organization defines how the data will be structured and saved on disk.
  • It is transparent to the user: the user works with the data without needing to know how it is stored.
  • We can define three organizations:
    • Structured Data (or databases),
    • Unstructured Data (or free-form files, text files),
    • Semi-Structured Data (which will be the subject of this module).
  • Each organization meets specific needs; the right organization must be chosen for each use case.

2. Structured Data

  • Structured data is generally stored in databases.
  • A database is a large collection of structured information stored on a permanent medium.

2.1. Presentation

  • Data has a well-defined structure.
  • In the case of databases:
    • Data is organized into tables.
    • Each table has several columns.
    • Each table contains records, and all records in a table have the same structure (same number, types, and order of columns).
  • The DBMS checks the data for conformity before inserting it.
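
The conformity check described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the choice of SQLite, the table `student`, its columns, and the 0 to 20 grade range are assumptions made for the example, not part of the course material.

```python
import sqlite3

# In-memory database; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE student (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        grade REAL NOT NULL CHECK (grade BETWEEN 0 AND 20)
    )
    """
)


def try_insert(row):
    """Return True if the DBMS accepts the row, False if it rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO student (id, name, grade) VALUES (?, ?, ?)", row
            )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = try_insert((1, "Amina", 15.5))        # conforms to the table
rejected_null = try_insert((2, None, 12.0))      # violates NOT NULL on name
rejected_range = try_insert((3, "Karim", 25.0))  # violates the CHECK on grade

# Only the conforming record was stored.
stored = conn.execute("SELECT id, name, grade FROM student").fetchall()
```

Every stored record has the same three columns in the same order, and the two non-conforming rows never reach the table.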

2.2. Advantages of structured data

  • Centralized data: no redundancy,
  • Consistent data: application of constraints,
  • Can handle large amounts of data,
  • High-level operations (for example, declarative queries written in SQL).
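
As a rough illustration of what a high-level operation looks like, the sketch below computes an average per module with a single declarative query. SQLite, the table `enrollment`, and the sample rows are assumptions chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table and data, used only to have something to query.
conn.execute("CREATE TABLE enrollment (student TEXT, module TEXT, grade REAL)")
conn.executemany(
    "INSERT INTO enrollment VALUES (?, ?, ?)",
    [
        ("Amina", "DS", 16.0),
        ("Karim", "DS", 12.0),
        ("Amina", "BD", 14.0),
        ("Karim", "BD", 10.0),
    ],
)

# One statement says WHAT is wanted (average grade per module);
# the DBMS decides HOW to scan, group, and aggregate the records.
averages = conn.execute(
    "SELECT module, AVG(grade) FROM enrollment GROUP BY module ORDER BY module"
).fetchall()
```

The program never opens a file or loops over records itself; grouping and aggregation are delegated to the DBMS.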

2.3. Disadvantages of structured data

  • Binary format specific to the DBMS,
  • Difficult for humans to read in its native format,
  • Difficult to read using another DBMS,
  • Difficult to exchange.

3. Unstructured Data

  • By unstructured data, we mean text files.
  • This is the most basic type of data.

3.1. Presentation

  • An unstructured file is a sequence of characters with no explicit structure.
  • These files are edited using text editors like Notepad.
  • Important: If a word processing program (like Microsoft Word) is used to edit a text file, you must explicitly select the "Text File" type when saving. Otherwise, Microsoft Word will use its own format (with the .docx extension), which is not plain text: a .docx file is a compressed archive that a text editor cannot display directly.

3.2. Advantages of unstructured data

  • Simple,
  • Easy to edit,
  • Easy for humans to read,
  • Easy to share.

3.3. Disadvantages of unstructured data

  • Can only be manipulated through low-level operations (such as open, read, write, and close).
  • Difficult to use in automated processing: the lack of structure makes writing algorithms to process them very difficult.
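
The difficulty of automated processing can be seen in a small sketch. The free-text notes and the two patterns below are invented for the example; the point is that every new way of phrasing the same information needs another hand-written rule.

```python
import re

# Hypothetical free-form notes: the same kind of information, phrased differently.
notes = (
    "Amina got 15.5 in the exam.\n"
    "Grade for Karim: 12\n"
    "Sara - absent, will retake next week.\n"
)

# One ad hoc pattern per phrasing seen so far.
patterns = [
    re.compile(r"^(?P<name>\w+) got (?P<grade>\d+(?:\.\d+)?)"),
    re.compile(r"^Grade for (?P<name>\w+): (?P<grade>\d+(?:\.\d+)?)"),
]

grades = {}
unparsed = []
for line in notes.splitlines():
    for pattern in patterns:
        match = pattern.search(line)
        if match:
            grades[match["name"]] = float(match["grade"])
            break
    else:
        # No pattern matched: the program cannot interpret this line.
        unparsed.append(line)
```

Two lines are recovered only because their exact wording was anticipated; the third is left uninterpreted, and any new wording would require yet another pattern.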

4. Semi-Structured Data

  • Semi-structured data aims to meet new needs that arose in the web context:
    • The need for data exchange, which requires "open" data rather than proprietary binary formats,
    • The need for a structure in the exchanged data, so that it can be processed and displayed correctly to the user.
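
The two needs above can be illustrated with a short sketch. This chapter does not name a specific format, so the use of XML here is an assumption, and the element names (`students`, `student`, `name`, `grade`, `email`) are hypothetical.

```python
import xml.etree.ElementTree as ET

# The document is plain text (readable, easy to exchange), yet the tags
# give it a structure. The two records do not have the same fields,
# which a fixed table schema would not allow.
document = """
<students>
  <student id="1">
    <name>Amina</name>
    <grade>15.5</grade>
  </student>
  <student id="2">
    <name>Karim</name>
    <email>karim@example.org</email>
  </student>
</students>
""".strip()

root = ET.fromstring(document)

# Generic processing: no knowledge of the wording is needed, only of the tags.
records = []
for student in root.findall("student"):
    record = {"id": student.get("id")}
    for field in student:
        record[field.tag] = field.text
    records.append(record)
```

The same loop handles both records even though their fields differ, which is the flexibility that a format sitting between a fixed table and free text provides.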