Semi-Structured Data

Definition: Semi-Structured Data

To understand semi-structured data, let us first define structured and unstructured data:

Structured Data – Data that is formatted and organized into a defined data structure, so that its elements can be addressed and accessed in various combinations to make the information useful.

Unstructured Data – Data that is not formatted or organized in this way and is therefore harder to process.

Semi-structured data is a cross between structured data and unstructured data. It is neither raw data nor data held in a conventional database system. It is not as rigidly formatted or organized as structured data, yet it carries certain associated information (e.g. metadata) that allows the elements it contains to be addressed, which makes it possible to analyze. For this reason the data is called self-describing.

For example, a Word document can be considered unstructured data. But we can add certain keywords, tags and markers (e.g. author name, date created) that make the document easier to find when a user searches for those keywords. This makes the data semi-structured. However, the data still lacks the complex formatting of a database and hence cannot be called structured.

The semi-structured data model has its own set of advantages and disadvantages. It is easier to a) discover new data and load it, b) integrate heterogeneous data and c) query the data without knowing the data types. On the other hand, optimization of data is harder in a semi-structured data model. Semi-structured data is used especially in data integration.
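The advantages listed above can be illustrated with a minimal Python sketch. The records, field names and the `find` helper are hypothetical and chosen only for illustration; JSON is used here simply as one common way to write self-describing records.

```python
import json

# Hypothetical records: each one names its own fields, and no two
# records share exactly the same set of fields.
raw = """
[
  {"title": "Q3 report", "author": "A. Rao", "created": "2021-10-04"},
  {"title": "Site photo", "creator": "J. Lee", "location": "Pune"},
  {"title": "Meeting notes", "author": "A. Rao", "pages": 3}
]
"""

# Loading needs no schema declared in advance.
records = json.loads(raw)

# Discover which fields exist by reading the data itself.
all_fields = sorted({key for record in records for key in record})


def find(records, key, value):
    """Return the records that carry `key` with the given `value`."""
    return [r for r in records if r.get(key) == value]


# Query by field name without knowing the field's type beforehand.
titles = [r["title"] for r in find(records, "author", "A. Rao")]
print(all_fields)
print(titles)
```

Records that lack the queried field are simply skipped. The price of this flexibility is that every query has to scan and inspect each record, which hints at why optimization is harder than in a fixed-schema database.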

A few other examples of semi-structured data are:

• Emails, which have the sender, recipient, date, time and other tags added to the unstructured content of the message, making them semi-structured.

• Graphics and pictures, which can be marked with keywords such as the creator, location and date, making it possible to organize and access them.
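The email example can be sketched with Python's standard `email` module. The message text and addresses below are made up for illustration; the point is only that the header fields can be addressed by name while the body remains free text.

```python
from email import message_from_string

# A made-up message: named header fields, a blank line, then free text.
raw_message = """\
From: alice@example.com
To: bob@example.com
Date: Mon, 04 Oct 2021 09:30:00 +0000
Subject: Lunch on Friday?

Hi Bob, are you free for lunch on Friday? Let me know.
"""

msg = message_from_string(raw_message)

# The tags (headers) are addressable by name.
sender = msg["From"]
recipient = msg["To"]
subject = msg["Subject"]

# The content of the message is still unstructured text.
body = msg.get_payload()

print(sender, recipient, subject)
print(body)
```

Searching or sorting a mailbox by sender or date only needs the headers; understanding what the message says still requires processing the unstructured body.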

XML and other markup languages like Standard Generalized Markup Language (SGML) are generally used to manage semi-structured data.
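As a small illustration of how markup makes data self-describing, the sketch below parses a made-up XML fragment with Python's standard `xml.etree.ElementTree` module. The element names (`document`, `author`, `created`, `body`) are assumptions for this example, not part of any standard.

```python
import xml.etree.ElementTree as ET

# A made-up document: the tags describe the data they enclose.
xml_text = """
<document>
  <author>A. Rao</author>
  <created>2021-10-04</created>
  <body>Free-form text of the document goes here.</body>
</document>
""".strip()

root = ET.fromstring(xml_text)

# The field names are read from the data itself, not from a schema.
fields = {child.tag: child.text for child in root}

# An individual element can be addressed by its tag.
created = root.find("created").text

print(fields)
print(created)
```

The tags play the role of the metadata described earlier: they let a program locate the author or creation date even though the body text itself has no fixed structure.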




