Sunday, July 19, 2009

Structured Data vs Unstructured Data

"The labels "structured data" and "unstructured data" are often used ambiguously by different interest groups; and often used lazily to cover multiple distinct aspects of the issue. In reality, there are at least three orthogonal aspects to structure:

  • The structure of the data itself. 
  • The structure of the container that hosts the data. 
  • The structure of the access method used to access the data.

These three dimensions are largely independent and one does not need to imply another. For example, it is absolutely feasible and reasonable to store unstructured data in a structured database container and access it by unstructured search mechanisms."
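To make that last example concrete for myself, here is a minimal sketch of free-form text kept in a structured container (a database table) and retrieved with an unstructured access method (a plain keyword match). The table name `docs`, the helper names, and the sample notes are all my own assumptions for illustration; a real deployment would more likely use a proper search engine than a `LIKE` query.

```python
import sqlite3


def build_store(documents):
    """Put free-form text into a structured container: one row per document."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany(
        "INSERT INTO docs (body) VALUES (?)",
        [(text,) for text in documents],
    )
    return conn


def keyword_search(conn, term):
    """Unstructured access: match a keyword anywhere in the text body."""
    rows = conn.execute(
        "SELECT id FROM docs WHERE body LIKE ? ORDER BY id",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]


# Hypothetical sample content: the text itself follows no schema.
conn = build_store([
    "Meeting notes: the ECM rollout slipped a week.",
    "Lunch menu for Friday.",
])
hits = keyword_search(conn, "rollout")
```

The container (rows and columns) is structured, but nothing about the text inside `body` is, and the search does not rely on any structure either.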

So when we say 80% of your data is unstructured, do we mean "not stored in a database"? Is XML-tagged data structured? (Yes.) Is it still structured if it is stored on the file system? And what about a .pdf stored in a database and indexed via a search engine?

I have never been asked by a customer to clarify what I mean by unstructured data, but I know it is coming.

One participant in the Oracle conversation has this take:

As per my experience, 'unstructured data' is data/information/content which doesn't have a specific  structure/rule attached to it. For example, a word document or an HTML page can contain data/information/content in any structure. One can have any number of images, paragraph etc. Also, in most of the cases, there is no relation between the content(s). On the other hand, 'structured data' has structure/rules attached to it e.g. a product. A product will always have a code, manufacturer, category etc. and thus defines the structure of data. 

Now, the above is business terms. So, you can store them the way you wish to have your technical solution- it could be Database, File System etc.

So this would basically be saying that it is the structure of the data itself that determines whether it is structured or unstructured.

However, within the ECM space, I tend to take a different tack, at least when explaining it to myself. I typically take a more simplistic approach: structured vs unstructured is cellular data (data held in database table cells) vs non-cellular data. DB LOB types are the special exceptions.

<disclaimer>Of course, I take this approach when presenting ECM, which deals primarily with content stored in non-DB table cell formats/locations.</disclaimer>

While XML data may be structured, it is contained in a content item (an XML document) that is itself unstructured. Were the XML data to be parsed and inserted into a table structure that mirrored the XML tag names (for example), at that point the data in the DB would be considered "structured", while the XML document and all the data it contained would still be considered "unstructured".
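A rough sketch of that parse-and-insert step, assuming a made-up product document whose tags (`code`, `manufacturer`, `category`) borrow the product example quoted earlier. The table layout, function name, and sample values are hypothetical and only there to show the idea.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical XML document: as a content item, this is the "unstructured" side.
XML_DOC = """<products>
  <product><code>A100</code><manufacturer>Acme</manufacturer><category>Tools</category></product>
  <product><code>B200</code><manufacturer>Globex</manufacturer><category>Toys</category></product>
</products>"""


def load_products(xml_text):
    """Parse the XML and copy each product into a table that mirrors the tag names."""
    root = ET.fromstring(xml_text)
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE product (code TEXT, manufacturer TEXT, category TEXT)"
    )
    for item in root.findall("product"):
        conn.execute(
            "INSERT INTO product VALUES (?, ?, ?)",
            (
                item.findtext("code"),
                item.findtext("manufacturer"),
                item.findtext("category"),
            ),
        )
    return conn


conn = load_products(XML_DOC)
# Once the values sit in table cells, they can be queried column by column.
rows = conn.execute(
    "SELECT code, manufacturer FROM product WHERE category = ?",
    ("Tools",),
).fetchall()
```

The rows in `product` are what I would call structured; the original XML document, sitting on a file system or in a LOB column, stays on the unstructured side of the line.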

Posted via web from swathidharshananaidu's posterous
