So you may have read about the fact that we're thinking pretty hard about re-architecting things to use lots of XML at work. Now we're facing an interesting challenge, and I'd like to ask the blog world for advice. Surely we're not the first group to encounter this.

The Problem

The problem is that we have tons of data that we need to represent in XML. Much of the data is related to stock tickers. For example, given YHOO, we have earnings information, P/E ratio, average daily volume, EPS, full company name, and so on. However, we also have some data that doesn't map to particular tickers--instead it maps to more general symbols that we use internally (industry codes, etc.).

The Goal

We're trying to figure out how to create an XML Schema (or several?) that encompasses the full collection of data that we may want to publish both on-line and internally. It's hard to figure out the right approach or where to start.

Do we build one gigantic schema? If so, what problems do we run into down the road? Will we be generating new versions too often?

Should we instead build several schemas? One for us, one for those who consume our data, and others as needed?

If anyone has written about this from a practical point of view, we'd love pointers to it. (At least I would.) Theory is all well and good, but if you haven't been through this exercise before, I'm going to be skeptical about your recommendations. Why? This feels like a hard problem that looks easy on the surface.

Also, what about naming conventions for both namespaces and elements? That's likely to become a semi-religious debate quickly, but it can't hurt (too much) to ask. :-)

Posted by jzawodn at October 30, 2002 11:20 AM

Reader Comments
# Neel Krishnaswami said:

Most XML schema languages define a way of specifying algebraic data types. There's an entire class of functional programming languages -- Haskell, ML, and OCaml -- that use algebraic data types as their fundamental structuring mechanism. You could do a lot worse than spending some time playing with them. They have the big advantage of being vastly more terse than schema languages, so it's really, really easy to experiment and learn.

An "algebraic data type" is a data type built up using /sum/ and /product/ type constructors. A product type is just the cartesian product of some other types. Eg, if you have bool and int as base types, then using a product type constructor will let you create (int * int), (bool * bool), (int * (int * int)) and so on as derived data types. A sum type is a little less familiar; it's based on the set theoretic idea of disjoint sum -- think of a C-style union, only you are required to provide a tag to distinguish the alternatives.
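To make that concrete, here is a minimal sketch applied to the kind of data described in the post. The type and constructor names (symbol, Ticker, Industry, volume_entry, describe) are invented for illustration, as is the volume figure; nothing here is an actual schema.

```ocaml
(* Sum type: a symbol is EITHER a ticker OR an internal industry code.
   The constructor (Ticker / Industry) is the tag a bare C union lacks. *)
type symbol =
  | Ticker of string
  | Industry of int

(* Product type: a symbol paired with an average daily volume. *)
type volume_entry = symbol * int

(* One case per alternative; the compiler warns if one is missing. *)
let describe (s : symbol) : string =
  match s with
  | Ticker t -> "ticker " ^ t
  | Industry code -> "industry code " ^ string_of_int code

(* The volume figure is invented for illustration. *)
let sample : volume_entry = (Ticker "YHOO", 1_000_000)

let () = print_endline (describe (fst sample))
(* prints: ticker YHOO *)
```

The sum type is what captures "some data maps to tickers, some maps to internal symbols" without pretending the two are the same thing.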

Eg, the abstract syntax of a small language could be represented as the following OCaml data type:


type expr =
| Varref of string
| Lit of int
| Lambda of string * expr
| Call of expr * expr


You might write the ast of a function call as 'Call(Varref "factorial", Lit 5)'. The whole thing is very similar to writing a BNF grammar for your data, only with more expressive typechecking.
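As a small sketch of what that typechecking buys you, a function over the type is written with one case per constructor. The function name show and its output format are my own choices for illustration; the type is repeated so the snippet stands alone.

```ocaml
type expr =
  | Varref of string
  | Lit of int
  | Lambda of string * expr
  | Call of expr * expr

(* Render an expression as text, one case per constructor.
   Leaving a constructor out produces a compiler warning. *)
let rec show (e : expr) : string =
  match e with
  | Varref name -> name
  | Lit n -> string_of_int n
  | Lambda (param, body) -> "(fun " ^ param ^ " -> " ^ show body ^ ")"
  | Call (f, arg) -> show f ^ "(" ^ show arg ^ ")"

let () = print_endline (show (Call (Varref "factorial", Lit 5)))
(* prints: factorial(5) *)
```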

on October 30, 2002 01:08 PM
# Manuel Lemos said:

If you are going to store massive amounts of data that you may want to read or update later, don't use XML, it is not streamable. It will take a lot of memory and CPU to parse and extract the information. Consider CSV.

on October 30, 2002 09:52 PM
# Mark Matthews said:

I won't advise you on whether or not to create your own schemas, re-use others, keep internal/external versions, etc, as I've been on projects that have done it in different ways, and each way has different issues... Standard schemas are always missing something that you need, external parties won't adopt your schema, etc. That's why tools like XSL become invaluable.

What I will say, is that once you do decide on a schema, _do_ make a DTD (or XSD), document it, and publish it, and make sure everyone uses it.

The past projects I've been on have considered formal schemas and documentation as overhead, so they didn't create them, or created them later. This made it _difficult_ to coordinate between parties when exchanging documents. Sometimes it felt like we would've been more productive using tab-delim... tracking down XML structure 'bugs' in large documents soaked up a lot of my development time.

on October 31, 2002 07:17 AM
# Shawn MacFarland said:

XML is just a nice representation of the data from a specific data model. Don’t let the detailed requirements of XML cloud your design process.

The first couple of steps of the process should be a data modeling process, without too much XML Schema concern. Your goal should be to separate your financial information into distinct groupings composed of useful base type sets. All base type sets, such as equities, options, and other contracts, should consist of collections of type information based solely within the XML Schema domain: dates, currency values, etc. We can imagine that base type sets will not change too frequently; a simple equity will always be a simple equity. However, new financial instruments will always appear, and it should be possible to version and inherit from existing instrument definitions. Once the base types are established and grouped to satisfy your needs in each of your subject areas, you will have a very large data model, consisting of many pieces.

At this point you need to analyze how you expect the groups and their base types to change over time. This expectation of change will determine how to convert your data model to XML Schema: which pieces need to be in their own namespace, which base types are related to each other, and which ones will change in relation to which others. Literally, you need to construct a second data model representing your expectations of change in the first. It is this second data model that will help you map your representation to namespaces and other XML Schema-specific needs.
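A rough sketch of the first data model, written in OCaml rather than XML Schema only because it is terse. Every type, field, and value below (money, equity, option_contract, instrument, the strike and expiry) is a hypothetical stand-in, not a proposed schema. The comments mark which parts are assumed to be stable and which are expected to grow.

```ocaml
(* Base types: assumed to change rarely. *)
type currency = USD | EUR
type money = { amount : float; currency : currency }

(* Stable base type set: a simple equity stays a simple equity. *)
type equity = { ticker : string; company_name : string }

(* A newer instrument builds on an existing definition instead of editing it.
   The expiry is kept as a plain date string for brevity. *)
type option_contract = { underlying : equity; strike : money; expiry : string }

(* This grouping is where change is expected: a new constructor is added
   when a new kind of instrument appears. *)
type instrument =
  | Equity of equity
  | Equity_option of option_contract

let ticker_of (i : instrument) : string =
  match i with
  | Equity e -> e.ticker
  | Equity_option o -> o.underlying.ticker

(* Strike and expiry are invented for illustration. *)
let yhoo = { ticker = "YHOO"; company_name = "Yahoo! Inc." }
let yhoo_call =
  Equity_option
    { underlying = yhoo;
      strike = { amount = 10.0; currency = USD };
      expiry = "2003-01-17" }

let () = print_endline (ticker_of yhoo_call)
(* prints: YHOO *)
```

In schema terms, the parts marked stable are candidates for a shared namespace that is versioned rarely, while the instrument grouping is the piece you would expect to revise.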

on October 31, 2002 11:38 AM
# Stuart Myles said:

Hi, Over at the Wall Street Journal Online, we've been doing this with XML and market data for quite some time. And we've been very happy with the results. We started doing it a little over 5 years ago, so there was not much in the way of standards to choose from. (There were some, but they really weren't any good).

However, there is a standard that may be tackling the kind of thing you need. That's MDDL (http://www.mddl.org). You'll see it has only covered a subset of what you're looking for. However, it was designed from the ground up to be eXtensible (I know, because I helped design it in the early days). If you're at all interested, you should contact James Hartley (jhartley@siia.net).

Regards,

Stuart
---
WSJ.com Architect

on November 5, 2002 07:23 PM
# said:

So where did you eventually land with this and what was your experience?

on December 15, 2009 08:36 AM