Asked and Answered: When Are You Going to Post Something New?

A couple months ago, someone asked:

When are you going to post some new stuff ?

First let me say that I'm flattered that anyone even noticed that I stopped posting stuff, and even more flattered that they weren't happy about it. So thanks. I appreciate you asking.

With that said, the answer is depressingly simple:

I have no idea.

I started Full Table Scan at a time when I was employed but largely unoccupied. Think wrist restraints of the precious metal variety. Boooorrrring. Combing my love of database technology and writing seemed an ideal way to pass the time.

When my employment nightmare finally came to an end, I decided to take some time off, enjoy my kids while they were still young, and finish the college degree that I had deferred after three very un-stellar years following high school. I took to doing some contract work part-time to grease the wheels and keep my brain from dripping out my ears, but largely I just checked out of society in general and specifically the database world.

A funny thing happened when I started doing contract work - I had to track how long I spent working. From there I would generate invoices, in order to get paid. As a result I became very conscious of where my time was spent and when I was spending time doing something other than playing with my kids. And that pretty much spelled the end of the posts to this blog.

I love databases, and I love to write, but I love my wife and kids a whole lot more. Given the choice between unpaid time with them and unpaid time writing blog posts, well... sorry y'all.

With the college degree finished and the children growing older, however, I've slowly increased the amount of time I spend working. (It's hard to spend time with kids who are in school most or all of the day.) I'm sure that I still don't count as a truly productive member of society, but then again I probably never did. Maybe some day I'll also get back into the habit of paying close attention to the database industry and recording my thoughts about it, but for now I think I'll keep plugging away at the more mundane problems my clients throw at me each day.

It certainly pays better.

Question: Why is Row-based Processing Bad?

A reader writes:

We have recently bought Netezza v4.5 and planning to move out EDW from Oracle 10g. We have a lot of PL/SQL ETL written along with OWB ETL. Do you have any examples why cursors ( Row Based Processing) are bad choice for both Netezza and Oracle for data warehousing application.

Unfortunately, I do not have any real-world examples of why this is a bad idea that I can share. At least not without getting myself in trouble and/or making some people angry at me, anyway. ;-) But this does give me a chance to get on my soapbox about one of my peeves:

When dealing with relational databases think sets, not rows.

Though we're used to thinking in terms of individual rows, SQL works on sets of rows. Relational databases process SQL, so... if you want your database to perform well, you've got to design things in terms of sets, not rows.

To a large degree people already think about sets without really realizing it. Here's a really contrived example: Let's say you wanted to find all the rows in a table where the Color field has a value of blue. You wouldn't write a stored procedure to loop over all the rows, inspecting the Color field of each, would you? No, you'd write a query that includes "WHERE Color = 'blue'" in it, and let the database figure out which set of rows within that table fits that criterion. See? You're already thinking about sets without even realizing it. :-)

Where set-based thinking really becomes important, however, is scalability. Set-based operations are easy to break up into chunks and thus easy the parallelize. In a partitioned table, each partition is essentially its own set, so each partition can be operated on independently. In an MPP system like Netezza, where each table is distributed across all the SPUs, each SPU can operate on its set of data independently. Operating on independent sets in parallel makes things not only fast, but scalable - more independent sets equals higher possible scalability.

Cursors, on the other hand, well, they allow you to go one row at a time over all the rows you have to deal with. That's what they were designed for, after all. Wanna guess how (not) fast and (not) scalable that is?

The other factor to consider here is portability, or lack thereof. I'm not going to claim that all SQL is effortlessly portable between different database systems, but there's no question that queries are easier to move between systems than stored procedures. All the stored procedure languages I'm familiar with are proprietary, at any rate, and proprietary generally equals 'intentionally not portable'.

Why is portability important? Nobody likes to admit it, but for any given software system, the odds are good that you're going to have to rework everything at least once. Whether that's because you're forced to change database systems or because you need to generally overhaul things to get the functionality you want, it's highly likely to happen. Functionality built on a non-proprietary foundation makes both of those easier.

If you don't believe me, see the question that started this whole discussion. :-P

Disclaimer: I detest stored procedures, and it has been a mercifully long time since I've worked extensively with Oracle. Thus I am unaware of any easy way to have multiple instances of a stored procedure running with each operating on different subsets of data, with Oracle or any other DB. If there is then I expect to be soundly and publicly redressed by those who know better. Regardless, I would think that set-based approaches would still make for more scalable solutions simply because sets are what relational databases are designed to work on, not procedural programs.

Request for Comments: Database Time Series Support

Vendors, experts, enthusiasts, lend me your ears.

Dean M writes:

Having an interest in Times Series within Capital Markets, would like to understand the Time Series offering of the various vendors. To start the discussion we could contrast and compare the basic architecture approaches (Column -vs- Row) orientation, language expressiveness (SQL -vs- Proprietary) and the ever popular memory model (in memory, disk based, hybrid).

I will (uncharacteristically) withhold my own thoughts and opinions, at least for the time being, as I'd like to hear what everyone else has to say.

So: what database(s) provide the best time series functionality and why?

Question: Netezza vs. Vertica

I received the following question last week from Mike T:

curious as to your evaulation between vertica and netezza. i have used netezza for several projects at a couple of clients.

My response, unfortunately, is that I can't quite say yet.
Continue reading "Question: Netezza vs. Vertica" »