Scalability: Millions of tables vs one big master table and millions of views

We have to migrate an old (million-LOC) system from SAP ADS (formerly Sybase Advantage Database Server), an ISAM-based system using so-called free tables, to a PostgreSQL database.

The system implements a kind of multitenancy by encoding some basic information (chancellery ID, client ID, fiscal year, etc.) directly into the Windows folder paths; the application uses these to build SQL statements with the appropriate paths for the specific table files.

Since SAP decided to shut down support for the Advantage Database Server, we’re forced to migrate the system to another (real) RDBMS.
We have pretty much already decided to go with PostgreSQL, since it supports a kind of namespacing (schemas) that would at least let us map the Windows folder paths onto something we can use to replicate the current layout.

ATM we’re discussing two major approaches for how to do this:

  1. Migrate the existing ADS tables one-to-one, as they appear in each of the folders, to PostgreSQL tables residing in a dedicated SCHEMA per folder path.
  2. Migrate the existing ADS tables into PostgreSQL master tables that carry a tenant reference key, add a mapping table in which the SCHEMA/folder names are mapped to a unique ID, and create per-tenant views on top of the master tables (see the sketch below this list).
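
To make option 2 concrete, here is a minimal sketch of what it could look like; all table, column, and schema names (tenant_map, invoice_master, tenant_0001) are made up for illustration and are not our real structures:

-- Hypothetical names, only to illustrate the master-table/view approach.
CREATE TABLE public.tenant_map (
    tenant_id   integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    folder_path text NOT NULL UNIQUE        -- the former Windows folder path
);

CREATE TABLE public.invoice_master (
    id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    tenant_id integer NOT NULL REFERENCES public.tenant_map (tenant_id),
    -- ... the business columns of the original ADS table ...
    amount    numeric(18, 2)
);
CREATE INDEX ON public.invoice_master (tenant_id);

-- One view per former folder path, placed in a per-tenant SCHEMA.
CREATE SCHEMA tenant_0001;
CREATE VIEW tenant_0001.invoice AS
    SELECT * FROM public.invoice_master
     WHERE tenant_id = 1;   -- the tenant_map ID of that folder path

A simple single-table view like this is automatically updatable in PostgreSQL, which is one of the reasons the view model appeals to me.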

Both of the approaches are already technically almost solved, and we’re able to migrate the existing systems either way.

There are still points to consider when deciding which path we should take.
– Some of our architects have a gut feeling that billions of rows in a single master table would have a significant performance impact on the SQL statements.
– Others (like me) argue that the view/master-table model would have several benefits for the overall efficiency of the DB design.

We’re already doing measurements (a sketch of the kind of measurement is below), but I’d like to hear some advice, experiences, or even other DB-design ideas about which of the approaches might scale better.
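
For illustration, the comparison I’ve been running server side boils down to the same statement executed once through a per-tenant view on the master table and once against a plain per-SCHEMA table; the names and the query below are the made-up ones from the sketch above, not our real workload:

-- Master-table/view variant (approach 2).
EXPLAIN (ANALYZE, BUFFERS)
SELECT sum(amount) FROM tenant_0001.invoice WHERE amount > 100;

-- Table-per-SCHEMA variant (hypothetical table for approach 1).
EXPLAIN (ANALYZE, BUFFERS)
SELECT sum(amount) FROM tenant_0001.invoice_plain WHERE amount > 100;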

Some edge data:

  • The overall system currently maintains ~1,000 basic table structures (some of them with 400+ columns).
  • We have customers with up to 10,000 multitenancy paths.
  • Some of those (master) tables might need to hold rows in the range of billions.
  • Since indexes are (seemingly) cheap in ADS, the existing ISAM tables sometimes carry a lot of them.

What we already noticed:

  • The feared performance impact of SQL statements run against master tables through views vs. single table objects is far from scaling linearly with the number of rows (I’ve been measuring this server side with representative EXPLAIN ANALYZE runs).
  • With a table per SCHEMA, the system catalog tends to grow big, especially pg_catalog.pg_attribute. Any operation that touches the catalogs (including the PostgreSQL query planner) might be hit by the sheer amount of catalog data (see the catalog query after this list).
  • Currently we use a single tablespace for all of the tables. While we could create a tablespace per virtual directory path, this seems to overcomplicate the DB design and might even raise other problems.
  • As mentioned before, a single tablespace for such an amount of relations doesn’t seem to scale well with PostgreSQL on an underlying NTFS Windows filesystem.
  • Regarding the underlying NTFS filesystem, we should keep in mind that a minimum allocation unit (cluster) is reserved for each file created, and PostgreSQL creates several files for every pg_class relation object, which includes tables, indexes, TOAST storage for large/blob fields, etc.
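
To put numbers on the catalog growth mentioned above, this is roughly how we check it; these are standard catalog views, nothing project-specific:

-- Number of column definitions the planner has to wade through.
SELECT count(*) AS attribute_rows FROM pg_catalog.pg_attribute;

-- The biggest catalog relations by total size.
SELECT relname,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
  FROM pg_catalog.pg_class
 WHERE relnamespace = 'pg_catalog'::regnamespace
 ORDER BY pg_total_relation_size(oid) DESC
 LIMIT 10;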

Maybe my observations here are a bit biased, but I’d like to ask about alternatives, or about absolute no-gos for either of the models we’re discussing.


Since my boss is a wise guy, and he knows that I am one of the architects supporting the view model, he gave me the homework of telling them what the cons of that model actually are.

Is it possible to debug 2 million lines of code? (The scene with Nedry in Jurassic Park)

In a famous scene with John Hammond and Nedry, Nedry says that he is the only one who can debug 2 million lines of code. After that scene, and out of frustration, I googled it, and some websites say that 2 million lines of code is comparable to Windows 3.1, an OS made with approximately 2.5 million lines of code.

Question:
Is it possible for only one person to code/debug 2 million lines of code, considering that the “Park” needs to open in order to earn income, make a profit, and pay the investors?

Scene: https://twitter.com/jurassicworld/status/870729372113686529?lang=en

Information: https://informationisbeautiful.net/visualizations/million-lines-of-code/

Please consider the age of the character in that scene and things like weekends, holidays, family time, and the possibility of the character temporarily getting sick. Also, a few minutes after that scene, John Hammond tells Alan Grant in the travel trailers that Nedry is conspiring with another character (Lewis Dodgson), so that time is not spent on coding or other production-related work; please account for that time in the same way as the examples above.


Optimizing MySQL Cascading Deletes on Millions of Records

The scenario is as follows. We use a production DB to create dev DBs for our developers to work off of. We import the latest prod DB backup into our prep MySQL server, run the MySQL pre-scramble queries (which delete old data to reduce the DB file size), run the scramble scripts to obfuscate the data, and then run the post-scramble MySQL script to do the final cleanup work.

The issue we are encountering is with the MySQL pre-scramble queries. Our prod DB is about 40 GB uncompressed and contains tens of millions of rows. During our query prep process we convert our RESTRICT FKs to CASCADE, delete the records we need to delete, then convert the FKs back to RESTRICT (a sketch of the conversion is below). We use cascades so that we only have to delete the root records and the rest is taken care of; it helps us trim the DB size down significantly so the dev doesn’t have a massive DB file to work with.
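
Per constraint, the conversion looks roughly like this; child_table, parent_table, and the constraint name are placeholders, not our real schema:

-- MySQL can't change a FK's ON DELETE action in place, so each FK is dropped and re-added.
ALTER TABLE child_table
    DROP FOREIGN KEY fk_child_parent;
ALTER TABLE child_table
    ADD CONSTRAINT fk_child_parent
        FOREIGN KEY (parent_id) REFERENCES parent_table (id)
        ON DELETE CASCADE;

-- ... root-record deletes run here ...

-- Afterwards the constraint is flipped back to RESTRICT the same way.
ALTER TABLE child_table
    DROP FOREIGN KEY fk_child_parent;
ALTER TABLE child_table
    ADD CONSTRAINT fk_child_parent
        FOREIGN KEY (parent_id) REFERENCES parent_table (id)
        ON DELETE RESTRICT;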

For reference, the server we are running this on is pretty beefy: 36 cores, 72 logical processors, and 191 GB of RAM, of which about 60 GB is free to use for MySQL.

I’ve already increased the server’s InnoDB buffer pool to 30 GB.

However, running tons of cascading deletes is incredibly slow; it takes about 8 hours to run this script. I’m looking for ways to increase the speed of these deletes.

It is my understanding that the following occurs during a MySQL delete:

  • find the affected rows (the same work as a SELECT)
  • log the operation for the found rows (write into the redo log files)
  • lock the found rows and mark them as deleted
  • wait if any other operations are using them
  • update the indexes

Is there anything I can tweak, settings-wise, to increase the speed beyond the buffer pool size?

Is it possible to disable the redo log files? (If a crash occurs on this server it doesn’t matter; it’s only being used to prep the dev DB.)

If I dropped all indexes prior to running the deletes, would that help? Some of these tables do have quite a few indexes.

We have already set up the deletes to run in batches of 1,000 so we don’t try to run them all at once (a minimal sketch of the batching is below).
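
Per root table the batching is essentially the following, repeated until no more rows are affected; the table and column names are placeholders:

-- Delete old root records in chunks of 1000; child rows go away via the cascades.
DELETE FROM root_table
 WHERE created_at < '2017-01-01'
 LIMIT 1000;
-- re-run until ROW_COUNT() reports 0 affected rows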

MySQL on this server is only used to create the dev DBs, so stability isn’t really of high importance; we just need to be able to create these DBs as fast as possible. Right now the entire process takes about 12 hours from start to finish, with 80% of that time spent in the delete queries.

Question: What is the best way to win the Powerball and Mega Millions lotteries?

Look, I’m 18 years old and I know it’s hard, but I like hard things and I want to win those lotteries
and their millions of dollars. I’m trying really hard because I live in NJ
and I want to move to Massachusetts in January 2020 or earlier.
What’s the best way to win?
Please give me some winning numbers to match the 5 numbers plus the ball of Powerball or Mega Millions. What’s the easiest way to win those 2 lotteries? Give me some good numbers to win the jackpot; I want millions of dollars, please, thank you!

How to find matched words from millions of rows faster in MySQL

I have a database table consisting of 30 million rows with three columns: id, postid, and word. The word column contains various non-English words. I have to find the matching words for a given word among all 30 million words. Here is a demo table:

(screenshot of the demo table not reproduced here)

For a given word, my query searches all of the rows and takes too much time. How can I minimize the query time for finding the matched words among these 30 million rows (words)? I am using MySQL.
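
For context, the structure and the kind of lookup I mean are roughly the following; the table name and column types are guesses for illustration, not my exact DDL:

CREATE TABLE words (
    id     INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    postid INT UNSIGNED NOT NULL,
    word   VARCHAR(255) CHARACTER SET utf8mb4 NOT NULL
    -- presumably no usable index on word yet, since the lookup scans all rows
);

-- Find the rows matching one given (non-English) word among the ~30 million rows.
SELECT id, postid, word
FROM words
WHERE word = 'उदाहरण';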

How to speed up query on table with millions of rows

The Issue:

I’m working on a big table that consists of about 37 million rows. The data are measurements from many devices, each made at a certain time, e.g. ‘2013-09-24 10:45:50’. Each day, all of those devices send many measurements at different intervals and times. I want to write a query that selects the most recent measurement of each day (‘most recent’ meaning the latest of all measurements made on that day) for each device, over two months, e.g. from 2013-01-01 to 2013-02-01. A sketch of the intended result is below.
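
To make the intended result concrete, here is one way to state it with ROW_NUMBER(), using the columns from the table definition further down; this is only a sketch of what the query should return, not the version I benchmarked:

DECLARE @startdate datetime = '2013-01-01';
DECLARE @enddate   datetime = '2013-02-01';

-- Latest measurement per device, per interface, per calendar day in the range.
SELECT *
FROM (
    SELECT m.*,
           ROW_NUMBER() OVER (
               PARTITION BY m.Device_Id, m.DeviceInterface, CONVERT(date, m.MeterDate)
               ORDER BY m.MeterDate DESC) AS rn
    FROM [dbo].[Measurements] AS m
    WHERE m.MeterDate >= @startdate AND m.MeterDate < @enddate
) AS x
WHERE x.rn = 1;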

The problem is that this query takes a long time to run, despite all of the indexes I’ve created on different columns. I’ve also created an auxiliary table that contains max(MeterDate) and the Id of the measurement row it came from. I assumed an index on MeterDate alone wouldn’t be useful because the column contains both date and time, so I converted it with CONVERT(date, MeterDate). I thought that after joining the auxiliary table with [dbo].[Measurements] the query would be faster, but the query still takes more than 12 s, which is too long for me.

The structure of the table:

CREATE TABLE [dbo].[Measurements]
(
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [ReadType_Id] [int] NOT NULL,
    [Device_Id] [int] NULL,
    [DeviceInterface] [tinyint] NULL,
    [MeterDate] [datetime] NULL,
    [MeasureValue] [decimal](18, 3) NULL
)

Every row of the Measurements table holds a measurement value taken at an exact MeterDate, e.g. “2008-04-04 13:28:44.473”.

The direct select looks like this:

DECLARE @startdate datetime= '2013-07-01'; 
DECLARE @enddate datetime = '2013-08-01';

SELECT *
FROM [dbo].[Measurements] 
WHERE [MeterDate] BETWEEN @startdate and @enddate 

Does anyone know how to restructure the table, or which columns to add new indexes on, to speed the query up a bit? Thanks in advance for any info.

Edit:

The auxiliary table I used was created by this query:

with t1 as
(
    Select  [Device_Id], [DeviceInterface],  CONVERT(date,  MeterDate) as OnlyDate, Max(MeterDate) as MaxMeterDate
    FROM [dbo].[Measurements] 
    GROUP BY [Device_Id], [DeviceInterface], CONVERT(date,  MeterDate)
)
Select t1.[Device_Id], t1.[DeviceInterface],t1.[OnlyDate], r.Id  
INTO [dbo].[MaxDatesMeasurements]
FROM t1
JOIN [dbo].[Measurements] as r ON r.Device_Id = t1.Device_Id AND r.DeviceInterface = t1.DeviceInterface AND r.MeterDate = t1.MaxMeterDate

Then I wanted to join the newly created table [dbo].[MaxDatesMeasurements] with the old [dbo].[Measurements] and select the matching rows:

DECLARE @startdate datetime= '2013-07-01'; 
DECLARE @enddate datetime = '2013-08-01'; 


Select *
From [dbo].[MaxDatesMeasurements] as t1 
Join [dbo].[Measurements] as t2 on t1.[Id] = t2.[Id] 
WHERE t1.[OnlyDate] BETWEEN @startdate AND @enddate