Migrating US Census GDB file to PostGIS — ERROR: tables can have at most 1600 columns

I’m working with data from the U.S. Census. Specifically, I’ve downloaded GDB files from here.

I have the GDB file ACS_2016_5YR_TRACT.gdb on my server, and I’m attempting to create a PostgreSQL table with the following:

ogr2ogr -f "PostgreSQL" PG:"host=host port=5432 dbname=db user=user password=password" ACS_2016_5YR_TRACT.gdb -overwrite -progress --config PG_USE_COPY YES

Running the above generates a CREATE TABLE statement with over 1600 columns, which fails to execute with:

ERROR:  tables can have at most 1600 columns

Has anybody had success loading ACS 5-year data into PostGIS? How can I get around this?

I have to imagine this is a common use case for people working with census data. Is there a way to pass flags to ogr2ogr that would make it partition the data? Alternatively, is there a way to raise the column limit in PostgreSQL? I’ve tried changing the build settings and recompiling, as suggested in other answers, but haven’t had much luck, and I’m not sure that’s the best way to go about it anyway.
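To illustrate the partitioning idea, here’s a rough sketch of what I’ve been considering: enumerating the layers with the GDAL Python bindings and loading them one at a time, so each layer becomes its own table. This is untested, and I assume it wouldn’t help with any single layer that itself exceeds 1600 columns:

import subprocess
from osgeo import ogr  # GDAL Python bindings

GDB = "ACS_2016_5YR_TRACT.gdb"
PG = "PG:host=host port=5432 dbname=db user=user password=password"

ds = ogr.Open(GDB)
for i in range(ds.GetLayerCount()):
    layer_name = ds.GetLayerByIndex(i).GetName()
    # Load one layer per ogr2ogr invocation instead of the whole GDB at once.
    subprocess.run(
        ["ogr2ogr", "-f", "PostgreSQL", PG, GDB, layer_name,
         "-overwrite", "--config", "PG_USE_COPY", "YES"],
        check=True,
    )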

Multiply two columns of Census data and groupby

I have census data that looks like this

    State    County   TotalPop  Hispanic  White  Black  Native  Asian  Pacific
    Alabama  Autauga      1948       0.9   87.4    7.7     0.3    0.6      0.0
    Alabama  Autauga      2156       0.8   40.4   53.3     0.0    2.3      0.0
    Alabama  Autauga      2968       0.0   74.5   18.6     0.5    1.4      0.3
    ...

Two things to note: (1) there can be multiple rows per county, and (2) the racial data is given as percentages, but sometimes I want the actual population counts.

Getting the total racial population translates to (in pseudo-pandas):

(census.TotalPop * census.Hispanic / 100).groupby("State").sum()

But this raises KeyError: 'State', because the product of TotalPop and Hispanic is a pandas Series, not the original DataFrame, so there is no 'State' column to group by.

As suggested by this Stack Overflow question, I can create a new column for each race…

census["HispanicPop"] = census.TotalPop * census.Hispanic / 100

This works, but it feels messy: it adds six columns unnecessarily when I only need the data for one plot. Here is the data (I’m using “acs2015_census_tract_data.csv”) and here is my implementation:

Working Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline

census = pd.read_csv("data/acs2015_census_tract_data.csv")

races = ['Hispanic', 'White', 'Black', 'Native', 'Asian', 'Pacific']

# Creating a total population column for each race
# FIXME: this feels inefficient.  Does Pandas have another option?
for race in races:
    census[race + "_pop"] = (census[race] * census.TotalPop) / 100

# current racial population being plotted
race = races[0]

# Sum the populations in each state
race_pops = census.groupby("State")[race + "_pop"].sum().sort_values(ascending=False)

#### Plotting the results for each state

fig, axarr = plt.subplots(2, 2, figsize=(18, 12))
fig.suptitle("{} population in all 52 states".format(race), fontsize=18)

# Splitting the plot into 4 subplots so I can fit all 52 States
data = race_pops.head(13)
sns.barplot(x=data.values, y=data.index, ax=axarr[0][0])

data = race_pops.iloc[13:26]
sns.barplot(x=data.values, y=data.index, ax=axarr[0][1]).set(ylabel="")

data = race_pops.iloc[26:39]
sns.barplot(x=data.values, y=data.index, ax=axarr[1][0])

data = race_pops.tail(13)
_ = sns.barplot(x=data.values, y=data.index, ax=axarr[1][1]).set(ylabel="")
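What I was hoping existed is something like passing a Series directly as the grouper, so the derived populations never have to be added as columns. A sketch of what I mean (pandas does accept a Series here, aligned on the index):

import pandas as pd

census = pd.read_csv("data/acs2015_census_tract_data.csv")

# Group the derived Series by the State column directly; no helper
# columns are added to the DataFrame.
hispanic_by_state = (
    (census.TotalPop * census.Hispanic / 100)
    .groupby(census.State)
    .sum()
    .sort_values(ascending=False)
)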

Subset pandas DataFrame based on two columns ignoring what order the match happens in

I have two pandas DataFrames and I want to subset df_all based on the values in to_keep. Unfortunately this isn’t a straightforward pd.merge() or df.join(), because I want to match on multiple columns and I don’t care in what order the match happens:

  • I don’t care whether df_all['from'] matches in to_keep['source'] OR to_keep['target'],
  • and likewise whether df_all['to'] matches in to_keep['source'] OR to_keep['target'].

What I have below currently works, but it seems like a lot of work, and I hope the operation can be optimized.


import pandas as pd
import numpy as np

# create sample dataframe
df_all = pd.DataFrame({'from': ['a', 'a', 'b', 'a', 'b', 'c', 'd', 'd', 'd'], 
                       'to': ['b', 'b', 'a', 'c', 'c', 'd', 'c', 'f', 'e'], 
                       'time': np.random.randint(50, size=9),
                       'category': np.random.randn(9)
                       })

# create a key based on from & to
df_all['key'] = df_all['from'] + '-' + df_all['to']

df_all

    category    from    time    to  key
0   0.374312    a   38  b   a-b
1   -0.425700   a   0   b   a-b
2   0.928008    b   34  a   b-a
3   -0.160849   a   44  c   a-c
4   0.462712    b   4   c   b-c
5   -0.223074   c   33  d   c-d
6   -0.778988   d   47  c   d-c
7   -1.392306   d   0   f   d-f
8   0.910363    d   34  e   d-e

# create another sample dataframe
to_keep = pd.DataFrame({'source': ['a', 'a', 'b'], 
                        'target': ['b', 'c', 'c'] 
                       })

to_keep

    source  target
0   a   b
1   a   c
2   b   c

# create a copy of to_keep
to_keep_flipped = to_keep.copy()

# flip source and target column names
to_keep_flipped.rename(columns={'source': 'target', 'target': 'source'}, inplace=True)

# extend to_keep with flipped version
to_keep_all = pd.concat([to_keep, to_keep_flipped], ignore_index=True)

to_keep_all

    source  target
0   a   b
1   a   c
2   b   c
3   b   a
4   c   a
5   c   b

# create a key based on source & target
keys = to_keep_all['source'] + '-' + to_keep_all['target']

keys

0    a-b
1    a-c
2    b-c
3    b-a
4    c-a
5    c-b
dtype: object

df_all[df_all['key'].isin(keys)]

    category    from    time    to  key
0   0.374312    a   38  b   a-b
1   -0.425700   a   0   b   a-b
2   0.928008    b   34  a   b-a
3   -0.160849   a   44  c   a-c
4   0.462712    b   4   c   b-c
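One idea I’ve been toying with, in case it helps frame what I’m after: building an order-insensitive key with frozenset instead of concatenating strings, which would make the flipped copy of to_keep unnecessary. A sketch (only checked against the sample data above):

# frozenset({'a', 'b'}) compares equal for a-b and b-a, so one key
# covers both orderings.
pairs = df_all[['from', 'to']].apply(frozenset, axis=1)
wanted = to_keep[['source', 'target']].apply(frozenset, axis=1)

df_all[pairs.isin(wanted)]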

Sum of the multiplication of all columns of one matrix (n-by-m) with another matrix (n-by-n)

I hope the title is self-explanatory; I tried to word it so that others can find this question too. I know how to carry out the operation with a loop, but it must be quicker using some kind of matrix multiplication, which I am keen to learn.

The code with the loop looks something like this

x <- matrix(rexp(300, rate = .1), nrow = 20)
y <- matrix(rexp(400, rate = .1), nrow = 20)

res <- as.data.frame(matrix(0, ncol = 15, nrow = 20))

for (i in 1:20) {
  res <- res + x * y[, i]
}
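
Writing out what the loop computes: res[r, c] = sum_i x[r, c] * y[r, i] = x[r, c] * sum_i y[r, i], i.e. each row of x is just scaled by the corresponding row sum of y. Here is that identity sketched in numpy, since that’s where I checked it; I’m still after the idiomatic R version:

import numpy as np

x = np.random.exponential(scale=10, size=(20, 15))
y = np.random.exponential(scale=10, size=(20, 20))

# Loop version, mirroring the R code above.
res_loop = np.zeros((20, 15))
for i in range(20):
    res_loop += x * y[:, [i]]

# Vectorized version: the sum over y's columns factors out row-wise.
res_vec = x * y.sum(axis=1, keepdims=True)

assert np.allclose(res_loop, res_vec)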

R Shiny: All the columns have class Character when rendered in a ShinyApp after converting formattable output…

After converting a formattable output to a datatable using the as.datatable function, I can still filter, but all the columns have class character when rendered in the Shiny app. That means sorting is lexicographic: 700 sorts above 6000, 9 above 10, etc., just because the values are not treated as numeric.
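To make the symptom concrete, this is ordinary lexicographic (string) ordering. A quick Python illustration of the same effect:

# Character sort compares digit by digit, so "700" lands after "6000".
print(sorted(["700", "6000", "9", "10"]))   # ['10', '6000', '700', '9']

# Numeric sort, which is what the table should be doing.
print(sorted([700, 6000, 9, 10]))           # [9, 10, 700, 6000]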

Sample code for testing:

    #libraries
    library(data.table)
    library(formattable)
    library(shiny)

    # With small two-digit values (9.1 vs 8.1) the problem is hard to see even under character sorting, so inflate the numbers by multiplying by another column.
    iris$Sepal.Width <- iris$Sepal.Width*iris$Petal.Length

    #creating UI
    ui <- fluidPage(
      DT::dataTableOutput("table1"))

    #creating server
    server <- function(input, output){
      output$table1 <- DT::renderDataTable( 
        as.datatable(formattable(iris)))
    }

    # run the app
    shinyApp(ui, server)

Observation: when sorting the Sepal.Width column in descending order in the Shiny UI, a 9.x value appears at the top, whereas 25.46 should be first.

Note: select the “Show 100 entries” option in the app, then sort, to see the issue more clearly.

Everything works correctly in plain R but fails in the Shiny app.

How to create test cases in order to test columns in an Excel file?

I have to test an Excel file which is generated by the system. The file has around 40 columns. Do I have to create a separate test case for each column?

The file contains payment records: bank details, account numbers, amounts, etc.

Can’t I check them in a single test case, or a few, without writing a test case for every column? Please explain.
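For context, what I had in mind was something data-driven: a single parameterized test that checks every column against a per-column rule, rather than 40 hand-written cases. A rough Python/pytest sketch; the file name and the rules here are made up, just to show the shape:

import pandas as pd
import pytest

# Hypothetical per-column validation rules; the real file would have an
# entry for each of its ~40 columns.
RULES = {
    "account_number": lambda s: s.str.fullmatch(r"\d{10}").all(),
    "bank_code":      lambda s: s.notna().all(),
    "amount":         lambda s: (pd.to_numeric(s, errors="coerce") > 0).all(),
}

@pytest.fixture(scope="module")
def payments():
    return pd.read_excel("payments.xlsx", dtype=str)  # hypothetical file name

@pytest.mark.parametrize("column", RULES)
def test_column_is_valid(payments, column):
    assert RULES[column](payments[column]), f"column {column!r} failed validation"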