Discussion:
Fast Access of Large Data Set
(too old to reply)
Mike Copeland
2009-12-11 00:53:42 UTC
Permalink
I have a large data file that contains US ZIP codes and City & State
values. It has > 64K records (I can't load it into Excel), and I wish
to use it in a C/C++ application - I want to look up the City & State
for specific ZIP code values. I don't need any other information from
this file.
My question is: what's the best technique or language feature for
storing and accessing individual Cities & States via the ZIP code value
(10000-99999)? I would think a declared array of strings, stored and
accessed via the ZIP value won't work, as such an array would be too
large for the compiler (VS6.0). Given that, would an STL vector (or
something else) work well, both for storing and accessing? Please
advise. TIA
Stephan T. Lavavej [MSFT]
2009-12-11 03:00:28 UTC
Permalink
Post by Mike Copeland
I would think a declared array of strings, stored and
accessed via the ZIP value won't work, as such an array would be too
large for the compiler (VS6.0).
VC6 is ancient and no longer supported. You should use the latest version,
VC9 SP1 (VS 2008 SP1).
Post by Mike Copeland
Given that, would an STL vector (or
something else) work well, both for storing and accessing?
std::vector is an excellent sequence container (it imitates builtin arrays,
but is far smarter). It can be used for key-value lookups when your keys
are unsigned integers, compact, and small. (If you have negative keys,
sparsely distributed keys, or compactly distributed keys that start at a
hundred million, you shouldn't use std::vector.) As it happens, zip codes
are unsigned integers, reasonably compact, and reasonably small, so you
could use std::vector. However, std::map is probably a better choice, as
it's an associative container. (std::map is an ordered associative
container, implemented with balanced binary trees; VC9 SP1 contains
std::tr1::unordered_map which is implemented with hashes, but by default you
should prefer std::map unless you have a reason to want hashes.) Using a
real associative container is a good idea because then you'll be able to
extend your program if in the future you need to support ZIP+4 or some other
key.

Here's an example of reading zip codes from a data file and loading them
into a std::map. std::map lookups are O(log N) which is "very fast". (For
example, my file has 29470 lines, so each lookup involves traversing about
15 nodes, which is essentially instantaneous on modern computers, unless for
some reason you need to do this a billion times per second.)

C:\Temp>type meow.cpp
#include <stdlib.h>
#include <exception>
#include <fstream>
#include <iostream>
#include <map>
#include <ostream>
#include <regex>
#include <sstream>
#include <stdexcept>
#include <string>
using namespace std;
using namespace std::tr1;

unsigned int zip_from_str(const string& s) {
istringstream iss(s);

unsigned int ret = 0;

iss >> ret;

return ret;
}

typedef map<unsigned int, string> zip_map_t;
typedef zip_map_t::const_iterator zip_map_ci_t;

zip_map_t read_zip_codes() {
zip_map_t ret;

// http://www.census.gov/geo/www/gazetteer/gazette.html links to:
// http://www.census.gov/tiger/tms/gazetteer/zips.txt
// http://www.census.gov/tiger/tms/gazetteer/zip90r.txt

ifstream f("zips.txt");

if (!f) {
throw runtime_error("RUNTIME ERROR: Couldn't open zips.txt.");
}

const regex
r("[^,]*,\"([0-9]{5})\",\"([A-Z]{2})\",\"([^\",]+)\",[^,]*,[^,]*,[^,]*,[^,]*");

for (string s; getline(f, s); ) {
smatch m;

if (!regex_match(s, m, r)) {
throw runtime_error("RUNTIME ERROR: Unrecognized line: " + s);
}

ret[zip_from_str(m[1])] = m[3].str() + ", " + m[2].str();
}

return ret;
}

int main() {
try {
const zip_map_t m = read_zip_codes();

cout << "Enter ZIP codes:" << endl << endl;

const regex r("[0-9]{5}");

for (string s; getline(cin, s); ) {
if (s == "bye") {
break;
}

if (!regex_match(s, r)) {
cout << "That's not a ZIP code." << endl;
continue;
}

const zip_map_ci_t i = m.find(zip_from_str(s));

if (i == m.end()) {
cout << "ZIP code not found." << endl;
continue;
}

cout << "Location: " << i->second << endl << endl;
}
} catch (const exception& e) {
cout << "Exception: " << e.what() << endl;
return EXIT_FAILURE;
} catch (...) {
cout << "Unknown exception." << endl;
return EXIT_FAILURE;
}
}

C:\Temp>cl /EHsc /nologo /W4 /MT /O2 /GL meow.cpp
meow.cpp
Generating code
Finished generating code

C:\Temp>meow
Enter ZIP codes:

98052
Location: REDMOND, WA

90210
Location: BEVERLY HILLS, CA

02115
Location: BOSTON, MA

10001
Location: NEW YORK, NY

20008
Location: WASHINGTON, DC

30303
Location: ATLANTA, GA

40202
Location: LOUISVILLE, KY

50309
Location: DES MOINES, IA

60601
Location: CHICAGO, IL

77063
Location: HOUSTON, TX

80202
Location: DENVER, CO

99950
Location: KETCHIKAN, AK

bye

C:\Temp>

If for some reason you couldn't use a separate data file and had to embed
this information directly in a C++ source file, you could process your file
(using sed or some other regular expression engine) into a series of
statements of the form:

m[ 2115] = "BOSTON, MA";
m[10001] = "NEW YORK, NY";
m[20008] = "WASHINGTON, DC";

(Warning: you would need to remove leading zeros in order to avoid the wrath
of octal numbers.)

Or, if you really wanted to use std::vector<std::string> (which offers O(1)
lookups, but doesn't like sparsely distributed keys), you'd want:

std::vector<std::string> v(100000);
v[ 2115] = "BOSTON, MA";
v[10001] = "NEW YORK, NY";
v[20008] = "WASHINGTON, DC";

(When embedding the data in a C++ source file, you could use const char *
instead of std::string, which would avoid dynamically allocating a copy of
string literals that are in your executable to begin with.)

Also, you could use std::vector<std::pair<unsigned int, std::string> >
sorted by the unsigned int. This offers O(log N) lookups like std::map, but
occupies less memory, and is sometimes a good choice.

Stephan T. Lavavej
Visual C++ Libraries Developer
Post by Mike Copeland
I have a large data file that contains US ZIP codes and City & State
values. It has > 64K records (I can't load it into Excel), and I wish
to use it in a C/C++ application - I want to look up the City & State
for specific ZIP code values. I don't need any other information from
this file.
My question is: what's the best technique or language feature for
storing and accessing individual Cities & States via the ZIP code value
(10000-99999)? I would think a declared array of strings, stored and
accessed via the ZIP value won't work, as such an array would be too
large for the compiler (VS6.0). Given that, would an STL vector (or
something else) work well, both for storing and accessing? Please
advise. TIA
Cezary H. Noweta
2009-12-11 03:09:04 UTC
Permalink
Hello,
Post by Mike Copeland
My question is: what's the best technique or language feature for
storing and accessing individual Cities & States via the ZIP code value
(10000-99999)? I would think a declared array of strings, stored and
accessed via the ZIP value won't work, as such an array would be too
large for the compiler (VS6.0).
Why do you think that it will not work? It is quite small amount of
memory. Under my VS6 SP5+PP it works fine:

char **p;

// C++
p = new char *[90000];
delete[] p;

/* C */
p = malloc(sizeof(char *[90000]));
free(p);

p = malloc(90000 * sizeof(char *));
free(p);

-- best regards

Cezary Noweta
Mike Copeland
2009-12-11 18:25:24 UTC
Permalink
Post by Cezary H. Noweta
Post by Mike Copeland
My question is: what's the best technique or language feature for
storing and accessing individual Cities & States via the ZIP code value
(10000-99999)? I would think a declared array of strings, stored and
accessed via the ZIP value won't work, as such an array would be too
large for the compiler (VS6.0).
Why do you think that it will not work? It is quite small amount of
char **p;
// C++
p = new char *[90000];
delete[] p;
/* C */
p = malloc(sizeof(char *[90000]));
free(p);
p = malloc(90000 * sizeof(char *));
free(p);
-- best regards
Cezary Noweta
I didn't express myself clearly enough 8<{{
I need to declare an array of strings (char[17]) that each contain
the City-State strings for each ZIP code. I would need ~90000 such
strings and wish to quickly access each one based on the ZIP code
(integer) value. Thus, I want to use 85004 to "find" the string
"Phoenix AZ" that I'd have in that array location. I'd expect a
90,000 strings array as a single declaration to not compile (though I
haven't tried it...). And I was wondering what other technique I could
use.
Scott McPhillips [MVP]
2009-12-11 20:46:01 UTC
Permalink
Post by Mike Copeland
Post by Cezary H. Noweta
Post by Mike Copeland
My question is: what's the best technique or language feature for
storing and accessing individual Cities & States via the ZIP code value
(10000-99999)? I would think a declared array of strings, stored and
accessed via the ZIP value won't work, as such an array would be too
large for the compiler (VS6.0).
Why do you think that it will not work? It is quite small amount of
char **p;
// C++
p = new char *[90000];
delete[] p;
/* C */
p = malloc(sizeof(char *[90000]));
free(p);
p = malloc(90000 * sizeof(char *));
free(p);
-- best regards
Cezary Noweta
I didn't express myself clearly enough 8<{{
I need to declare an array of strings (char[17]) that each contain
the City-State strings for each ZIP code. I would need ~90000 such
strings and wish to quickly access each one based on the ZIP code
(integer) value. Thus, I want to use 85004 to "find" the string
"Phoenix AZ" that I'd have in that array location. I'd expect a
90,000 strings array as a single declaration to not compile (though I
haven't tried it...). And I was wondering what other technique I could
use.
It will work and is not even remotely "too large for the compiler." The
size of the data is not a problem. The interesting design issue is to make
the search efficient, and you received some good advice on that.

Just do it.
--
Scott McPhillips [VC++ MVP]
Alexander Grigoriev
2009-12-12 02:26:11 UTC
Permalink
"Scott McPhillips [MVP]" <org-dot-mvps-at-scottmcp> wrote in message >>
Post by Scott McPhillips [MVP]
Post by Mike Copeland
I didn't express myself clearly enough 8<{{
I need to declare an array of strings (char[17]) that each contain
the City-State strings for each ZIP code. I would need ~90000 such
strings and wish to quickly access each one based on the ZIP code
(integer) value. Thus, I want to use 85004 to "find" the string
"Phoenix AZ" that I'd have in that array location. I'd expect a
90,000 strings array as a single declaration to not compile (though I
haven't tried it...). And I was wondering what other technique I could
use.
It will work and is not even remotely "too large for the compiler." The
size of the data is not a problem. The interesting design issue is to
make the search efficient, and you received some good advice on that.
Just do it.
--
Scott McPhillips [VC++ MVP]
I think array indexing by 5 digit ZIP doesn't have any problems with
efficiency.
Brian Muth
2009-12-11 22:29:10 UTC
Permalink
Stephan gave you a very detailed answer.
Tim Roberts
2009-12-12 05:22:05 UTC
Permalink
Post by Mike Copeland
I didn't express myself clearly enough 8<{{
I need to declare an array of strings (char[17]) that each contain
the City-State strings for each ZIP code. I would need ~90000 such
strings and wish to quickly access each one based on the ZIP code
(integer) value. Thus, I want to use 85004 to "find" the string
"Phoenix AZ" that I'd have in that array location. I'd expect a
90,000 strings array as a single declaration to not compile (though I
haven't tried it...).
Have you done the math? That's only 1.5 megabytes. The compiler isn't
even going to break a sweat.

Even so, I'd be tempted to put the data in a file, allocate the data
structure dynamically, and read it in.

From painful experience, I know that 17 characters is not nearly long
enough to hold all of the city/state names in this country.

1 1 2 2 3
----5----0----5----0----5----0
King and Queen Court House, VA
Rancho Santa Margarita, CA
Truth or Consequences, NM
South Salt Lake City, UT
Inver Grove Heights, MN
South San Francisco, CA
--
Tim Roberts, ***@probo.com
Providenza & Boekelheide, Inc.
Loading...