![]() ![]() ![]() |
|
Introduction
PostgreSQL's "COPY" statement is an excellent way to load large amounts of data quickly into a database. The main PostgreSQL documentation is the definitive guide, and I recommend studying it. The documentation is quite easy to understand, once you know it :-).
If you are like me, it probably looks a little forbidding at first.
I hope the following notes, based on my (sometimes painful) experiences
Basic basics Here is the syntax for COPY, from the main docs: COPY [ BINARY ] table [ WITH OIDS ] COPY will be run by the PostgreSQL backend (user "postgres"). The backend user requires permissions to read & write to the data file in order to copy from/to it, and needs to be able to find it. Therefore you need to use an absolute pathname. You will also need to have insert/update or select permission on the table in order to COPY to or from it. Assuming these requirements are met, you are ready to happily populate your database by doing: COPY bleah FROM 'blah'; from the psql prompt. TIP: Try creating a test table, inserting some data like that which you want to import. Then COPY the content to a file and inspect the results. This will tell you a lot about what COPY wants. Problems COPY is not terribly smart. All it really does is split each line using the appropriate delimiters, and attempt to line up each field in your data with each field in the table, counting from the left. If the parser accepts your data, in it goes. No parsing is done by COPY itself, so the data is handed off in just the state you have it. I find problems usually fall into the following (somewhat overlapping) categories: Data type mismatches NULL confusion Rogue characters Embedded delimiters
If you are importing from Windows, be aware that some Windows-based software will accept spaces into number or date fields, and these may exist in your file. Other non-printable characters can be even worse, because they don't take up space, and are therefore practically invisible. If your COPY fails for this reason, you can find them by doing something like this: grep 'search pattern' datafile | vis -lw This will show you every non-printable character in the affected lines.
For example, if your integer looks like this: 1,203,327 it will not be accepted. The commas will need to be edited out. PostgreSQL
recognizes a variety of date/time formats. If you are unsure about yours,
test a sample. If your data file is large, using an interactive editor is not a good idea. Most times, your edits will require some scripting. If you are not already familiar with scripting in tools like sh, sed, awk or perl, time spent on learning will be rewarded. You can do a lot with a little knowledge. Coverage of these methods is beyond the scope of this little article (you probably know more than I do anyway!).
You can avoid these altogether by moving your data files via FTP, using the "ascii" transfer method. This will automatically adjust the line endings for you. You can just delete the carriage returns using a simple script: " tr -d '\r' < datafile " should work just fine. I like to use GNU recode, which has lots of other nifty functionality as well. The command for that is recode /cl datafile. There are lots of other possible methods, mainly similar to these. Choose according to your taste. My terminal emulator (SecureCRT) allows me to copy and paste quite large amounts of data directly from my Windows workstation, using "COPY FROM stdin;". It even renders MS Excel cells into lines of tab delimited text, with appropriate line endings! If you are using a terminal emulator, you may want to investigate its capabilities. Rogue characters Your choice is to delete them, perhaps with an innocuous replacement,
or to escape them. This requires inserting a backslash before each offending
character. If you really want to keep your data just the way Others Other situations, such as embedded delimiters or incorrect data formats, require that you find some pattern in your data which will allow you to identify precisely what needs to be changed. i.e. You will need to delete the commas in "1,203,327", but you probably don't want to delete all of the commas in the whole line. Today's popular scripting languages offer good methods for doing this kind of thing. Other Bits & Quirks You can also use the psql client version of COPY, which is "\copy". The syntax is slightly different: Being a psql command, it is not terminated by a semicolon The behaviour of COPY is not symmetrical. If you COPY data into a table already containing data, the new data will be appended. If you COPY TO a file already containing data, the existing data will
be If you are doing repeated tests of your COPY, be careful you do not duplicate the data in your table. The quickest way to clean out a table after a test is "TRUNCATE tablename;". Default values are not filled in by COPY, as an INSERT would do. You will need to explicitly supply a value for every field. If you are using a "serial" type field, you will need to reserve some numbers by resetting the sequence value: SELECT setval ('sequence_name', new_value); You can then insert the reserved numbers into your data file, and COPY away. Indexes on your table will slow down your COPY too, perhaps by an order of magnitude. For a large data file, you may be better off dropping your indexes and recreating them afterwards. A script is an easy way to do this. Because each COPY runs as a single transaction, an extremely large data file might lead to performance problems. If in doubt, just split your file into chunks. When editing, I often like to do a one-shot, like this: sed -e 'edit_script' data_file | psql -c " copy mydata from stdin with null as '' " -d mydb&
When loading a dump file generated from a different RDBMS, containing INSERT statements, you may be able to get it to load with little or no editing. However, there may be some difficulties with large dump files:
If you want to try this, here is a Perl script which works for me. Check the man page of psql for more information. |
|
Maintainted by Wu Xin, CBI, Peking University, China, 2003 |