Large Objects large object BLOBlarge object PostgreSQL has a large object facility, which provides stream-style access to user data that is stored in a special large-object structure. Streaming access is useful when working with data values that are too large to manipulate conveniently as a whole. This chapter describes the implementation and the programming and query language interfaces to PostgreSQL large object data. We use the libpq C library for the examples in this chapter, but most programming interfaces native to PostgreSQL support equivalent functionality. Other interfaces might use the large object interface internally to provide generic support for large values. This is not described here. Introduction TOAST versus large objects All large objects are stored in a single system table named pg_largeobject. Each large object also has an entry in the system table pg_largeobject_metadata. Large objects can be created, modified, and deleted using a read/write API that is similar to standard operations on files. PostgreSQL also supports a storage system called TOAST, which automatically stores values larger than a single database page into a secondary storage area per table. This makes the large object facility partially obsolete. One remaining advantage of the large object facility is that it allows values up to 4 TB in size, whereas TOASTed fields can be at most 1 GB. Also, reading and updating portions of a large object can be done efficiently, while most operations on a TOASTed field will read or write the whole value as a unit. Implementation Features The large object implementation breaks large objects up into chunks and stores the chunks in rows in the database. A B-tree index guarantees fast searches for the correct chunk number when doing random access reads and writes. The chunks stored for a large object do not have to be contiguous. For example, if an application opens a new large object, seeks to offset 1000000, and writes a few bytes there, this does not result in allocation of 1000000 bytes worth of storage; only of chunks covering the range of data bytes actually written. A read operation will, however, read out zeroes for any unallocated locations preceding the last existing chunk. This corresponds to the common behavior of sparsely allocated files in Unix file systems. As of PostgreSQL 9.0, large objects have an owner and a set of access permissions, which can be managed using and . SELECT privileges are required to read a large object, and UPDATE privileges are required to write or truncate it. Only the large object's owner (or a database superuser) can delete, comment on, or change the owner of a large object. To adjust this behavior for compatibility with prior releases, see the run-time parameter. Client Interfaces This section describes the facilities that PostgreSQL's libpq client interface library provides for accessing large objects. The PostgreSQL large object interface is modeled after the Unix file-system interface, with analogues of open, read, write, lseek, etc. All large object manipulation using these functions must take place within an SQL transaction block, since large object file descriptors are only valid for the duration of a transaction. If an error occurs while executing any one of these functions, the function will return an otherwise-impossible value, typically 0 or -1. A message describing the error is stored in the connection object and can be retrieved with . Client applications that use these functions should include the header file libpq/libpq-fs.h and link with the libpq library. Creating a Large Object lo_creat The function Oid lo_creat(PGconn *conn, int mode); creates a new large object. The return value is the OID that was assigned to the new large object, or InvalidOid (zero) on failure. mode is unused and ignored as of PostgreSQL 8.1; however, for backward compatibility with earlier releases it is best to set it to INV_READ, INV_WRITE, or INV_READ | INV_WRITE. (These symbolic constants are defined in the header file libpq/libpq-fs.h.) An example: inv_oid = lo_creat(conn, INV_READ|INV_WRITE); lo_create The function Oid lo_create(PGconn *conn, Oid lobjId); also creates a new large object. The OID to be assigned can be specified by lobjId; if so, failure occurs if that OID is already in use for some large object. If lobjId is InvalidOid (zero) then lo_create assigns an unused OID (this is the same behavior as lo_creat). The return value is the OID that was assigned to the new large object, or InvalidOid (zero) on failure. lo_create is new as of PostgreSQL 8.1; if this function is run against an older server version, it will fail and return InvalidOid. An example: inv_oid = lo_create(conn, desired_oid); Importing a Large Object lo_import To import an operating system file as a large object, call Oid lo_import(PGconn *conn, const char *filename); filename specifies the operating system name of the file to be imported as a large object. The return value is the OID that was assigned to the new large object, or InvalidOid (zero) on failure. Note that the file is read by the client interface library, not by the server; so it must exist in the client file system and be readable by the client application. lo_import_with_oid The function Oid lo_import_with_oid(PGconn *conn, const char *filename, Oid lobjId); also imports a new large object. The OID to be assigned can be specified by lobjId; if so, failure occurs if that OID is already in use for some large object. If lobjId is InvalidOid (zero) then lo_import_with_oid assigns an unused OID (this is the same behavior as lo_import). The return value is the OID that was assigned to the new large object, or InvalidOid (zero) on failure. lo_import_with_oid is new as of PostgreSQL 8.4 and uses lo_create internally which is new in 8.1; if this function is run against 8.0 or before, it will fail and return InvalidOid. Exporting a Large Object lo_export To export a large object into an operating system file, call int lo_export(PGconn *conn, Oid lobjId, const char *filename); The lobjId argument specifies the OID of the large object to export and the filename argument specifies the operating system name of the file. Note that the file is written by the client interface library, not by the server. Returns 1 on success, -1 on failure. Opening an Existing Large Object lo_open To open an existing large object for reading or writing, call int lo_open(PGconn *conn, Oid lobjId, int mode); The lobjId argument specifies the OID of the large object to open. The mode bits control whether the object is opened for reading (INV_READ), writing (INV_WRITE), or both. (These symbolic constants are defined in the header file libpq/libpq-fs.h.) lo_open returns a (non-negative) large object descriptor for later use in lo_read, lo_write, lo_lseek, lo_lseek64, lo_tell, lo_tell64, lo_truncate, lo_truncate64, and lo_close. The descriptor is only valid for the duration of the current transaction. On failure, -1 is returned. The server currently does not distinguish between modes INV_WRITE and INV_READ | INV_WRITE: you are allowed to read from the descriptor in either case. However there is a significant difference between these modes and INV_READ alone: with INV_READ you cannot write on the descriptor, and the data read from it will reflect the contents of the large object at the time of the transaction snapshot that was active when lo_open was executed, regardless of later writes by this or other transactions. Reading from a descriptor opened with INV_WRITE returns data that reflects all writes of other committed transactions as well as writes of the current transaction. This is similar to the behavior of REPEATABLE READ versus READ COMMITTED transaction modes for ordinary SQL SELECT commands. lo_open will fail if SELECT privilege is not available for the large object, or if INV_WRITE is specified and UPDATE privilege is not available. (Prior to PostgreSQL 11, these privilege checks were instead performed at the first actual read or write call using the descriptor.) These privilege checks can be disabled with the run-time parameter. An example: inv_fd = lo_open(conn, inv_oid, INV_READ|INV_WRITE); Writing Data to a Large Object lo_write The function int lo_write(PGconn *conn, int fd, const char *buf, size_t len); writes len bytes from buf (which must be of size len) to large object descriptor fd. The fd argument must have been returned by a previous lo_open. The number of bytes actually written is returned (in the current implementation, this will always equal len unless there is an error). In the event of an error, the return value is -1. Although the len parameter is declared as size_t, this function will reject length values larger than INT_MAX. In practice, it's best to transfer data in chunks of at most a few megabytes anyway. Reading Data from a Large Object lo_read The function int lo_read(PGconn *conn, int fd, char *buf, size_t len); reads up to len bytes from large object descriptor fd into buf (which must be of size len). The fd argument must have been returned by a previous lo_open. The number of bytes actually read is returned; this will be less than len if the end of the large object is reached first. In the event of an error, the return value is -1. Although the len parameter is declared as size_t, this function will reject length values larger than INT_MAX. In practice, it's best to transfer data in chunks of at most a few megabytes anyway. Seeking in a Large Object lo_lseek To change the current read or write location associated with a large object descriptor, call int lo_lseek(PGconn *conn, int fd, int offset, int whence); This function moves the current location pointer for the large object descriptor identified by fd to the new location specified by offset. The valid values for whence are SEEK_SET (seek from object start), SEEK_CUR (seek from current position), and SEEK_END (seek from object end). The return value is the new location pointer, or -1 on error. lo_lseek64 When dealing with large objects that might exceed 2GB in size, instead use pg_int64 lo_lseek64(PGconn *conn, int fd, pg_int64 offset, int whence); This function has the same behavior as lo_lseek, but it can accept an offset larger than 2GB and/or deliver a result larger than 2GB. Note that lo_lseek will fail if the new location pointer would be greater than 2GB. lo_lseek64 is new as of PostgreSQL 9.3. If this function is run against an older server version, it will fail and return -1. Obtaining the Seek Position of a Large Object lo_tell To obtain the current read or write location of a large object descriptor, call int lo_tell(PGconn *conn, int fd); If there is an error, the return value is -1. lo_tell64 When dealing with large objects that might exceed 2GB in size, instead use pg_int64 lo_tell64(PGconn *conn, int fd); This function has the same behavior as lo_tell, but it can deliver a result larger than 2GB. Note that lo_tell will fail if the current read/write location is greater than 2GB. lo_tell64 is new as of PostgreSQL 9.3. If this function is run against an older server version, it will fail and return -1. Truncating a Large Object lo_truncate To truncate a large object to a given length, call int lo_truncate(PGcon *conn, int fd, size_t len); This function truncates the large object descriptor fd to length len. The fd argument must have been returned by a previous lo_open. If len is greater than the large object's current length, the large object is extended to the specified length with null bytes ('\0'). On success, lo_truncate returns zero. On error, the return value is -1. The read/write location associated with the descriptor fd is not changed. Although the len parameter is declared as size_t, lo_truncate will reject length values larger than INT_MAX. lo_truncate64 When dealing with large objects that might exceed 2GB in size, instead use int lo_truncate64(PGcon *conn, int fd, pg_int64 len); This function has the same behavior as lo_truncate, but it can accept a len value exceeding 2GB. lo_truncate is new as of PostgreSQL 8.3; if this function is run against an older server version, it will fail and return -1. lo_truncate64 is new as of PostgreSQL 9.3; if this function is run against an older server version, it will fail and return -1. Closing a Large Object Descriptor lo_close A large object descriptor can be closed by calling int lo_close(PGconn *conn, int fd); where fd is a large object descriptor returned by lo_open. On success, lo_close returns zero. On error, the return value is -1. Any large object descriptors that remain open at the end of a transaction will be closed automatically. Removing a Large Object lo_unlink To remove a large object from the database, call int lo_unlink(PGconn *conn, Oid lobjId); The lobjId argument specifies the OID of the large object to remove. Returns 1 if successful, -1 on failure. Server-Side Functions Server-side functions tailored for manipulating large objects from SQL are listed in . SQL-Oriented Large Object Functions Function Description Example(s) lo_from_bytea lo_from_bytea ( loid oid, data bytea ) oid Creates a large object and stores data in it. If loid is zero then the system will choose a free OID, otherwise that OID is used (with an error if some large object already has that OID). On success, the large object's OID is returned. lo_from_bytea(0, '\xffffff00') 24528 lo_put lo_put ( loid oid, offset bigint, data bytea ) void Writes data starting at the given offset within the large object; the large object is enlarged if necessary. lo_put(24528, 1, '\xaa') lo_get lo_get ( loid oid , offset bigint, length integer ) bytea Extracts the large object's contents, or a substring thereof. lo_get(24528, 0, 3) \xffaaff
There are additional server-side functions corresponding to each of the client-side functions described earlier; indeed, for the most part the client-side functions are simply interfaces to the equivalent server-side functions. The ones just as convenient to call via SQL commands are lo_creatlo_creat, lo_create, lo_unlinklo_unlink, lo_importlo_import, and lo_exportlo_export. Here are examples of their use: CREATE TABLE image ( name text, raster oid ); SELECT lo_creat(-1); -- returns OID of new, empty large object SELECT lo_create(43213); -- attempts to create large object with OID 43213 SELECT lo_unlink(173454); -- deletes large object with OID 173454 INSERT INTO image (name, raster) VALUES ('beautiful image', lo_import('/etc/motd')); INSERT INTO image (name, raster) -- same as above, but specify OID to use VALUES ('beautiful image', lo_import('/etc/motd', 68583)); SELECT lo_export(image.raster, '/tmp/motd') FROM image WHERE name = 'beautiful image'; The server-side lo_import and lo_export functions behave considerably differently from their client-side analogs. These two functions read and write files in the server's file system, using the permissions of the database's owning user. Therefore, by default their use is restricted to superusers. In contrast, the client-side import and export functions read and write files in the client's file system, using the permissions of the client program. The client-side functions do not require any database privileges, except the privilege to read or write the large object in question. It is possible to use of the server-side lo_import and lo_export functions to non-superusers, but careful consideration of the security implications is required. A malicious user of such privileges could easily parlay them into becoming superuser (for example by rewriting server configuration files), or could attack the rest of the server's file system without bothering to obtain database superuser privileges as such. Access to roles having such privilege must therefore be guarded just as carefully as access to superuser roles. Nonetheless, if use of server-side lo_import or lo_export is needed for some routine task, it's safer to use a role with such privileges than one with full superuser privileges, as that helps to reduce the risk of damage from accidental errors. The functionality of lo_read and lo_write is also available via server-side calls, but the names of the server-side functions differ from the client side interfaces in that they do not contain underscores. You must call these functions as loread and lowrite.
Example Program is a sample program which shows how the large object interface in libpq can be used. Parts of the program are commented out but are left in the source for the reader's benefit. This program can also be found in src/test/examples/testlo.c in the source distribution. Large Objects with <application>libpq</application> Example Program #include #include #include #include #include #include "libpq-fe.h" #include "libpq/libpq-fs.h" #define BUFSIZE 1024 /* * importFile - * import file "in_filename" into database as large object "lobjOid" * */ static Oid importFile(PGconn *conn, char *filename) { Oid lobjId; int lobj_fd; char buf[BUFSIZE]; int nbytes, tmp; int fd; /* * open the file to be read in */ fd = open(filename, O_RDONLY, 0666); if (fd < 0) { /* error */ fprintf(stderr, "cannot open unix file\"%s\"\n", filename); } /* * create the large object */ lobjId = lo_creat(conn, INV_READ | INV_WRITE); if (lobjId == 0) fprintf(stderr, "cannot create large object"); lobj_fd = lo_open(conn, lobjId, INV_WRITE); /* * read in from the Unix file and write to the inversion file */ while ((nbytes = read(fd, buf, BUFSIZE)) > 0) { tmp = lo_write(conn, lobj_fd, buf, nbytes); if (tmp < nbytes) fprintf(stderr, "error while reading \"%s\"", filename); } close(fd); lo_close(conn, lobj_fd); return lobjId; } static void pickout(PGconn *conn, Oid lobjId, int start, int len) { int lobj_fd; char *buf; int nbytes; int nread; lobj_fd = lo_open(conn, lobjId, INV_READ); if (lobj_fd < 0) fprintf(stderr, "cannot open large object %u", lobjId); lo_lseek(conn, lobj_fd, start, SEEK_SET); buf = malloc(len + 1); nread = 0; while (len - nread > 0) { nbytes = lo_read(conn, lobj_fd, buf, len - nread); buf[nbytes] = '\0'; fprintf(stderr, ">>> %s", buf); nread += nbytes; if (nbytes <= 0) break; /* no more data? */ } free(buf); fprintf(stderr, "\n"); lo_close(conn, lobj_fd); } static void overwrite(PGconn *conn, Oid lobjId, int start, int len) { int lobj_fd; char *buf; int nbytes; int nwritten; int i; lobj_fd = lo_open(conn, lobjId, INV_WRITE); if (lobj_fd < 0) fprintf(stderr, "cannot open large object %u", lobjId); lo_lseek(conn, lobj_fd, start, SEEK_SET); buf = malloc(len + 1); for (i = 0; i < len; i++) buf[i] = 'X'; buf[i] = '\0'; nwritten = 0; while (len - nwritten > 0) { nbytes = lo_write(conn, lobj_fd, buf + nwritten, len - nwritten); nwritten += nbytes; if (nbytes <= 0) { fprintf(stderr, "\nWRITE FAILED!\n"); break; } } free(buf); fprintf(stderr, "\n"); lo_close(conn, lobj_fd); } /* * exportFile - * export large object "lobjOid" to file "out_filename" * */ static void exportFile(PGconn *conn, Oid lobjId, char *filename) { int lobj_fd; char buf[BUFSIZE]; int nbytes, tmp; int fd; /* * open the large object */ lobj_fd = lo_open(conn, lobjId, INV_READ); if (lobj_fd < 0) fprintf(stderr, "cannot open large object %u", lobjId); /* * open the file to be written to */ fd = open(filename, O_CREAT | O_WRONLY | O_TRUNC, 0666); if (fd < 0) { /* error */ fprintf(stderr, "cannot open unix file\"%s\"", filename); } /* * read in from the inversion file and write to the Unix file */ while ((nbytes = lo_read(conn, lobj_fd, buf, BUFSIZE)) > 0) { tmp = write(fd, buf, nbytes); if (tmp < nbytes) { fprintf(stderr, "error while writing \"%s\"", filename); } } lo_close(conn, lobj_fd); close(fd); return; } static void exit_nicely(PGconn *conn) { PQfinish(conn); exit(1); } int main(int argc, char **argv) { char *in_filename, *out_filename; char *database; Oid lobjOid; PGconn *conn; PGresult *res; if (argc != 4) { fprintf(stderr, "Usage: %s database_name in_filename out_filename\n", argv[0]); exit(1); } database = argv[1]; in_filename = argv[2]; out_filename = argv[3]; /* * set up the connection */ conn = PQsetdb(NULL, NULL, NULL, NULL, database); /* check to see that the backend connection was successfully made */ if (PQstatus(conn) != CONNECTION_OK) { fprintf(stderr, "Connection to database failed: %s", PQerrorMessage(conn)); exit_nicely(conn); } /* Set always-secure search path, so malicious users can't take control. */ res = PQexec(conn, "SELECT pg_catalog.set_config('search_path', '', false)"); if (PQresultStatus(res) != PGRES_TUPLES_OK) { fprintf(stderr, "SET failed: %s", PQerrorMessage(conn)); PQclear(res); exit_nicely(conn); } PQclear(res); res = PQexec(conn, "begin"); PQclear(res); printf("importing file \"%s\" ...\n", in_filename); /* lobjOid = importFile(conn, in_filename); */ lobjOid = lo_import(conn, in_filename); if (lobjOid == 0) fprintf(stderr, "%s\n", PQerrorMessage(conn)); else { printf("\tas large object %u.\n", lobjOid); printf("picking out bytes 1000-2000 of the large object\n"); pickout(conn, lobjOid, 1000, 1000); printf("overwriting bytes 1000-2000 of the large object with X's\n"); overwrite(conn, lobjOid, 1000, 1000); printf("exporting large object to file \"%s\" ...\n", out_filename); /* exportFile(conn, lobjOid, out_filename); */ if (lo_export(conn, lobjOid, out_filename) < 0) fprintf(stderr, "%s\n", PQerrorMessage(conn)); } res = PQexec(conn, "end"); PQclear(res); PQfinish(conn); return 0; } ]]>