diff options
author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-27 18:24:20 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-27 18:24:20 +0000 |
commit | 483eb2f56657e8e7f419ab1a4fab8dce9ade8609 (patch) | |
tree | e5d88d25d870d5dedacb6bbdbe2a966086a0a5cf /doc/radosgw/troubleshooting.rst | |
parent | Initial commit. (diff) | |
download | ceph-upstream.tar.xz ceph-upstream.zip |
Adding upstream version 14.2.21.upstream/14.2.21upstream
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'doc/radosgw/troubleshooting.rst')
-rw-r--r-- | doc/radosgw/troubleshooting.rst | 208 |
1 files changed, 208 insertions, 0 deletions
diff --git a/doc/radosgw/troubleshooting.rst b/doc/radosgw/troubleshooting.rst new file mode 100644 index 00000000..4a084e82 --- /dev/null +++ b/doc/radosgw/troubleshooting.rst @@ -0,0 +1,208 @@ +================= + Troubleshooting +================= + + +The Gateway Won't Start +======================= + +If you cannot start the gateway (i.e., there is no existing ``pid``), +check to see if there is an existing ``.asok`` file from another +user. If an ``.asok`` file from another user exists and there is no +running ``pid``, remove the ``.asok`` file and try to start the +process again. This may occur when you start the process as a ``root`` user and +the startup script is trying to start the process as a +``www-data`` or ``apache`` user and an existing ``.asok`` is +preventing the script from starting the daemon. + +The radosgw init script (/etc/init.d/radosgw) also has a verbose argument that +can provide some insight as to what could be the issue:: + + /etc/init.d/radosgw start -v + +or :: + + /etc/init.d radosgw start --verbose + +HTTP Request Errors +=================== + +Examining the access and error logs for the web server itself is +probably the first step in identifying what is going on. If there is +a 500 error, that usually indicates a problem communicating with the +``radosgw`` daemon. Ensure the daemon is running, its socket path is +configured, and that the web server is looking for it in the proper +location. + + +Crashed ``radosgw`` process +=========================== + +If the ``radosgw`` process dies, you will normally see a 500 error +from the web server (apache, nginx, etc.). In that situation, simply +restarting radosgw will restore service. + +To diagnose the cause of the crash, check the log in ``/var/log/ceph`` +and/or the core file (if one was generated). + + +Blocked ``radosgw`` Requests +============================ + +If some (or all) radosgw requests appear to be blocked, you can get +some insight into the internal state of the ``radosgw`` daemon via +its admin socket. By default, there will be a socket configured to +reside in ``/var/run/ceph``, and the daemon can be queried with:: + + ceph daemon /var/run/ceph/client.rgw help + + help list available commands + objecter_requests show in-progress osd requests + perfcounters_dump dump perfcounters value + perfcounters_schema dump perfcounters schema + version get protocol version + +Of particular interest:: + + ceph daemon /var/run/ceph/client.rgw objecter_requests + ... + +will dump information about current in-progress requests with the +RADOS cluster. This allows one to identify if any requests are blocked +by a non-responsive OSD. For example, one might see:: + + { "ops": [ + { "tid": 1858, + "pg": "2.d2041a48", + "osd": 1, + "last_sent": "2012-03-08 14:56:37.949872", + "attempts": 1, + "object_id": "fatty_25647_object1857", + "object_locator": "@2", + "snapid": "head", + "snap_context": "0=[]", + "mtime": "2012-03-08 14:56:37.949813", + "osd_ops": [ + "write 0~4096"]}, + { "tid": 1873, + "pg": "2.695e9f8e", + "osd": 1, + "last_sent": "2012-03-08 14:56:37.970615", + "attempts": 1, + "object_id": "fatty_25647_object1872", + "object_locator": "@2", + "snapid": "head", + "snap_context": "0=[]", + "mtime": "2012-03-08 14:56:37.970555", + "osd_ops": [ + "write 0~4096"]}], + "linger_ops": [], + "pool_ops": [], + "pool_stat_ops": [], + "statfs_ops": []} + +In this dump, two requests are in progress. The ``last_sent`` field is +the time the RADOS request was sent. If this is a while ago, it suggests +that the OSD is not responding. For example, for request 1858, you could +check the OSD status with:: + + ceph pg map 2.d2041a48 + + osdmap e9 pg 2.d2041a48 (2.0) -> up [1,0] acting [1,0] + +This tells us to look at ``osd.1``, the primary copy for this PG:: + + ceph daemon osd.1 ops + { "num_ops": 651, + "ops": [ + { "description": "osd_op(client.4124.0:1858 fatty_25647_object1857 [write 0~4096] 2.d2041a48)", + "received_at": "1331247573.344650", + "age": "25.606449", + "flag_point": "waiting for sub ops", + "client_info": { "client": "client.4124", + "tid": 1858}}, + ... + +The ``flag_point`` field indicates that the OSD is currently waiting +for replicas to respond, in this case ``osd.0``. + + +Java S3 API Troubleshooting +=========================== + + +Peer Not Authenticated +---------------------- + +You may receive an error that looks like this:: + + [java] INFO: Unable to execute HTTP request: peer not authenticated + +The Java SDK for S3 requires a valid certificate from a recognized certificate +authority, because it uses HTTPS by default. If you are just testing the Ceph +Object Storage services, you can resolve this problem in a few ways: + +#. Prepend the IP address or hostname with ``http://``. For example, change this:: + + conn.setEndpoint("myserver"); + + To:: + + conn.setEndpoint("http://myserver") + +#. After setting your credentials, add a client configuration and set the + protocol to ``Protocol.HTTP``. :: + + AWSCredentials credentials = new BasicAWSCredentials(accessKey, secretKey); + + ClientConfiguration clientConfig = new ClientConfiguration(); + clientConfig.setProtocol(Protocol.HTTP); + + AmazonS3 conn = new AmazonS3Client(credentials, clientConfig); + + + +405 MethodNotAllowed +-------------------- + +If you receive an 405 error, check to see if you have the S3 subdomain set up correctly. +You will need to have a wild card setting in your DNS record for subdomain functionality +to work properly. + +Also, check to ensure that the default site is disabled. :: + + [java] Exception in thread "main" Status Code: 405, AWS Service: Amazon S3, AWS Request ID: null, AWS Error Code: MethodNotAllowed, AWS Error Message: null, S3 Extended Request ID: null + + + +Numerous objects in default.rgw.meta pool +========================================= + +Clusters created prior to *jewel* have a metadata archival feature enabled by default, using the ``default.rgw.meta`` pool. +This archive keeps all old versions of user and bucket metadata, resulting in large numbers of objects in the ``default.rgw.meta`` pool. + +Disabling the Metadata Heap +--------------------------- + +Users who want to disable this feature going forward should set the ``metadata_heap`` field to an empty string ``""``:: + + $ radosgw-admin zone get --rgw-zone=default > zone.json + [edit zone.json, setting "metadata_heap": ""] + $ radosgw-admin zone set --rgw-zone=default --infile=zone.json + $ radosgw-admin period update --commit + +This will stop new metadata from being written to the ``default.rgw.meta`` pool, but does not remove any existing objects or pool. + +Cleaning the Metadata Heap Pool +------------------------------- + +Clusters created prior to *jewel* normally use ``default.rgw.meta`` only for the metadata archival feature. + +However, from *luminous* onwards, radosgw uses :ref:`Pool Namespaces <radosgw-pool-namespaces>` within ``default.rgw.meta`` for an entirely different purpose, that is, to store ``user_keys`` and other critical metadata. + +Users should check zone configuration before proceeding any cleanup procedures:: + + $ radosgw-admin zone get --rgw-zone=default | grep default.rgw.meta + [should not match any strings] + +Having confirmed that the pool is not used for any purpose, users may safely delete all objects in the ``default.rgw.meta`` pool, or optionally, delete the entire pool itself. |