summaryrefslogtreecommitdiffstats
path: root/doc/radosgw/troubleshooting.rst
diff options
context:
space:
mode:
authorDaniel Baumann <daniel.baumann@progress-linux.org>2024-04-27 18:24:20 +0000
committerDaniel Baumann <daniel.baumann@progress-linux.org>2024-04-27 18:24:20 +0000
commit483eb2f56657e8e7f419ab1a4fab8dce9ade8609 (patch)
treee5d88d25d870d5dedacb6bbdbe2a966086a0a5cf /doc/radosgw/troubleshooting.rst
parentInitial commit. (diff)
downloadceph-upstream.tar.xz
ceph-upstream.zip
Adding upstream version 14.2.21.upstream/14.2.21upstream
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'doc/radosgw/troubleshooting.rst')
-rw-r--r--doc/radosgw/troubleshooting.rst208
1 files changed, 208 insertions, 0 deletions
diff --git a/doc/radosgw/troubleshooting.rst b/doc/radosgw/troubleshooting.rst
new file mode 100644
index 00000000..4a084e82
--- /dev/null
+++ b/doc/radosgw/troubleshooting.rst
@@ -0,0 +1,208 @@
+=================
+ Troubleshooting
+=================
+
+
+The Gateway Won't Start
+=======================
+
+If you cannot start the gateway (i.e., there is no existing ``pid``),
+check to see if there is an existing ``.asok`` file from another
+user. If an ``.asok`` file from another user exists and there is no
+running ``pid``, remove the ``.asok`` file and try to start the
+process again. This may occur when you start the process as a ``root`` user and
+the startup script is trying to start the process as a
+``www-data`` or ``apache`` user and an existing ``.asok`` is
+preventing the script from starting the daemon.
+
+The radosgw init script (/etc/init.d/radosgw) also has a verbose argument that
+can provide some insight as to what could be the issue::
+
+ /etc/init.d/radosgw start -v
+
+or ::
+
+ /etc/init.d radosgw start --verbose
+
+HTTP Request Errors
+===================
+
+Examining the access and error logs for the web server itself is
+probably the first step in identifying what is going on. If there is
+a 500 error, that usually indicates a problem communicating with the
+``radosgw`` daemon. Ensure the daemon is running, its socket path is
+configured, and that the web server is looking for it in the proper
+location.
+
+
+Crashed ``radosgw`` process
+===========================
+
+If the ``radosgw`` process dies, you will normally see a 500 error
+from the web server (apache, nginx, etc.). In that situation, simply
+restarting radosgw will restore service.
+
+To diagnose the cause of the crash, check the log in ``/var/log/ceph``
+and/or the core file (if one was generated).
+
+
+Blocked ``radosgw`` Requests
+============================
+
+If some (or all) radosgw requests appear to be blocked, you can get
+some insight into the internal state of the ``radosgw`` daemon via
+its admin socket. By default, there will be a socket configured to
+reside in ``/var/run/ceph``, and the daemon can be queried with::
+
+ ceph daemon /var/run/ceph/client.rgw help
+
+ help list available commands
+ objecter_requests show in-progress osd requests
+ perfcounters_dump dump perfcounters value
+ perfcounters_schema dump perfcounters schema
+ version get protocol version
+
+Of particular interest::
+
+ ceph daemon /var/run/ceph/client.rgw objecter_requests
+ ...
+
+will dump information about current in-progress requests with the
+RADOS cluster. This allows one to identify if any requests are blocked
+by a non-responsive OSD. For example, one might see::
+
+ { "ops": [
+ { "tid": 1858,
+ "pg": "2.d2041a48",
+ "osd": 1,
+ "last_sent": "2012-03-08 14:56:37.949872",
+ "attempts": 1,
+ "object_id": "fatty_25647_object1857",
+ "object_locator": "@2",
+ "snapid": "head",
+ "snap_context": "0=[]",
+ "mtime": "2012-03-08 14:56:37.949813",
+ "osd_ops": [
+ "write 0~4096"]},
+ { "tid": 1873,
+ "pg": "2.695e9f8e",
+ "osd": 1,
+ "last_sent": "2012-03-08 14:56:37.970615",
+ "attempts": 1,
+ "object_id": "fatty_25647_object1872",
+ "object_locator": "@2",
+ "snapid": "head",
+ "snap_context": "0=[]",
+ "mtime": "2012-03-08 14:56:37.970555",
+ "osd_ops": [
+ "write 0~4096"]}],
+ "linger_ops": [],
+ "pool_ops": [],
+ "pool_stat_ops": [],
+ "statfs_ops": []}
+
+In this dump, two requests are in progress. The ``last_sent`` field is
+the time the RADOS request was sent. If this is a while ago, it suggests
+that the OSD is not responding. For example, for request 1858, you could
+check the OSD status with::
+
+ ceph pg map 2.d2041a48
+
+ osdmap e9 pg 2.d2041a48 (2.0) -> up [1,0] acting [1,0]
+
+This tells us to look at ``osd.1``, the primary copy for this PG::
+
+ ceph daemon osd.1 ops
+ { "num_ops": 651,
+ "ops": [
+ { "description": "osd_op(client.4124.0:1858 fatty_25647_object1857 [write 0~4096] 2.d2041a48)",
+ "received_at": "1331247573.344650",
+ "age": "25.606449",
+ "flag_point": "waiting for sub ops",
+ "client_info": { "client": "client.4124",
+ "tid": 1858}},
+ ...
+
+The ``flag_point`` field indicates that the OSD is currently waiting
+for replicas to respond, in this case ``osd.0``.
+
+
+Java S3 API Troubleshooting
+===========================
+
+
+Peer Not Authenticated
+----------------------
+
+You may receive an error that looks like this::
+
+ [java] INFO: Unable to execute HTTP request: peer not authenticated
+
+The Java SDK for S3 requires a valid certificate from a recognized certificate
+authority, because it uses HTTPS by default. If you are just testing the Ceph
+Object Storage services, you can resolve this problem in a few ways:
+
+#. Prepend the IP address or hostname with ``http://``. For example, change this::
+
+ conn.setEndpoint("myserver");
+
+ To::
+
+ conn.setEndpoint("http://myserver")
+
+#. After setting your credentials, add a client configuration and set the
+ protocol to ``Protocol.HTTP``. ::
+
+ AWSCredentials credentials = new BasicAWSCredentials(accessKey, secretKey);
+
+ ClientConfiguration clientConfig = new ClientConfiguration();
+ clientConfig.setProtocol(Protocol.HTTP);
+
+ AmazonS3 conn = new AmazonS3Client(credentials, clientConfig);
+
+
+
+405 MethodNotAllowed
+--------------------
+
+If you receive an 405 error, check to see if you have the S3 subdomain set up correctly.
+You will need to have a wild card setting in your DNS record for subdomain functionality
+to work properly.
+
+Also, check to ensure that the default site is disabled. ::
+
+ [java] Exception in thread "main" Status Code: 405, AWS Service: Amazon S3, AWS Request ID: null, AWS Error Code: MethodNotAllowed, AWS Error Message: null, S3 Extended Request ID: null
+
+
+
+Numerous objects in default.rgw.meta pool
+=========================================
+
+Clusters created prior to *jewel* have a metadata archival feature enabled by default, using the ``default.rgw.meta`` pool.
+This archive keeps all old versions of user and bucket metadata, resulting in large numbers of objects in the ``default.rgw.meta`` pool.
+
+Disabling the Metadata Heap
+---------------------------
+
+Users who want to disable this feature going forward should set the ``metadata_heap`` field to an empty string ``""``::
+
+ $ radosgw-admin zone get --rgw-zone=default > zone.json
+ [edit zone.json, setting "metadata_heap": ""]
+ $ radosgw-admin zone set --rgw-zone=default --infile=zone.json
+ $ radosgw-admin period update --commit
+
+This will stop new metadata from being written to the ``default.rgw.meta`` pool, but does not remove any existing objects or pool.
+
+Cleaning the Metadata Heap Pool
+-------------------------------
+
+Clusters created prior to *jewel* normally use ``default.rgw.meta`` only for the metadata archival feature.
+
+However, from *luminous* onwards, radosgw uses :ref:`Pool Namespaces <radosgw-pool-namespaces>` within ``default.rgw.meta`` for an entirely different purpose, that is, to store ``user_keys`` and other critical metadata.
+
+Users should check zone configuration before proceeding any cleanup procedures::
+
+ $ radosgw-admin zone get --rgw-zone=default | grep default.rgw.meta
+ [should not match any strings]
+
+Having confirmed that the pool is not used for any purpose, users may safely delete all objects in the ``default.rgw.meta`` pool, or optionally, delete the entire pool itself.