Diffstat (limited to 'README.md')
-rw-r--r-- README.md 131
1 files changed, 131 insertions, 0 deletions
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..28d6faf
--- /dev/null
+++ b/README.md
@@ -0,0 +1,131 @@
+identify
+========
+
+[![Build Status](https://travis-ci.org/chriskuehl/identify.svg?branch=master)](https://travis-ci.org/chriskuehl/identify)
+[![Coverage Status](https://coveralls.io/repos/github/chriskuehl/identify/badge.svg?branch=master)](https://coveralls.io/github/chriskuehl/identify?branch=master)
+[![PyPI version](https://badge.fury.io/py/identify.svg)](https://pypi.python.org/pypi/identify)
+
+File identification library for Python.
+
+Given a file (or some information about a file), return a set of standardized
+tags identifying what the file is.
+
+
+## Usage
+### With a file on disk
+
+If you have an actual file on disk, you can get the most information possible
+(a superset of all other methods):
+
+```python
+>>> identify.tags_from_path('/path/to/file.py')
+{'file', 'text', 'python', 'non-executable'}
+>>> identify.tags_from_path('/path/to/file-with-shebang')
+{'file', 'text', 'shell', 'bash', 'executable'}
+>>> identify.tags_from_path('/bin/bash')
+{'file', 'binary', 'executable'}
+>>> identify.tags_from_path('/path/to/directory')
+{'directory'}
+>>> identify.tags_from_path('/path/to/symlink')
+{'symlink'}
+```
+
+When using a file on disk, the checks performed are:
+
+* File type (file, symlink, directory)
+* Mode (is it executable?)
+* File name (mostly based on extension)
+* If executable, the shebang is read and its interpreter is identified
+
+
+### If you only have the filename
+
+```python
+>>> identify.tags_from_filename('file.py')
+{'text', 'python'}
+```
+
+
+### If you only have the interpreter
+
+```python
+>>> identify.tags_from_interpreter('python3.5')
+{'python', 'python3'}
+>>> identify.tags_from_interpreter('bash')
+{'shell', 'bash'}
+>>> identify.tags_from_interpreter('some-unrecognized-thing')
+set()
+```
+
+### As a CLI
+
+```
+$ identify-cli --help
+usage: identify-cli [-h] [--filename-only] path
+
+positional arguments:
+ path
+
+optional arguments:
+ -h, --help show this help message and exit
+ --filename-only
+```
+
+```bash
+$ identify-cli setup.py; echo $?
+["file", "non-executable", "python", "text"]
+0
+$ identify-cli setup.py --filename-only; echo $?
+["python", "text"]
+0
+$ identify-cli wat.wat; echo $?
+wat.wat does not exist.
+1
+$ identify-cli wat.wat --filename-only; echo $?
+1
+```
+
+### Identifying LICENSE files
+
+`identify` also has an API for determining what type of license is contained
+in a file. This routine is roughly based on the approaches used by
+[licensee] (the Ruby gem that GitHub uses to figure out the license for a
+repo).
+
+The approach that `identify` uses is as follows:
+
+1. Strip the copyright line
+2. Normalize all whitespace
+3. Return any exact matches
+4. Return the closest match by edit distance (where edit distance < 5%)
+
+To use the API, install via `pip install identify[license]`.
+
+```pycon
+>>> from identify import identify
+>>> identify.license_id('LICENSE')
+'MIT'
+```
+
+The return value of the `license_id` function is an [SPDX] id. Currently
+licenses are sourced from [choosealicense.com].
+
+[licensee]: https://github.com/benbalter/licensee
+[SPDX]: https://spdx.org/licenses/
+[choosealicense.com]: https://github.com/github/choosealicense.com
+
+## How it works
+
+A call to `tags_from_path` does this:
+
+1. What is the type: file, symlink, directory? If it's not a file, stop here.
+2. Is it executable? Add the appropriate tag.
+3. Do we recognize the file extension? If so, add the appropriate tags and stop
+ here. These tags would include binary/text.
+4. Peek at the first X bytes of the file. Use these to determine whether it is
+ binary or text, and add the appropriate tag.
+5. If identified as text above, try to read and interpret the shebang, and add
+ appropriate tags.
+
+By design, this means we don't need to partially read files where we recognize
+the file extension.
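
The five steps listed under "How it works" can be sketched roughly as below. This is an illustrative approximation using only the standard library, not the code shipped in `identify`. The names `sketch_tags_from_path`, `EXTENSION_TAGS`, `INTERPRETER_TAGS`, `PEEK_BYTES`, `looks_like_text` and `parse_shebang` are hypothetical, the lookup tables are tiny stand-ins, the 1024-byte peek size is an assumed value for the README's unspecified "X", and the NUL-byte check is a simplified substitute for whatever binary/text heuristic the library actually uses.

```python
import os
import stat

# Hypothetical, heavily abbreviated lookup tables (assumptions for this sketch).
EXTENSION_TAGS = {
    'py': {'text', 'python'},
    'sh': {'text', 'shell'},
    'png': {'binary', 'image'},
}
INTERPRETER_TAGS = {
    'python3': {'python', 'python3'},
    'bash': {'shell', 'bash'},
}
PEEK_BYTES = 1024  # assumed stand-in for the README's "first X bytes"


def looks_like_text(chunk):
    """Crude heuristic (an assumption): no NUL byte means text."""
    return b'\0' not in chunk


def parse_shebang(first_line):
    """Return the interpreter name from a shebang line, or None."""
    if not first_line.startswith(b'#!'):
        return None
    parts = first_line[2:].decode('utf-8', 'replace').split()
    if not parts:
        return None
    # Handle the common "#!/usr/bin/env bash" form.
    if os.path.basename(parts[0]) == 'env' and len(parts) > 1:
        return parts[1]
    return os.path.basename(parts[0])


def sketch_tags_from_path(path):
    mode = os.lstat(path).st_mode

    # Step 1: file type. Anything that is not a regular file stops here.
    if stat.S_ISDIR(mode):
        return {'directory'}
    if stat.S_ISLNK(mode):
        return {'symlink'}
    if not stat.S_ISREG(mode):
        return set()  # sockets, devices, etc. are out of scope for this sketch
    tags = {'file'}

    # Step 2: executable if any execute bit is set.
    exec_bits = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    tags.add('executable' if mode & exec_bits else 'non-executable')

    # Step 3: a recognized extension decides text/binary without reading the file.
    ext = os.path.splitext(path)[1].lstrip('.').lower()
    if ext in EXTENSION_TAGS:
        return tags | EXTENSION_TAGS[ext]

    # Step 4: peek at the start of the file to classify it as binary or text.
    with open(path, 'rb') as f:
        chunk = f.read(PEEK_BYTES)
    if not looks_like_text(chunk):
        return tags | {'binary'}
    tags.add('text')

    # Step 5: for text files, try to interpret the shebang.
    lines = chunk.splitlines()
    interpreter = parse_shebang(lines[0] if lines else b'')
    return tags | INTERPRETER_TAGS.get(interpreter, set())
```

Because step 3 returns before the file is opened, a path such as `file.py` is tagged from its name and mode alone, which matches the README's closing remark about not partially reading files with a recognized extension.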