Diffstat:
-rw-r--r-- README.md | 131
1 file changed, 131 insertions, 0 deletions
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..28d6faf
--- /dev/null
+++ b/README.md
@@ -0,0 +1,131 @@
+identify
+========
+
+[![Build Status](https://travis-ci.org/chriskuehl/identify.svg?branch=master)](https://travis-ci.org/chriskuehl/identify)
+[![Coverage Status](https://coveralls.io/repos/github/chriskuehl/identify/badge.svg?branch=master)](https://coveralls.io/github/chriskuehl/identify?branch=master)
+[![PyPI version](https://badge.fury.io/py/identify.svg)](https://pypi.python.org/pypi/identify)
+
+File identification library for Python.
+
+Given a file (or some information about a file), return a set of standardized
+tags identifying what the file is.
+
+
+## Usage
+### With a file on disk
+
+If you have an actual file on disk, you can get the most information possible
+(a superset of all other methods). Examples assume `from identify import identify`:
+
+```python
+>>> identify.tags_from_path('/path/to/file.py')
+{'file', 'text', 'python', 'non-executable'}
+>>> identify.tags_from_path('/path/to/file-with-shebang')
+{'file', 'text', 'shell', 'bash', 'executable'}
+>>> identify.tags_from_path('/bin/bash')
+{'file', 'binary', 'executable'}
+>>> identify.tags_from_path('/path/to/directory')
+{'directory'}
+>>> identify.tags_from_path('/path/to/symlink')
+{'symlink'}
+```
+
+When using a file on disk, the checks performed are:
+
+* File type (file, symlink, directory)
+* Mode (is it executable?)
+* File name (mostly based on extension)
+* If executable, the shebang is read and the interpreter is identified
+
+
+### If you only have the filename
+
+```python
+>>> identify.tags_from_filename('file.py')
+{'text', 'python'}
+```
+
+
+### If you only have the interpreter
+
+```python
+>>> identify.tags_from_interpreter('python3.5')
+{'python', 'python3'}
+>>> identify.tags_from_interpreter('bash')
+{'shell', 'bash'}
+>>> identify.tags_from_interpreter('some-unrecognized-thing')
+set()
+```
+
+### As a CLI
+
+```
+$ identify-cli --help
+usage: identify-cli [-h] [--filename-only] path
+
+positional arguments:
+  path
+
+optional arguments:
+  -h, --help       show this help message and exit
+  --filename-only
+```
+
+```bash
+$ identify-cli setup.py; echo $?
+["file", "non-executable", "python", "text"]
+0
+$ identify-cli setup.py --filename-only; echo $?
+["python", "text"]
+0
+$ identify-cli wat.wat; echo $?
+wat.wat does not exist.
+1
+$ identify-cli wat.wat --filename-only; echo $?
+1
+```
+
+### Identifying LICENSE files
+
+`identify` also has an API for determining what type of license is contained
+in a file. This routine is roughly based on the approaches used by
+[licensee] (the Ruby gem that GitHub uses to figure out the license for a
+repo).
+
+The approach that `identify` uses is as follows:
+
+1. Strip the copyright line
+2. Normalize all whitespace
+3. Return any exact matches
+4. Return the closest match by edit distance (where edit distance < 5%)
+
+To use the API, install via `pip install identify[license]`:
+
+```pycon
+>>> from identify import identify
+>>> identify.license_id('LICENSE')
+'MIT'
+```
+
+The return value of the `license_id` function is an [SPDX] ID. Currently
+licenses are sourced from [choosealicense.com].
+
+[licensee]: https://github.com/benbalter/licensee
+[SPDX]: https://spdx.org/licenses/
+[choosealicense.com]: https://github.com/github/choosealicense.com
+
+## How it works
+
+A call to `tags_from_path` does this:
+
+1. What is the type: file, symlink, directory? If it's not a file, stop here.
+2. Is it executable? Add the appropriate tag.
+3. Do we recognize the file extension? If so, add the appropriate tags and stop
+   here. These tags would include binary/text.
+4. Peek at the first X bytes of the file. Use these to determine whether it is
+   binary or text, and add the appropriate tag.
+5. If identified as text above, try to read and interpret the shebang, and add
+   appropriate tags.
+
+By design, this means we don't need to partially read files where we recognize
+the file extension.
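
The five steps listed under "How it works" can be sketched roughly as follows. This is an illustrative approximation, not the library's actual implementation: `sketch_tags_from_path`, the two small lookup tables, the 1024-byte peek size (the README only says "X bytes"), and the text/binary heuristic are all assumptions made for the example.

```python
import os
import stat

# Hypothetical lookup tables; a real table would be much larger.
EXTENSION_TAGS = {
    'py': {'text', 'python'},
    'sh': {'text', 'shell'},
    'png': {'binary', 'image', 'png'},
}
INTERPRETER_TAGS = {
    'python3': {'python', 'python3'},
    'bash': {'shell', 'bash'},
}
PEEK_BYTES = 1024  # assumed value for the unspecified "X"
# Assumed heuristic: printable ASCII, common control codes, and high bytes count as text.
TEXT_BYTES = bytes(range(0x20, 0x7F)) + b'\a\b\t\n\f\r\x1b' + bytes(range(0x80, 0x100))


def sketch_tags_from_path(path):
    # Step 1: file type. Symlinks are reported, not followed.
    if not os.path.lexists(path):
        raise ValueError(f'{path} does not exist.')
    if os.path.islink(path):
        return {'symlink'}
    if os.path.isdir(path):
        return {'directory'}
    tags = {'file'}

    # Step 2: mode. Any execute bit set counts as executable in this sketch.
    mode = os.stat(path).st_mode
    is_exec = bool(mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH))
    tags.add('executable' if is_exec else 'non-executable')

    # Step 3: a recognized extension settles text/binary without reading the file.
    ext = os.path.splitext(path)[1].lstrip('.').lower()
    if ext in EXTENSION_TAGS:
        return tags | EXTENSION_TAGS[ext]

    # Step 4: peek at the first bytes; anything left after deleting text bytes means binary.
    with open(path, 'rb') as f:
        head = f.read(PEEK_BYTES)
    is_text = not head.translate(None, TEXT_BYTES)
    tags.add('text' if is_text else 'binary')

    # Step 5: for text files, interpret the shebang (version suffixes are not handled here).
    if is_text and head.startswith(b'#!'):
        parts = head.splitlines()[0][2:].decode('utf-8', errors='replace').split()
        if parts and os.path.basename(parts[0]) == 'env':
            parts = parts[1:]
        if parts:
            tags |= INTERPRETER_TAGS.get(os.path.basename(parts[0]), set())
    return tags
```

Because step 3 returns early, a file such as `file.py` is never opened in this sketch, which mirrors the design note that files with a recognized extension do not need to be partially read.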