User guide & technical documentation

Intro

Mitm-Archive consists of two parts:

  • An addon for mitmproxy that intercepts and saves (archives) all HTTP responses, written in Python
  • A server that returns exactly the archived response for the corresponding method+domain+port+path+query, written in Go

An "archive" is an SQLite3 database plus a directory storing the headers and body of each archived response. See the Format section for details.

User guide: addon

Installing mitmproxy

First, check whether your Linux distro provides a mitmproxy package and whether its version is 10.x (NOT 9.x or older).
If it doesn't, or you are using Windows, you can download the official pre-built binaries: https://mitmproxy.org/.

If you are on a Linux distro without glibc, or you don't trust the official binaries (that's wise), use pipx install mitmproxy. Mitmproxy also contains native code, so the following packages are required: base-devel (includes gcc), openssl-devel, libbsd-devel, python3-devel. Note: these are the package names in the Void Linux repository; yours may differ.

The native library inside pylsqpack depends on BSD's sys/queue.h, which libbsd-devel provides, but installs as bsd/sys/queue.h. The simplest solution is a symlink:

$ sudo ln -s /usr/include/bsd/sys/queue.h /usr/include/sys/queue.h

Now you can run pipx install mitmproxy.

Configuring HTTPS proxy

Start mitmproxy or mitmweb. The first is a CLI, the second provides a web UI.

I'll assume that you are using Firefox (or one of its forks). Firefox supports importing certificates browser-wide, and configuring an HTTP proxy in it is simpler than in Chromium.

I recommend creating a separate browser profile, because the next step imports a TLS cert, and for security reasons you must remember to remove it once the archive is created. On Firefox, it's about:profiles in the address bar > Create a New Profile. This is only advice; if manually switching the proxy off and removing the mitmproxy cert is OK (you're sure you won't forget), then use your main profile, but close any active tabs that may produce extra requests you don't want archived (e.g. messenger web clients like Element or Telegram Web).

Now, point your browser to the proxy on 127.0.0.1:8080. On Firefox, it's Settings > Network Settings (at the bottom) > Settings… > Manual proxy configuration > HTTP: 127.0.0.1, Port: 8080 > Checkbox "Also use this proxy for HTTPS".

Go to http://mitm.it, ignore warnings about an unencrypted connection (mitm.it is served by your local mitmproxy), click "Get mitmproxy-ca-cert.pem" below "Firefox". Import it: Settings > Privacy & Security > Certificates > View Certificates… > "Authorities" tab > Import… > Choose the downloaded cert > Checkbox "Trust this CA to identify web sites" > OK.

Archiving web sites

To get the addon, either clone the git repo:

$ git clone https://git.dc09.ru/mitm-archive/addon
$ cd addon

… or just download the script:

$ mkdir addon && cd addon
$ curl https://git.dc09.ru/mitm-archive/addon/raw/branch/main/addon.py >addon.py

Stop mitmproxy if it's still running (q and then y for mitmproxy; Ctrl+C for mitmweb), then re-launch it with the mitm-archive addon: mitmproxy -s addon.py (or mitmweb).

Each HTTP response that reaches mitmproxy is archived: metadata goes into the ./archive.db SQLite database, while headers and body go into ./storage/{id}/headers and ./storage/{id}/body respectively.

To adjust these paths, set the environment variables:

$ export SQLITE_DB_PATH=archive.db
$ export STORAGE=storage
$ mitmproxy -s addon.py
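
How the addon reads these variables is not shown in this guide. A minimal sketch, assuming it falls back to the default paths mentioned above when a variable is unset (the function name `load_paths` is hypothetical, not taken from addon.py):

```python
import os

def load_paths(env=os.environ):
    """Return (db_path, storage_dir), using the documented defaults as fallback."""
    # Assumption: an unset variable means "use the default path".
    db_path = env.get("SQLITE_DB_PATH", "archive.db")
    storage_dir = env.get("STORAGE", "storage")
    return db_path, storage_dir

# With no variables set, the defaults apply.
print(load_paths({}))
# With both variables set, the custom paths are used.
print(load_paths({"SQLITE_DB_PATH": "/tmp/site.db", "STORAGE": "/tmp/site"}))
```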

User guide: server

// TODO

What's not implemented

  • Filtering by host instead of archiving everything (literally 2 lines of code; could be added soon after I figure out the best way to configure this)
  • The addon is configured with env vars, while the server uses command-line options; should this be unified?
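
For illustration only, the host filter from the first item could look roughly like this. It is not implemented; the function name `should_archive` and the idea of passing a set of lowercase host names are assumptions, and how that set would be configured is exactly the open question above:

```python
def should_archive(host, allowed_hosts):
    """Decide whether a response from `host` should be archived.

    Assumption: an empty filter keeps the current behaviour (archive everything),
    and `allowed_hosts` contains lowercase host names.
    """
    if not allowed_hosts:
        return True
    return host.lower() in allowed_hosts

allowed = {"dc09.ru"}
print(should_archive("dc09.ru", allowed))
print(should_archive("example.org", allowed))
print(should_archive("example.org", set()))
```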

Probably useful, but would overcomplicate the storage format and server logic:

  • Alphabetically sort query arguments in both the addon and the server (for now, if an archive contains /api?key=val&abc=def, the same request /api?abc=def&key=val gives 404, because the URLs are not exactly the same)
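
A rough sketch of what such sorting could look like, using only Python's standard library. The function name `normalize_query` is hypothetical; note that `urlencode` re-encodes the arguments, so the result may differ byte-for-byte from the original query even beyond ordering:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def normalize_query(url):
    """Return `url` with its query arguments sorted by name."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # Stable sort by name only, so repeated keys keep their relative order.
    pairs.sort(key=lambda pair: pair[0])
    return urlunsplit(parts._replace(query=urlencode(pairs)))

# Both spellings of the request map to the same string.
print(normalize_query("/api?key=val&abc=def"))
print(normalize_query("/api?abc=def&key=val"))
```

If both the addon and the server applied the same normalization before storing and looking up a URL, the two requests above would match.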

Harder to implement, and would definitely overcomplicate the project, while neither I nor anyone else needs this:

  • Config option to omit some query args (if there is no /api?key=val&abc=def and it's allowed to omit abc, then search for /api?key=val)
  • Store request/response cookies in an archive
  • Config option to disable saving cookies specified by key (e.g. in case they contain credentials)
  • Config option to omit some cookies
  • Invent a custom format or find an existing one (a kind of HashMap) for storing query args and cookies that would make the operations listed above more convenient

For these usage scenarios, especially those involving cookies, it's simpler and overall better to self-host the server of the web site you are trying to archive, or to re-implement it in your favourite programming language and self-host that.

Format

The SQLite3 database contains a data table with the following columns:

  • id - integer primary key for each archived response
  • method - string, specifies the request method, default "GET"
  • url - string, URL formatted as $scheme://$host:$port$path$query (e.g. https://dc09.ru:443/path?key=val), required
  • code - integer, HTTP response status code, default 200
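
To illustrate the url format, here is a small sketch that builds such a string from an ordinary URL. The helper `archive_url` and the `DEFAULT_PORTS` table are assumptions made for this example; the addon may assemble the value differently (for instance, directly from mitmproxy's request fields):

```python
from urllib.parse import urlsplit

# Assumption: only http and https are relevant; other schemes raise KeyError.
DEFAULT_PORTS = {"http": 80, "https": 443}

def archive_url(url):
    """Format a URL as $scheme://$host:$port$path$query with an explicit port."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme}://{parts.hostname}:{port}{parts.path}{query}"

print(archive_url("https://dc09.ru/path?key=val"))
print(archive_url("http://dc09.ru:8080/"))
```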

The INSERT query is executed with a RETURNING id clause.
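
A minimal sketch of such an insert with Python's sqlite3 module. The exact CREATE TABLE statement is an assumption reconstructed from the column list above, and `insert_response` is a hypothetical helper. The RETURNING clause requires SQLite 3.35 or newer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: column names and defaults follow the list above,
# the exact types and constraints are a guess.
conn.execute(
    "CREATE TABLE data ("
    " id INTEGER PRIMARY KEY,"
    " method TEXT NOT NULL DEFAULT 'GET',"
    " url TEXT NOT NULL,"
    " code INTEGER NOT NULL DEFAULT 200)"
)

def insert_response(conn, method, url, code):
    """Insert one metadata row and return the ID assigned by SQLite."""
    cur = conn.execute(
        "INSERT INTO data (method, url, code) VALUES (?, ?, ?) RETURNING id",
        (method, url, code),
    )
    # Fetch the returned row before committing.
    row_id = cur.fetchone()[0]
    conn.commit()
    return row_id

first_id = insert_response(conn, "GET", "https://dc09.ru:443/path?key=val", 200)
second_id = insert_response(conn, "POST", "https://dc09.ru:443/api", 404)
print(first_id, second_id)
```

Getting the ID back from the same statement avoids a second query before creating the matching storage directory.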

In the file system storage, the addon creates a directory (if it does not already exist) named after the numeric ID returned by SQLite, writes the raw, unmodified binary body to the {id}/body file, and writes the headers in HTTP/1 format (name: value\r\n) to the {id}/headers file.
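
The write step described above can be sketched as follows. The helper `write_entry` is hypothetical, and representing headers as a list of (name, value) byte pairs is an assumption of this example, not a statement about addon.py:

```python
import tempfile
from pathlib import Path

def write_entry(storage_dir, row_id, headers, body):
    """Write headers and body for one archived response into {storage_dir}/{row_id}/."""
    entry = Path(storage_dir) / str(row_id)
    entry.mkdir(parents=True, exist_ok=True)
    # The body is stored as raw bytes, without any modification.
    (entry / "body").write_bytes(body)
    # One "name: value\r\n" line per header.
    with open(entry / "headers", "wb") as f:
        for name, value in headers:
            f.write(name + b": " + value + b"\r\n")
    return entry

with tempfile.TemporaryDirectory() as storage:
    entry = write_entry(
        storage,
        1,
        [(b"content-type", b"text/plain"), (b"content-length", b"5")],
        b"hello",
    )
    saved_headers = (entry / "headers").read_bytes()
    saved_body = (entry / "body").read_bytes()

print(saved_headers)
print(saved_body)
```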

The FS storage structure can be represented graphically this way:

storage/
|- 1/
|  |- headers
|  |- body
|
|- 2/
|  |- headers
|  |- body
|
|- {id}/
... ... ...
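
Reading an entry back from this layout could look roughly like the sketch below. It is written in Python for illustration and is not the actual Go server; `load_entry` is a hypothetical helper, the table is created with a simplified schema, and picking the newest row when several match is an assumption:

```python
import sqlite3
import tempfile
from pathlib import Path

def load_entry(conn, storage_dir, method, url):
    """Return (code, headers, body) for an archived response, or None if absent."""
    row = conn.execute(
        "SELECT id, code FROM data WHERE method = ? AND url = ? "
        "ORDER BY id DESC LIMIT 1",  # assumption: the newest match wins
        (method, url),
    ).fetchone()
    if row is None:
        return None
    row_id, code = row
    entry = Path(storage_dir) / str(row_id)
    headers = []
    for line in (entry / "headers").read_bytes().split(b"\r\n"):
        if line:
            name, _, value = line.partition(b": ")
            headers.append((name, value))
    body = (entry / "body").read_bytes()
    return code, headers, body

# Build a tiny archive with one entry, then look it up.
with tempfile.TemporaryDirectory() as storage:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE data (id INTEGER PRIMARY KEY, method TEXT, url TEXT, code INTEGER)"
    )
    conn.execute(
        "INSERT INTO data (id, method, url, code) VALUES (?, ?, ?, ?)",
        (1, "GET", "https://dc09.ru:443/path?key=val", 200),
    )
    entry = Path(storage) / "1"
    entry.mkdir()
    (entry / "headers").write_bytes(b"content-type: text/plain\r\n")
    (entry / "body").write_bytes(b"hello")

    found = load_entry(conn, storage, "GET", "https://dc09.ru:443/path?key=val")
    missing = load_entry(conn, storage, "GET", "https://dc09.ru:443/other")

print(found)
print(missing)
```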