{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Getting Started","text":""},{"location":"#what-is-this","title":"What is this","text":"
txtdot is an HTTP proxy that fetches the page at a given URL, extracts only the useful content (text, links, pictures and tables) and returns it as an HTML page with a minimalistic design optimized for reading.
txtdot increases loading speed and reduces the client's bandwidth usage, since no unnecessary code or scripts are transferred. You also won't see any advertisements (unless an ad is a static picture that is hard to detect). There are no trackers either.
"},{"location":"#how-to-use-it","title":"How to use it","text":"txtdot is open-source software, so anyone can host it on their own server. The official instance is txt.dc09.ru; the list of all instances is here.
On the main page, there's a handy form where you can specify a URL and choose an engine and a format for the parsed data. On the /get page, the \"Home\" button returns you to /, and \"Original page\" opens the entered URL in the same window without the txtdot proxy.
The latest docs for API endpoints can be found here. For a JSON API, use /api/parse, which returns an engine result object (see below). For a pure HTML response, use /api/raw-html. Note that both the API and browser endpoints on txt.dc09.ru are rate-limited to 2 requests per second.
This project exists thanks to Mozilla's great Readability.js library. The initial idea was to process HTML with it on the server, so that the client does not need to download and execute heavy JS or use an ad blocker.
Readability performs its work very well in most cases.
If no ?engine= parameter is passed but a specific engine is assigned to the requested domain, for example \"stackoverflow.com\": engines.StackOverflow, txtdot uses that engine to process the URL. Otherwise, the page is parsed with the engine assigned to * (Readability).
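The selection rule above can be sketched roughly as follows. This is only an illustration under assumptions: the engines map and the resolveEngine function are hypothetical names, not part of @txtdot/sdk, and engines are represented by plain strings.

```typescript
// Hypothetical lookup table: domain -> engine name, with * as the fallback.
const engines = new Map<string, string>([
  ['stackoverflow.com', 'StackOverflow'],
  ['*', 'Readability'],
]);

// Hypothetical helper that mirrors the rule described in the text.
function resolveEngine(url: string, requested?: string): string | undefined {
  // An explicitly passed ?engine= value is used as is.
  if (requested) return requested;
  // Otherwise prefer the engine assigned to the domain, then the * entry.
  const host = new URL(url).hostname;
  return engines.get(host) ?? engines.get('*');
}

const forStackOverflow = resolveEngine('https://stackoverflow.com/questions/1');
const forOtherSite = resolveEngine('https://example.org/article');
```

With this table, forStackOverflow is StackOverflow and forOtherSite falls back to Readability.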
Readability is good, but not always, so artegoser wrote the basis of the code keeping in mind that we would extend txtdot with other engines. Back then, engines were functions that took a URL as a parameter and returned an object containing the extracted HTML and plain text, the page title and the language. The object is rendered with an ejs template (or, in /api/parse, just sent as JSON).
After a while this became unwieldy, and we decided to create a monorepo. We created the Engine and Middleware classes that handle the necessary parts. Now you can create such functions for different domains and routes. We also added JSX support to simplify the code of plugins.
"},{"location":"#engines","title":"Engines","text":"Creating an engine is easy.
import { Engine, Route } from \"@txtdot/sdk\";\n\nconst Readability = new Engine(\n \"Readability\", // Name of the engine\n \"Engine for parsing content with Readability\", // Description\n [\"*\"] // Domains that use this engine\n);\n\nReadability.route(\"*path\", async (input, ro: Route<{ path: string }>) => {\n // ...\n\n // If any of the parameters except content is empty, txtdot will try to extract it from the page automatically\n return {\n content: parsed.content,\n title: parsed.title,\n lang: parsed.lang,\n };\n});\n
"},{"location":"#middlewares","title":"Middlewares","text":"Creating a middleware is similar to creating an engine.
import { Middleware } from \"@txtdot/sdk\";\n\nconst Highlight = new Middleware(\n \"Highlight\",\n \"Highlights code with highlight.js only when needed\",\n [\"*\"]\n);\n\nHighlight.use(async (input, ro, out) => {\n if (out.content.indexOf(\"<code\") !== -1)\n return {\n ...out,\n content: <Highlighter content={out.content} />,\n };\n\n return out;\n});\n
"},{"location":"docker/","title":"Docker","text":"If you prefer hosting without Docker, see Self-Hosting instead.
Download docker-compose.yml and txtdot configs, edit them and then start the container:
wget https://raw.githubusercontent.com/TxtDot/txtdot/main/docker-compose.yml\nwget -O .env https://raw.githubusercontent.com/TxtDot/txtdot/main/.env.example\nnano .env\ndocker compose up -d\n
Alternatively, you can configure txtdot with the environment section of the docker-compose config (don't forget to remove .env and volumes).
txtdot can be configured either with environment variables or with the .env file in the working directory, which has higher priority. For a sample config, see .env.example.
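The priority rule can be pictured as a simple merge in which values from the .env file override those from the process environment. The sketch below only illustrates that ordering; mergeConfig is a hypothetical helper, and the HOST and PORT variable names are assumed here for the example rather than taken from txtdot's code.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical helper: later spreads override earlier ones,
// so values read from the .env file win over the process environment.
function mergeConfig(processEnv: Env, dotEnvFile: Env): Env {
  return { ...processEnv, ...dotEnvFile };
}

const config = mergeConfig(
  { HOST: '0.0.0.0', PORT: '8080' }, // assumed process environment
  { PORT: '9000' } // assumed values read from the .env file
);
```

Here config.HOST keeps the value from the environment, while config.PORT takes the value from the .env file.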
Default: 0.0.0.0
Host where the HTTP server should listen for connections. Set it to 127.0.0.1 if your txtdot instance is behind a reverse proxy, 0.0.0.0 otherwise.
Default: 8080
Port where the HTTP server should listen for connections.
"},{"location":"env/#timeout","title":"Timeout","text":"Default: 0
Max response time in milliseconds. If it's reached, the request is aborted. If set to 0
, the timeout is disabled.
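To illustrate the semantics of the setting (a value of 0 disables the limit), here is a minimal sketch. withTimeout is a hypothetical helper, not txtdot's actual implementation, and it only rejects the returned promise; it does not cancel the underlying network request.

```typescript
// Hypothetical helper showing how a timeout of 0 can mean no limit.
function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  // 0 disables the timeout, so the original promise is returned untouched.
  if (timeoutMs === 0) return work;
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${timeoutMs} ms`)),
      timeoutMs
    );
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      }
    );
  });
}

const pending = Promise.resolve('page');
const unlimited = withTimeout(pending, 0); // same promise, no timer created
const limited = withTimeout(pending, 1000); // wrapped with a 1000 ms limit
```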
Default: false
Set it to true only if your txtdot instance runs behind a reverse proxy. Needed for processing X-Forwarded headers.
Default: true
Whether to allow proxying images, video, audio and everything else through your txtdot instance.
"},{"location":"env/#img_compress","title":"IMG_COMPRESS","text":"Default: true
Whether to compress images through your txtdot instance.
"},{"location":"env/#documentation","title":"Documentation","text":""},{"location":"env/#swagger","title":"SWAGGER","text":"Default: false
Whether to add the /doc route for Swagger API docs.
SearXNG base URL. If set, txtdot uses it for searching and adds a search form to the page with the /search route.
"},{"location":"env/#webder_url","title":"WEBDER_URL","text":"Webder base URL. If set, txtdot uses it for rendering web pages.
"},{"location":"reverse/","title":"Reverse Proxy","text":""},{"location":"reverse/#nginx","title":"Nginx","text":"Basically, you just need to set the domain, TLS certificates, and the Host and X-Forwarded headers (so that txtdot knows the hostname), and pass all requests to txtdot.
server {\n listen 443 ssl http2;\n listen [::]:443 ssl http2;\n\n # Replace the domain\n server_name txt.dc09.ru;\n\n ssl_certificate ...pem;\n ssl_certificate_key ...key;\n # More options here:\n # https://ssl-config.mozilla.org/#server=nginx&config=modern\n\n location / {\n # Replace 8080 port if needed\n proxy_pass http://127.0.0.1:8080;\n\n proxy_set_header Host $host;\n proxy_set_header X-Real-IP $remote_addr;\n proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;\n proxy_set_header X-Forwarded-Proto $scheme;\n }\n}\n
On the official instance, TLS is configured in the main nginx config, so we omit these options below.
Nginx serves static files faster than Node.js, so let's configure it to do that:
server {\n ...\n\n location /static/ {\n alias /home/txtdot/src/dist/static/;\n }\n}\n
What about rate-limiting? We don't want attackers to overload our proxy.
The config below limits each client IP to 2 requests per second, allows up to 4 requests to wait in the queue, and sets the size of the shared memory zone to 10 megabytes. See the Nginx blog post for a detailed explanation.
limit_req_zone $binary_remote_addr zone=txtdotapi:10m rate=2r/s;\n\nserver {\n ...\n location / {\n limit_req zone=txtdotapi burst=4;\n ...\n }\n ...\n}\n
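To get a feel for what rate=2r/s with burst=4 means, the behaviour can be modelled in a few lines. This is a simplified sketch of the leaky-bucket idea, not nginx's actual code; makeLimiter and Verdict are illustrative names, and time is passed in explicitly so the model stays deterministic.

```typescript
interface Verdict {
  accepted: boolean;
  delayMs: number; // how long an accepted request waits in the queue
}

// Simplified model of limit_req without the nodelay option.
function makeLimiter(ratePerSec: number, burst: number) {
  let excess = 0; // requests currently ahead of the allowed rate
  let lastMs: number | null = null; // arrival time of the last accepted request

  return (nowMs: number): Verdict => {
    if (lastMs === null) {
      lastMs = nowMs;
      return { accepted: true, delayMs: 0 };
    }
    // The backlog drains at the configured rate; each arrival adds one.
    const drained = (ratePerSec * (nowMs - lastMs)) / 1000;
    const next = Math.max(excess - drained + 1, 0);
    if (next > burst) {
      return { accepted: false, delayMs: 0 }; // nginx answers 503 by default
    }
    excess = next;
    lastMs = nowMs;
    return { accepted: true, delayMs: (next * 1000) / ratePerSec };
  };
}

// Ten requests from one client arriving at the same instant.
const check = makeLimiter(2, 4);
const verdicts = Array.from({ length: 10 }, () => check(0));
const accepted = verdicts.filter((v) => v.accepted);
const delays = accepted.map((v) => v.delayMs);
```

In this model, the first request passes immediately, the next four are queued with delays of 500, 1000, 1500 and 2000 ms, and the remaining five are rejected.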
Let's put it all together. Here's our sample config:
limit_req_zone $binary_remote_addr zone=txtdotapi:10m rate=2r/s;\n\nserver {\n listen 443 ssl http2;\n listen [::]:443 ssl http2;\n\n server_name txt.dc09.ru;\n\n location / {\n limit_req zone=txtdotapi burst=4;\n proxy_pass http://127.0.0.1:8080;\n\n proxy_set_header Host $host;\n proxy_set_header X-Real-IP $remote_addr;\n proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;\n proxy_set_header X-Forwarded-Proto $scheme;\n }\n\n location /static/ {\n alias /home/txtdot/src/dist/static/;\n }\n}\n
"},{"location":"reverse/#apache","title":"Apache","text":"Coming soon. If you are familiar with Apache httpd and want to help, write a config here (a small explanation like the one above would also be great) and open a pull request.
"},{"location":"selfhost/","title":"Self-Hosting","text":"If you prefer hosting with Docker, see Docker instead.
"},{"location":"selfhost/#install-nodejs-and-npm","title":"Install nodejs and npm","text":"On Debian and Ubuntu, the packages in the repository are quite old, so consider installing Node.js from NodeSource. The minimum required version is Node.js 18.
Other distros:
# CentOS\nsudo yum install nodejs\n# Arch\nsudo pacman -S nodejs npm\n# Alpine\ndoas apk add nodejs npm\n
"},{"location":"selfhost/#create-a-user-for-txtdot","title":"Create a user for txtdot","text":"Almost all distros except Alpine:
sudo useradd -r -m -s /sbin/nologin -U txtdot\nsudo -u txtdot bash\n
Alpine Linux with busybox and doas:
doas addgroup -S txtdot\ndoas adduser -h /home/txtdot -s /sbin/nologin -G txtdot -S -D txtdot\ndoas -u txtdot bash\n
"},{"location":"selfhost/#build-config-and-launch","title":"Build, config and launch","text":"Clone the git repository, cd into it:
git clone https://github.com/txtdot/txtdot.git src\ncd src\n
Copy and modify the sample config file (see the Configuring section):
cp .env.example .env\nnano .env\n
Install packages, compile TS:
npm install\nnpm run build\n
Manually start the server to check if it works (Ctrl+C to exit):
npm run start\n
Log out from the txtdot account:
exit\n
"},{"location":"selfhost/#add-txtdot-to-autostart","title":"Add txtdot to autostart","text":"Either using a systemd unit file:
wget https://raw.githubusercontent.com/TxtDot/txtdot/main/config/txtdot.service\nsudo chown root:root txtdot.service\nsudo chmod 644 txtdot.service\nsudo mv txtdot.service /etc/systemd/system/\nsudo systemctl daemon-reload\nsudo systemctl enable txtdot\nsudo systemctl start txtdot\n
Or using an OpenRC script:
wget -O txtdot https://raw.githubusercontent.com/TxtDot/txtdot/main/config/txtdot.init\ndoas chown root:root txtdot\ndoas chmod 755 txtdot\ndoas mv txtdot /etc/init.d/\ndoas rc-update add txtdot\ndoas rc-service txtdot start\n
Or using crontab:
sudo crontab -u txtdot -e\n# The command will open an editor\n# Add this line to the end of the file:\n@reboot sleep 10 && cd /home/txtdot/src && npm run start\n# Save the file and exit\n
"}]}