
# Getting Started

## What is this

_txtdot_ is a proxy that requests the page by the given URL,
extracts only useful data including text, links, pictures and tables,
and returns it as an HTML page with a minimalistic design
optimized for text reading.

_txtdot_ increases the loading speed and reduces the client's bandwidth usage
since no unnecessary code or scripts are transferred.
Also, you won't see any advertisements (unless an ad is a static picture that is hard to detect as such).
There are no trackers either.

## How to use it

_txtdot_ is open source software, so anyone can host it on their own server.
The official instance is [txt.dc09.ru](https://txt.dc09.ru);
the list of all instances is [here](https://github.com/txtdot/instances).

On the main page, there's a handy form where you can
specify a URL and choose an engine and a format for the parsed data.
On the `/get` page, the "Home" button returns you to `/`, and
"Original page" opens the entered URL in the same window, bypassing the txtdot proxy.

The latest docs for the API endpoints can be found [here](https://txt.dc09.ru/doc).
For a handy JSON API, use `/api/parse`, which returns an engine result object (see below).
For a pure HTML response, use `/api/raw-html`.
Note that both the API and browser endpoints on txt.dc09.ru
are rate-limited to 2 requests per second.
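
As a rough illustration of calling these endpoints from a script, here is a minimal Python sketch. It assumes that the page address is passed in a `url` query parameter and that `readability` is a valid value for `engine`; neither is confirmed here, so check the API docs linked above. The `Throttle` helper is a hypothetical client-side guard that keeps requests at least 0.5 seconds apart, matching the limit of 2 requests per second.

```python
import json
import time
import urllib.parse
import urllib.request

INSTANCE = "https://txt.dc09.ru"  # official instance named above
MIN_INTERVAL = 0.5  # seconds between requests, i.e. at most 2 per second


def build_api_url(page_url, endpoint="/api/parse", engine=None):
    """Build a request URL for an API endpoint.

    Assumption: the page address goes in a `url` query parameter.
    """
    params = {"url": page_url}
    if engine is not None:
        params["engine"] = engine
    return INSTANCE + endpoint + "?" + urllib.parse.urlencode(params)


class Throttle:
    """Spaces calls to wait() at least `interval` seconds apart."""

    def __init__(self, interval=MIN_INTERVAL, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self.clock = clock
        self.sleep = sleep
        self.last = None  # time of the previous call, if any

    def wait(self):
        now = self.clock()
        if self.last is not None:
            remaining = self.interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now


def fetch_parsed(page_url, throttle, engine=None):
    """Request /api/parse and decode the JSON body (performs a network call)."""
    throttle.wait()
    with urllib.request.urlopen(build_api_url(page_url, engine=engine)) as resp:
        return json.load(resp)
```

Calling `fetch_parsed("https://example.com", Throttle())` in a loop with one shared `Throttle` should stay within the limit; pass `endpoint="/api/raw-html"` to `build_api_url` when you need the HTML-only response.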

## How it works

This project exists thanks to Mozilla's great
[Readability.js](https://github.com/mozilla/readability) library.
The initial idea was to process HTML with it on the server
so that the client does not need to download and execute heavy JS
or use an ad blocker.

Readability does its job very well in most cases,
but not always. For example, try any Stack Overflow page or Google search results.

So [artegoser](https://github.com/artegoser) wrote the basis of the code,
keeping in mind that we'll extend txtdot with other _engines_.
For now, engines are functions that take a URL as a parameter
and return an object containing the extracted HTML and plain text, the page title and language.
The object is rendered with an EJS template (or, in `/api/parse`, just sent as JSON).
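
To make the shape of an engine concrete, here is a small Python sketch of the idea. It is not txtdot's actual code: the field names (`content`, `text_content`, `title`, `lang`) and the `demo_engine` function are hypothetical, chosen only to mirror the four pieces of data listed above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class EngineResult:
    # Field names are illustrative assumptions, not the project's real keys.
    content: str       # extracted HTML
    text_content: str  # plain text
    title: str         # page title
    lang: str          # page language


def demo_engine(url: str) -> EngineResult:
    """A stand-in engine: takes a URL, returns a result object.

    A real engine would download and parse the page; this one returns
    fixed data so the sketch runs offline.
    """
    return EngineResult(
        content=f'<p>Extracted from <a href="{url}">{url}</a></p>',
        text_content=f"Extracted from {url}",
        title="Demo page",
        lang="en",
    )


def to_json(result: EngineResult) -> str:
    """Serialize the result the way a JSON endpoint might send it."""
    return json.dumps(asdict(result))
```

Because every engine returns the same kind of object, the rendering step (template or JSON) does not need to know which engine produced it.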

If the `?engine=` parameter isn't passed but txtdot finds
that a specific engine is assigned to the requested domain,
for example, `"stackoverflow.com": stackoverflow`,
it uses that engine to process the URL.
Otherwise, the page is parsed with the engine assigned to `*` (that is, Readability).
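
The selection order described above can be sketched in Python as follows. This is an illustration under assumptions, not the project's implementation: the engine functions are stubs, the names accepted by `engine_name` are hypothetical, and the domain is matched exactly (so `www.stackoverflow.com` would fall through to `*` in this sketch).

```python
from urllib.parse import urlparse


def readability(url):
    # Stub standing in for the Readability-based engine.
    return f"readability parsed {url}"


def stackoverflow(url):
    # Stub standing in for a site-specific engine.
    return f"stackoverflow parsed {url}"


# Engines addressable by name via ?engine= (names are assumptions).
ENGINES = {"readability": readability, "stackoverflow": stackoverflow}

# Domain-to-engine assignments; "*" is the fallback.
ASSIGNED = {"stackoverflow.com": stackoverflow, "*": readability}


def pick_engine(url, engine_name=None):
    """Choose an engine: explicit parameter, then domain, then "*"."""
    if engine_name is not None:
        return ENGINES[engine_name]
    host = urlparse(url).hostname
    return ASSIGNED.get(host, ASSIGNED["*"])
```

For instance, `pick_engine("https://stackoverflow.com/questions/1")` returns the `stackoverflow` stub, while any unlisted domain gets `readability`.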