---
title: "Using gendertext"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using gendertext}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(gendertext)
```

## Introduction

The **gendertext** package provides simple, transparent tools for
identifying gendered language in text and suggesting gender-neutral
alternatives. It is designed for researchers, policy analysts, editors,
and practitioners who want to assess and improve inclusive language in
documents.

The package follows a dictionary-based approach. All results come from a
built-in corpus of gendered terms paired with suggested neutral
replacements, so every match can be traced back to a specific dictionary
entry.

## The built-in dictionary

The package ships with `gender_dictionary`, a curated dictionary of 208
gendered words and phrases. It covers occupational titles, pronouns,
forms of address, family terms, and common idioms, and is informed by the
United Nations guidelines for gender-inclusive language and the European
Parliament guidance on gender-neutral language.

```{r}
data(gender_dictionary)
head(gender_dictionary, 10)
nrow(gender_dictionary)
```

## Scoring a text

The simplest way to use gendertext is to score a character string. The
result reports how many tokens the text contains, how many of them are
gendered according to the dictionary, and the corresponding percentages.

```{r}
gender_score(
  text = "Ladies and gentlemen, the chairman said he will call the policeman."
)
```

The reported neutral percentage is a proxy: it is the share of tokens not
matched by any dictionary entry. Multi-word phrases are matched before
single words, and each piece of text is counted at most once, so the
phrase "ladies and gentlemen" is counted as one match spanning three
tokens, never as "ladies" plus "gentlemen" on top of the phrase.
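
To make the "phrases first, counted once" rule concrete, here is a minimal
base R sketch. It is not the package's implementation: the helper
`count_matches_sketch()` and the small term list are hypothetical, and the
sketch assumes that terms contain no regular expression metacharacters.

```r
# Hypothetical helper, not part of gendertext: count dictionary matches,
# trying longer (multi-word) entries first and masking what was matched.
count_matches_sketch <- function(text, terms) {
  remaining <- tolower(text)
  # Longest entries first, so phrases win over the words they contain
  terms <- terms[order(nchar(terms), decreasing = TRUE)]
  counts <- setNames(integer(length(terms)), terms)
  for (term in terms) {
    pattern <- paste0("\\b", term, "\\b")
    hits <- gregexpr(pattern, remaining, perl = TRUE)[[1]]
    if (hits[1] != -1) {
      counts[term] <- length(hits)
      # Blank out matched spans so they cannot be counted again
      remaining <- gsub(pattern, " ", remaining, perl = TRUE)
    }
  }
  counts[counts > 0]
}

# Illustrative term list; the real dictionary may differ
counts_found <- count_matches_sketch(
  "Ladies and gentlemen, the chairman said he will call the policeman.",
  c("ladies and gentlemen", "ladies", "gentlemen",
    "chairman", "he", "policeman")
)
counts_found
```

Because the phrase is masked as soon as it is matched, the shorter entries
"ladies" and "gentlemen" find nothing left to match in this sentence.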

If you only need the number of dictionary matches, use
`unit = "matches"`:

```{r}
gender_score(
  text = "The chairman and the spokesman left.",
  unit = "matches"
)
```
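
The difference between counting tokens and counting matches can be worked
through by hand. The sketch below is base R arithmetic only and makes two
assumptions that may not hold for the package: that tokens are runs of
letters, and that exactly the four terms listed are matched in the sentence
scored earlier.

```r
sentence <- "Ladies and gentlemen, the chairman said he will call the policeman."

# Assumed matches (illustrative; the real dictionary decides this)
matched <- c("ladies and gentlemen", "chairman", "he", "policeman")

# Matches: one per dictionary hit, regardless of how many words it spans
n_matches <- length(matched)

# Tokens: each hit contributes as many tokens as it has words
n_gendered_tokens <- sum(lengths(strsplit(matched, " ", fixed = TRUE)))

# Assumed tokenisation: split on anything that is not a letter
total_tokens <- length(strsplit(sentence, "[^[:alpha:]]+")[[1]])

# Neutral share as described above: tokens not covered by any match
neutral_pct <- 100 * (total_tokens - n_gendered_tokens) / total_tokens

c(matches = n_matches, gendered_tokens = n_gendered_tokens,
  total_tokens = total_tokens)
round(neutral_pct, 1)
```

Under these assumptions the sentence has four matches but six gendered
tokens, because the opening phrase is one match that spans three tokens.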

## Listing suggestions

`gender_suggestions()` returns the gendered terms found in a text
together with the suggested neutral replacement for each one.

```{r}
gender_suggestions(
  text = "Our chairman said he will email the mailman and the stewardess."
)
```

## Rewriting a text

`gender_replace()` applies the dictionary to the original text and
returns a rewritten version. Capitalisation follows the matched text.

```{r}
gender_replace(
  text = "The Chairman called the policeman and the FIREMAN."
)
```

Replacement is plain substitution: the function does not adjust the
surrounding grammar, so a replacement such as "they" for "he" may need a
manual touch afterwards. Treat the output as a draft.
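
One way to picture "capitalisation follows the matched text" is the
following base R sketch. The helper `match_case_sketch()` is hypothetical
and is not taken from the package, and the replacement words are
illustrative rather than quoted from the built-in dictionary.

```r
# Hypothetical helper, not part of gendertext: copy the casing pattern
# of the matched text onto the replacement.
match_case_sketch <- function(matched, replacement) {
  first <- substr(matched, 1, 1)
  if (matched == toupper(matched)) {
    # All capitals in the source: capitalise the whole replacement
    toupper(replacement)
  } else if (first == toupper(first)) {
    # Leading capital in the source: capitalise the first letter only
    paste0(toupper(substr(replacement, 1, 1)), substring(replacement, 2))
  } else {
    # Lower case in the source: leave the replacement unchanged
    replacement
  }
}

match_case_sketch("FIREMAN", "firefighter")
match_case_sketch("Chairman", "chair")
match_case_sketch("policeman", "police officer")
```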

## Using your own dictionary

Every function accepts a custom dictionary through the `dictionary`
argument: a data frame with character columns `gendered` and `neutral`.
This makes it easy to extend, restrict, or fully replace the built-in
corpus.

```{r}
# stringsAsFactors = FALSE keeps both columns as character on R < 4.0.0
my_dict <- data.frame(
  gendered = c("dude", "bro"),
  neutral = c("person", "friend"),
  stringsAsFactors = FALSE
)
gender_suggestions(text = "Hey dude, thanks bro!", dictionary = my_dict)
```
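
Extending or restricting a dictionary is ordinary data frame work. The
sketch below uses a small stand-in, `base_dict`, instead of
`gender_dictionary` so that it does not depend on the exact columns or
entries of the built-in data; all entries shown are illustrative.

```r
# Stand-in for the built-in dictionary; entries are illustrative only
base_dict <- data.frame(
  gendered = c("chairman", "policeman", "he"),
  neutral = c("chair", "police officer", "they"),
  stringsAsFactors = FALSE
)
my_dict <- data.frame(
  gendered = c("dude", "bro", "Chairman"),
  neutral = c("person", "friend", "chairperson"),
  stringsAsFactors = FALSE
)

# Extend: append custom rows, keeping the first entry for a repeated term
extended <- rbind(base_dict, my_dict)
extended <- extended[!duplicated(tolower(extended$gendered)), ]

# Restrict: drop entries you do not want flagged, here a pronoun
restricted <- base_dict[!base_dict$gendered %in% "he", ]

# Check the documented requirement: character columns gendered and neutral
stopifnot(
  is.character(extended$gendered),
  is.character(extended$neutral)
)
extended
```

The resulting data frame has the two required columns and could then be
passed through the `dictionary` argument.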

## Working with files

The functions also accept a `path` argument. Plain text files are read
with base R, so no additional packages are required.

```{r}
txt <- system.file("extdata", "test.txt", package = "gendertext")
gender_score(path = txt)
head(gender_suggestions(path = txt))
```
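
If you prefer to read a plain text file yourself and pass the result
through the `text` argument, base R is enough. This is a sketch of one
reasonable approach, not necessarily how the package reads files
internally; the temporary file and its contents are made up for the
example.

```r
# Create a small example file (contents are illustrative)
tmp <- tempfile(fileext = ".txt")
writeLines(
  c("The chairman opened the meeting.", "He thanked the spokesman."),
  tmp
)

# Read all lines and collapse them into a single character string
lines <- readLines(tmp, encoding = "UTF-8", warn = FALSE)
text <- paste(lines, collapse = "\n")

# Clean up the temporary file
unlink(tmp)

text
```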

Other document formats, such as PDF and Word, are supported through the
optional readtext package. Install it with
`install.packages("readtext")`.

```{r, eval = requireNamespace("readtext", quietly = TRUE)}
# Named pdf_path so the object does not mask grDevices::pdf()
pdf_path <- system.file("extdata", "test.pdf", package = "gendertext")
gender_score(path = pdf_path)
```

Please note: PDF analysis depends on the presence of extractable text.
Scanned or image-only documents may not yield readable content.
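
A quick guard can catch documents that yielded no usable text before they
are scored. The helper below is hypothetical and is not part of gendertext
or readtext; it only checks whether any letters were extracted.

```r
# Hypothetical helper: TRUE if any element contains at least one letter
has_extractable_text_sketch <- function(x) {
  any(grepl("[[:alpha:]]", x))
}

has_extractable_text_sketch(c("", " \n "))          # FALSE
has_extractable_text_sketch("The chairman spoke.")  # TRUE
```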

## Limitations

* Results depend on dictionary coverage; terms missing from the
  dictionary are not detected.
* The package does not attempt semantic interpretation. Words such as
  "her" are flagged even when they refer to a specific person whose
  pronouns are known and correct.
* Gender neutrality is estimated through dictionary matching, not
  linguistic inference, and the neutral share is a proxy measure.

## Conclusion

gendertext offers a lightweight and reproducible way to examine gendered
language in text. Its transparent, dictionary-based design makes it
suitable for research, policy review, editorial work, and exploratory
analysis.