Parsing and normalising author names

What parse_names() is

parse_names() is an optional, standalone utility for cleaning author name strings. It does two things:

  1. Reorders names to "First Last" (or another style selected with the format argument).
  2. Breaks each name into components (first, last, particle, suffix), returned as the "parts" attribute.

It is not called by any reader or network builder. bibnets matches entity labels verbatim; you opt in to normalisation by calling this function yourself.

parse_names(c("Saqr, Mohammed", "Lopez-Pernas, Sonsoles"))
#> [1] "Mohammed Saqr"         "Sonsoles Lopez-Pernas"
#> attr(,"parts")
#>                 original    first         last particle suffix   type
#> 1         Saqr, Mohammed Mohammed         Saqr     <NA>   <NA> person
#> 2 Lopez-Pernas, Sonsoles Sonsoles Lopez-Pernas     <NA>   <NA> person
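
Why opting in can matter: under verbatim matching, two spellings of one author are two distinct labels. The base-R sketch below uses made-up labels and does not call bibnets; it only counts distinct strings.

```r
# Hypothetical author labels as they might appear across two exports
labels <- c("Saqr, Mohammed", "SAQR M", "Saqr, Mohammed")

# Verbatim comparison: the two spellings count as two different entities
n_distinct <- length(unique(labels))
n_distinct
#> [1] 2
```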

The three name conventions

Bibliometric exports use three incompatible conventions. parse_names() recognises all three; the rule is decided per string.

Input              Convention                          Detected by
"Saqr, Mohammed"   Last, First                         the comma
"WANG Y"           SURNAME Initials (Scopus/bibnets)   trailing uppercase 1–3 letter token
"Mohammed Saqr"    First Last                          default for comma-less, non-initial

parse_names(c("Saqr, Mohammed", "WANG Y", "Mohammed Saqr"))
#> [1] "Mohammed Saqr" "Y WANG"        "Mohammed Saqr"
#> attr(,"parts")
#>         original    first last particle suffix   type
#> 1 Saqr, Mohammed Mohammed Saqr     <NA>   <NA> person
#> 2         WANG Y        Y WANG     <NA>   <NA> person
#> 3  Mohammed Saqr Mohammed Saqr     <NA>   <NA> person
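
The three detection rules can be approximated in a few lines of base R. This is only a sketch of the rules as stated in the table, not the implementation used by parse_names(); guess_convention() and its return labels are hypothetical names, and the regular expression is an assumption about what "trailing uppercase 1–3 letter token" means.

```r
# Hypothetical helper: classify each string by the three rules in the table.
# NOT bibnets' own code; an illustration only.
guess_convention <- function(x) {
  has_comma    <- grepl(",", x, fixed = TRUE)
  # whitespace, then 1 to 3 uppercase letters, then end of string
  has_initials <- grepl("\\s[[:upper:]]{1,3}$", x)
  out <- ifelse(has_comma, "last_first",
         ifelse(has_initials, "surname_initials", "first_last"))
  out[is.na(x)] <- NA_character_   # leave missing values missing
  out
}

conv <- guess_convention(c("Saqr, Mohammed", "WANG Y", "Mohammed Saqr"))
conv
# values: "last_first", "surname_initials", "first_last"
```

The comma test runs first, which mirrors the statement that a comma always means Last, First.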

A comma always means Last, First. For comma-less strings the surname_first argument controls interpretation:

parse_names("Wang Yong", surname_first = "yes")   # force surname-first
#> [1] "Yong Wang"
#> attr(,"parts")
#>    original first last particle suffix   type
#> 1 Wang Yong  Yong Wang     <NA>   <NA> person
parse_names("WANG Y",    surname_first = "no")    # force given-first
#> [1] "WANG Y"
#> attr(,"parts")
#>   original first last particle suffix   type
#> 1   WANG Y  WANG    Y     <NA>   <NA> person

Particles and suffixes are handled, and detection is case-insensitive so it works on bibnets’ upper-cased labels:

parse_names(c("van der Berg, Jan", "Smith, John, Jr.",
              "DE LA CRUZ, ANA", "VAN DER BERG J"))
#> [1] "Jan van der Berg" "John Smith Jr"    "ANA DE LA CRUZ"   "J VAN DER BERG"  
#> attr(,"parts")
#>            original first  last particle suffix   type
#> 1 van der Berg, Jan   Jan  Berg  van der   <NA> person
#> 2  Smith, John, Jr.  John Smith     <NA>     Jr person
#> 3   DE LA CRUZ, ANA   ANA  CRUZ    DE LA   <NA> person
#> 4    VAN DER BERG J     J  BERG  VAN DER   <NA> person

Group / corporate authors, NA, and empty strings are left untouched:

parse_names(c("WHO Collaborating Group", NA, ""))
#> [1] "WHO Collaborating Group" NA                       
#> [3] ""                       
#> attr(,"parts")
#>                  original first last particle suffix         type
#> 1 WHO Collaborating Group  <NA> <NA>     <NA>   <NA> organization
#> 2                    <NA>  <NA> <NA>     <NA>   <NA>      missing
#> 3                          <NA> <NA>     <NA>   <NA>        empty

Output styles: format

nm <- c("Saqr, Mohammed", "van der Berg, Jan", "Garcia Marquez, Gabriel Jose")
data.frame(
  first_last    = parse_names(nm),
  last_initials = parse_names(nm, format = "last_initials"),
  last          = parse_names(nm, format = "last")
)
#>                    first_last       last_initials           last
#> 1               Mohammed Saqr             Saqr M.           Saqr
#> 2            Jan van der Berg     van der Berg J.   van der Berg
#> 3 Gabriel Jose Garcia Marquez Garcia Marquez G.J. Garcia Marquez

The "parts" attribute

The parsed components ride along on every call, independent of format:

x <- parse_names(c("van der Berg, Jan", "Smith, John, Jr."))
attr(x, "parts")
#>            original first  last particle suffix   type
#> 1 van der Berg, Jan   Jan  Berg  van der   <NA> person
#> 2  Smith, John, Jr.  John Smith     <NA>     Jr person

type is one of "person", "organization", "empty", "missing".
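
Because "parts" is an ordinary data frame, it can be used to assemble label styles that format does not offer. The sketch below builds a "Particle Last, First" label in base R. To stay runnable without bibnets it uses a hand-built data frame whose values are copied from the output above; join_present() is a hypothetical helper, not part of the package.

```r
# Hand-built stand-in for attr(x, "parts"); values copied from the output above
parts <- data.frame(
  original = c("van der Berg, Jan", "Smith, John, Jr."),
  first    = c("Jan", "John"),
  last     = c("Berg", "Smith"),
  particle = c("van der", NA),
  suffix   = c(NA, "Jr"),
  type     = c("person", "person"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: paste the non-missing pieces together
join_present <- function(...) {
  pieces <- c(...)
  paste(pieces[!is.na(pieces)], collapse = " ")
}

# Only rows of type "person" have name components to work with
persons <- parts[parts$type == "person", ]
surname <- mapply(join_present, persons$particle, persons$last,
                  USE.NAMES = FALSE)
label   <- paste(surname, persons$first, sep = ", ")
label
# values: "van der Berg, Jan", "Smith, John"
```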

Input shape: vector, not data frame

parse_names() works on one flat character vector. It is not a data-frame function.

bibnets readers store authors as a list-column: each paper has a variable number of authors, so the cell holds a vector, not a single string.

papers <- data.frame(id = c("P1", "P2", "P3"), stringsAsFactors = FALSE)
papers$authors <- list(
  c("Saqr, Mohammed", "Lopez, Ana"),
  c("SAQR M",         "Lopez, Ana"),
  c("Saqr, Mohammed", "Chen, Wei"))
papers$authors
#> [[1]]
#> [1] "Saqr, Mohammed" "Lopez, Ana"    
#> 
#> [[2]]
#> [1] "SAQR M"     "Lopez, Ana"
#> 
#> [[3]]
#> [1] "Saqr, Mohammed" "Chen, Wei"

Map the function over the list-column with lapply():

papers$authors <- lapply(papers$authors, parse_names,
                          format = "last_initials")
papers$authors
#> [[1]]
#> [1] "Saqr M."  "Lopez A."
#> attr(,"parts")
#>         original    first  last particle suffix   type
#> 1 Saqr, Mohammed Mohammed  Saqr     <NA>   <NA> person
#> 2     Lopez, Ana      Ana Lopez     <NA>   <NA> person
#> 
#> [[2]]
#> [1] "SAQR M."  "Lopez A."
#> attr(,"parts")
#>     original first  last particle suffix   type
#> 1     SAQR M     M  SAQR     <NA>   <NA> person
#> 2 Lopez, Ana   Ana Lopez     <NA>   <NA> person
#> 
#> [[3]]
#> [1] "Saqr M." "Chen W."
#> attr(,"parts")
#>         original    first last particle suffix   type
#> 1 Saqr, Mohammed Mohammed Saqr     <NA>   <NA> person
#> 2      Chen, Wei      Wei Chen     <NA>   <NA> person
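
An alternative to lapply() is to flatten the list-column, normalise once, and split the result back by paper. Since the convention is decided per string, the result should be the same, and the list elements carry no "parts" attribute. The sketch below is base R only: normalise() is a stand-in for parse_names() (it merely upper-cases) so the example runs without bibnets, and the data are hypothetical.

```r
authors <- list(
  c("Saqr, Mohammed", "Lopez, Ana"),
  character(0),                        # a paper with no recorded authors
  c("Saqr, Mohammed", "Chen, Wei"))

# Stand-in for parse_names(); swap in the real function when bibnets is loaded
normalise <- function(x) toupper(x)

# One flat vector plus a factor recording which paper each name came from
flat  <- unlist(authors, use.names = FALSE)
paper <- factor(rep(seq_along(authors), lengths(authors)),
                levels = seq_along(authors))

# as.vector() drops any attributes; split() restores one element per paper
relisted <- unname(split(as.vector(normalise(flat)), paper))
lengths(relisted)
#> [1] 2 0 2
```

Fixing the factor levels to seq_along(authors) keeps papers with zero authors in place as empty character vectors.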

For a flat character column (or a network’s from / to), call parse_names() directly; no lapply() is needed:

parse_names(c("WANG Y", "AYALA-ROMERO JA"))
#> [1] "Y WANG"          "JA AYALA-ROMERO"
#> attr(,"parts")
#>          original first         last particle suffix   type
#> 1          WANG Y     Y         WANG     <NA>   <NA> person
#> 2 AYALA-ROMERO JA   J A AYALA-ROMERO     <NA>   <NA> person

Applying to an existing edgelist (and its hazards)

The network object (net below, an existing author collaboration network) is a data frame (from, to, weight, count) with an extra bibnets_network class for printing:

class(net)
#> [1] "bibnets_network" "data.frame"
is.data.frame(net)
#> [1] TRUE

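You can relabel from / to directly, but parse_names() is graph-blind: it rewrites the labels and nothing else. Edges, pairing, weight and count are preserved exactly as they were, so rows are not merged or re-aggregated if two labels become identical after normalisation: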

net$from <- as.vector(parse_names(net$from, format = "last"))
net$to   <- as.vector(parse_names(net$to,   format = "last"))
net
#> # bibnets network: author_collaboration | 3 nodes · 2 edges | counting: full 
#>    from   to    weight  count
#> 1  LOPEZ  SAQR       2      2
#> 2  CHEN   SAQR       1      1

Use as.vector() when assigning back so the "parts" attribute is not carried on the column.
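
Relabelling changes labels only, so a coarse style such as format = "last" can leave duplicate (from, to) rows or self-loops when distinct labels collapse into one. The base-R sketch below shows one way to tidy a plain edgelist afterwards. The data are hypothetical, and summing weight and count is an assumption: whether that is appropriate depends on the counting method used to build the network.

```r
# Hypothetical edgelist after relabelling: two different authors named
# Lopez now share one label, and one edge has become a self-loop
edges <- data.frame(
  from   = c("LOPEZ", "LOPEZ", "CHEN", "LOPEZ"),
  to     = c("SAQR",  "SAQR",  "SAQR", "LOPEZ"),
  weight = c(2, 1, 1, 1),
  count  = c(2, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Drop self-loops created by the collapse
edges <- edges[edges$from != edges$to, ]

# Sum weight and count over identical (from, to) pairs
merged <- aggregate(cbind(weight, count) ~ from + to, data = edges, FUN = sum)
nrow(merged)
#> [1] 2
```

aggregate() returns a plain data frame, so the bibnets_network class is not carried over. The sketch also treats (from, to) as ordered; pairs stored in both orders would need to be sorted within each row first.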

Limitations

Summary